Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Use an Existing Hadoop AMI (https://cwiki.apache.org/confluence/display/MAHOUT/Use+an+Existing+Hadoop+AMI)
Edited by Grant Ingersoll
----
The following process was developed for launching Hadoop clusters on EC2 in order to benchmark Mahout's clustering algorithms against a large document set (see MAHOUT-588). Specifically, we used the ASF mail archives, which have been parsed, converted to Hadoop's block-compressed SequenceFile format, and saved to a public S3 folder: s3://asf-mail-archives/mahout-0.4/sequence-files. Overall, there are 6,094,444 key-value pairs in 283 files taking around 5.7GB of disk.

You can also use Amazon's Elastic MapReduce; see [Mahout on Elastic MapReduce|https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce]. However, using EC2 directly is slightly less expensive and provides greater visibility into the state of running jobs via the JobTracker Web UI.

You can launch the EC2 cluster from your development machine; the following instructions were generated on an Ubuntu workstation. We assume that you have successfully completed the [EC2 Getting Started Guide|http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/].

Note: this work was supported in part by the Amazon Web Services Apache Projects Testing Program.

h2. Launch Hadoop Cluster

h4. Gather Amazon EC2 keys / security credentials

You will need the following:
* AWS Account ID
* Access Key ID
* Secret Access Key
* X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)
* EC2 key-pair (ssh public and private keys) for the US-East region. Make sure the private key's file permissions are "-rw-------" (e.g. chmod 600 gsg-keypair.pem). You can create a key-pair for the US-East region using the Amazon console.

If you are confused about any of these terms, please see [Understanding Access Credentials for AWS/EC2|http://alestic.com/2009/11/ec2-credentials].
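If you prefer the command line, the key-pair can also be created with the ec2-api-tools instead of the Amazon console. The following is a hedged sketch, not part of the original walkthrough; it assumes the ec2-api-tools are installed (see below) and reuses the example key-pair name gsg-keypair:

```shell
# Sketch: create a US-East key-pair from the command line.
# The output of ec2-add-keypair includes a one-line KEYPAIR summary followed
# by the private key material; keep only the -----BEGIN ... END----- section.
ec2-add-keypair --region us-east-1 gsg-keypair > gsg-keypair.pem

# The private key must not be readable by anyone else:
chmod 600 gsg-keypair.pem
ls -l gsg-keypair.pem   # permissions should read -rw-------
```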
You should also export the EC2_PRIVATE_KEY and EC2_CERT environment variables to point to your AWS certificate and private key files, for example:
{noformat}
export EC2_PRIVATE_KEY=$DEV/aws/pk-aws.pem
export EC2_CERT=$DEV/aws/cert-aws.pem
{noformat}
These are used by the ec2-api-tools commands to interact with Amazon Web Services.

h4. Install and Configure the Amazon EC2 API Tools

On Ubuntu, you'll need to enable the multiverse repository in /etc/apt/sources.list to find the ec2-api-tools package:
{noformat}
apt-get update
apt-get install ec2-api-tools
{noformat}
Once installed, verify you have access to EC2 by executing:
{noformat}
ec2-describe-images -x all | grep hadoop
{noformat}

h4. Install Hadoop 0.20.2 Locally

You need to install Hadoop locally in order to get access to the EC2 cluster deployment scripts. We use {{/mnt/dev}} as the base working directory because this process was originally conducted on an EC2 instance; be sure to replace this path with the correct path for your environment as you work through these steps.
{noformat}
sudo mkdir -p /mnt/dev/downloads
sudo chown -R ubuntu:ubuntu /mnt/dev
cd /mnt/dev/downloads
wget http://apache.mirrors.hoobly.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
cd /mnt/dev
tar zxvf downloads/hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
{noformat}
The scripts we need are in $HADOOP_HOME/src/contrib/ec2. There are other approaches to deploying a Hadoop cluster on EC2, such as Cloudera's [CDH3|https://docs.cloudera.com/display/DOC/Cloudera+Documentation+Home+Page]. We chose the contrib/ec2 scripts because they are very easy to use, provided there is an existing Hadoop AMI available.

h4. Edit hadoop-ec2-env.sh

Open hadoop/src/contrib/ec2/bin/hadoop-ec2-env.sh in your editor and set the Amazon security variables to match your environment, for example:
{noformat}
AWS_ACCOUNT_ID=####-####-####
AWS_ACCESS_KEY_ID=???
AWS_SECRET_ACCESS_KEY=???
EC2_KEYDIR=/mnt/dev/aws
KEY_NAME=gsg-keypair
PRIVATE_KEY_PATH=/mnt/dev/aws/gsg-keypair.pem
{noformat}
The value of PRIVATE_KEY_PATH should point to your EC2 key-pair pem file, such as /mnt/dev/aws/gsg-keypair.pem. This key-pair must have been created in the US-East region.

For Mahout, we recommend the following settings:
{noformat}
HADOOP_VERSION=0.20.2
S3_BUCKET=453820947548/bixolabs-public-amis
ENABLE_WEB_PORTS=true
INSTANCE_TYPE="m1.xlarge"
{noformat}
You do not need to change any variables below the comment that reads "The following variables are only used when creating an AMI." These settings will create a cluster of EC2 xlarge instances using the Hadoop 0.20.2 AMI provided by Bixo Labs.

h4. Launch Hadoop Cluster
{noformat}
cd $HADOOP_HOME/src/contrib/ec2
bin/hadoop-ec2 launch-cluster mahout-clustering 2
{noformat}
This will launch 3 xlarge instances (two workers + one for the NameNode, aka "master"). It may take up to 5 minutes to launch a cluster named "mahout-clustering"; watch the console for errors. The cluster launches in the US-East region, so you won't incur any data transfer fees to/from US-Standard S3 buckets. You can re-use the cluster name for launching other clusters of different sizes. Behind the scenes, the Hadoop scripts create two EC2 security groups that configure the firewall for accessing your Hadoop cluster.

h4. Launch Proxy

Assuming your cluster launched successfully, establish a SOCKS tunnel to your master node to access the JobTracker Web UI from your local browser:
{noformat}
bin/hadoop-ec2 proxy mahout-clustering &
{noformat}
This command outputs the URLs for the JobTracker and NameNode Web UIs, such as:
{noformat}
JobTracker http://ec2-???-???-???-???.compute-1.amazonaws.com:50030
{noformat}

h4. Setup FoxyProxy (Firefox plug-in)

Once the FoxyProxy plug-in is installed in Firefox, go to Options > FoxyProxy Standard > Options to set up a proxy on localhost:6666 for the JobTracker and NameNode Web UI URLs from the previous step. For more information about FoxyProxy, please see [FoxyProxy|http://getfoxyproxy.org/downloads.html].

Now you are ready to run Mahout jobs in your cluster.

h2. Launch Clustering Job from Master Server

h4. Login to the master server
{noformat}
bin/hadoop-ec2 login mahout-clustering
{noformat}
Hadoop does not start until all EC2 instances are running; look for java processes on the master server using:
{noformat}
ps waux | grep java
{noformat}

h4. Install Mahout

Since this is EC2, you have the most disk space on the master node in /mnt.
{noformat}
mkdir -p /mnt/dev/downloads
cd /mnt/dev/downloads
wget http://apache.mesi.com.ar//mahout/0.4/mahout-distribution-0.4.tar.gz
cd /mnt/dev
tar zxvf downloads/mahout-distribution-0.4.tar.gz
ln -s mahout-distribution-0.4 mahout
{noformat}

h4. Configure Hadoop

You'll want to increase the max heap size for the child JVMs (mapred.child.java.opts) and set the correct number of reduce tasks based on the size of your cluster.
{noformat}
vi $HADOOP_HOME/conf/hadoop-site.xml
{noformat}
(NOTE: if this file doesn't exist yet, then the cluster nodes are still starting up. Wait a few minutes and then try again.)

Add the following properties:
{noformat}
<!-- Change 6 to the correct number for your cluster -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
{noformat}
You can safely run 3 reducers per node on EC2 xlarge instances with 4GB of max heap each. If you are using large instances, then you may only be able to have 2 per node, or 1 if your jobs are CPU-intensive.

h4. Copy the vectors from S3 to HDFS

Use Hadoop's distcp command to copy the vectors from S3 to HDFS.
{noformat}
hadoop distcp -Dmapred.task.timeout=1800000 \
  s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
  /asf-mail-archives/mahout-0.4/tfidf-vectors
{noformat}
The files are stored in a US-Standard S3 bucket, so there is no charge for data transfer to your EC2 cluster, which is running in the US-East region.

h4. Launch the clustering job (from the master server)
{noformat}
cd /mnt/dev/mahout
bin/mahout kmeans -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
  -c /asf-mail-archives/mahout-0.4/initial-clusters/ \
  -o /asf-mail-archives/mahout-0.4/kmeans-clusters/ \
  --numClusters 100 \
  --maxIter 10 \
  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
  --convergenceDelta 0.01 &
{noformat}
You can monitor the job using the JobTracker Web UI through FoxyProxy.

h4. Dump Clusters

Once the job completes, you can view the results using Mahout's clusterdump utility:
{noformat}
bin/mahout clusterdump --seqFileDir /asf-mail-archives/mahout-0.4/kmeans-clusters/clusters-1/ \
  --numWords 20 \
  --dictionary s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/dictionary.file-0 \
  --dictionaryType sequencefile --output clusters.txt --substring 100
{noformat}
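When you are finished, remember to shut the cluster down so you stop incurring EC2 instance charges. A hedged sketch using the same contrib/ec2 scripts as above, run from your workstation (not the master); note that anything stored only in HDFS is lost on termination, so copy clusters.txt and any other results off the master first:

```shell
# Sketch: tear down the cluster when finished (assumes the same contrib/ec2
# scripts and the "mahout-clustering" cluster name used in this walkthrough).
cd $HADOOP_HOME/src/contrib/ec2

# Terminate all EC2 instances in the cluster:
bin/hadoop-ec2 terminate-cluster mahout-clustering

# Optionally delete the two EC2 security groups the launch scripts created:
bin/hadoop-ec2 delete-cluster mahout-clustering
```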
