Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Use an Existing Hadoop AMI (https://cwiki.apache.org/confluence/display/MAHOUT/Use+an+Existing+Hadoop+AMI)
Edited by Grant Ingersoll
----
The following process was developed for launching Hadoop clusters on EC2 in order to benchmark Mahout's clustering algorithms against a large document set (see MAHOUT-588). Specifically, we used the ASF mail archives, which have been parsed, converted to Hadoop's block-compressed SequenceFile format, and saved to a public S3 folder: s3://asf-mail-archives/mahout-0.4/sequence-files. Overall, there are 6,094,444 key-value pairs in 283 files taking around 5.7GB of disk.

You can also use Amazon's Elastic MapReduce; see [Mahout on Elastic MapReduce|https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce]. However, using EC2 directly is slightly less expensive and provides greater visibility into the state of running jobs via the JobTracker Web UI.

You can launch the EC2 cluster from your development machine; the following instructions were generated on an Ubuntu workstation. We assume that you have successfully completed the [EC2 Getting Started Guide|http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/].

Note: this work was supported in part by the Amazon Web Services Apache Projects Testing Program.

h2. Launch Hadoop Cluster

h4. Gather Amazon EC2 keys / security credentials

You will need the following:
* AWS Account ID
* Access Key ID
* Secret Access Key
* X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)
* EC2 key-pair (ssh public and private keys) for the US-East region. Make sure the private key's file permissions are "-rw-------" (e.g. chmod 600 gsg-keypair.pem). You can create a key-pair for the US-East region using the Amazon console.

If you are confused about any of these terms, please see [Understanding Access Credentials for AWS/EC2|http://alestic.com/2009/11/ec2-credentials].
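If you prefer the command line, the key-pair can also be created with the ec2-api-tools instead of the Amazon console. The following is a hedged sketch, not part of the original walkthrough; it assumes the ec2-api-tools are installed (see below) and reuses the example key-pair name gsg-keypair:

```shell
# Sketch: create a US-East key-pair from the command line.
# The output of ec2-add-keypair includes a one-line KEYPAIR summary followed
# by the private key material; keep only the -----BEGIN ... END----- section.
ec2-add-keypair --region us-east-1 gsg-keypair > gsg-keypair.pem

# The private key must not be readable by anyone else:
chmod 600 gsg-keypair.pem
ls -l gsg-keypair.pem   # permissions should read -rw-------
```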
You should also export the EC2_PRIVATE_KEY and EC2_CERT environment variables to point to your AWS certificate and private key files, for example:
{noformat}
export EC2_PRIVATE_KEY=$DEV/aws/pk-aws.pem
export EC2_CERT=$DEV/aws/cert-aws.pem
{noformat}
These are used by the ec2-api-tools commands to interact with Amazon Web Services.

h4. Install and Configure the Amazon EC2 API Tools

On Ubuntu, you'll need to enable the multiverse repository in /etc/apt/sources.list to find the ec2-api-tools package:
{noformat}
apt-get update
apt-get install ec2-api-tools
{noformat}
Once installed, verify you have access to EC2 by executing:
{noformat}
ec2-describe-images -x all | grep hadoop
{noformat}

h4. Install Hadoop 0.20.2 Locally

You need to install Hadoop locally in order to get access to the EC2 cluster deployment scripts. We use {{/mnt/dev}} as the base working directory because this process was originally conducted on an EC2 instance; be sure to replace this path with the correct path for your environment as you work through these steps.
{noformat}
sudo mkdir -p /mnt/dev/downloads
sudo chown -R ubuntu:ubuntu /mnt/dev
cd /mnt/dev/downloads
wget http://apache.mirrors.hoobly.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
cd /mnt/dev
tar zxvf downloads/hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
{noformat}
The scripts we need are in $HADOOP_HOME/src/contrib/ec2. There are other approaches to deploying a Hadoop cluster on EC2, such as Cloudera's [CDH3|https://docs.cloudera.com/display/DOC/Cloudera+Documentation+Home+Page]. We chose the contrib/ec2 scripts because they are very easy to use, provided there is an existing Hadoop AMI available.

h4. Edit hadoop-ec2-env.sh

Open hadoop/src/contrib/ec2/bin/hadoop-ec2-env.sh in your editor and set the Amazon security variables to match your environment, for example:
{noformat}
AWS_ACCOUNT_ID=####-####-####
AWS_ACCESS_KEY_ID=???
AWS_SECRET_ACCESS_KEY=???
EC2_KEYDIR=/mnt/dev/aws
KEY_NAME=gsg-keypair
PRIVATE_KEY_PATH=/mnt/dev/aws/gsg-keypair.pem
{noformat}
The value of PRIVATE_KEY_PATH should point to your EC2 key-pair pem file, such as /mnt/dev/aws/gsg-keypair.pem. This key-pair must have been created in the US-East region.

For Mahout, we recommend the following settings:
{noformat}
HADOOP_VERSION=0.20.2
S3_BUCKET=453820947548/bixolabs-public-amis
ENABLE_WEB_PORTS=true
INSTANCE_TYPE="m1.xlarge"
{noformat}
You do not need to change any variables below the comment that reads "The following variables are only used when creating an AMI." These settings will create a cluster of EC2 xlarge instances using the Hadoop 0.20.2 AMI provided by Bixo Labs.

h4. Launch Hadoop Cluster
{noformat}
cd $HADOOP_HOME/src/contrib/ec2
bin/hadoop-ec2 launch-cluster mahout-clustering 2
{noformat}
This will launch 3 xlarge instances (two workers + one for the NameNode, aka "master"). It may take up to 5 minutes to launch a cluster named "mahout-clustering"; watch the console for errors. The cluster launches in the US-East region, so you won't incur any data transfer fees to/from US-Standard S3 buckets. You can re-use the cluster name for launching other clusters of different sizes. Behind the scenes, the Hadoop scripts create two EC2 security groups that configure the firewall for accessing your Hadoop cluster.

h4. Launch Proxy

Assuming your cluster launched successfully, establish a SOCKS tunnel to your master node to access the JobTracker Web UI from your local browser:
{noformat}
bin/hadoop-ec2 proxy mahout-clustering &
{noformat}
This command outputs the URLs for the JobTracker and NameNode Web UIs, such as:
{noformat}
JobTracker http://ec2-???-???-???-???.compute-1.amazonaws.com:50030
{noformat}

h4. Setup FoxyProxy (Firefox plug-in)

Once the FoxyProxy plug-in is installed in Firefox, go to Options > FoxyProxy Standard > Options to set up a proxy on localhost:6666 for the JobTracker and NameNode Web UI URLs from the previous step. For more information about FoxyProxy, please see [FoxyProxy|http://getfoxyproxy.org/downloads.html].

Now you are ready to run Mahout jobs in your cluster.

h2. Launch Clustering Job from Master Server

h4. Login to the master server
{noformat}
bin/hadoop-ec2 login mahout-clustering
{noformat}
Hadoop does not start until all EC2 instances are running; look for java processes on the master server using:
{noformat}
ps waux | grep java
{noformat}

h4. Install Mahout

Since this is EC2, you have the most disk space on the master node in /mnt.
{noformat}
mkdir -p /mnt/dev/downloads
cd /mnt/dev/downloads
wget http://apache.mesi.com.ar//mahout/0.4/mahout-distribution-0.4.tar.gz
cd /mnt/dev
tar zxvf downloads/mahout-distribution-0.4.tar.gz
ln -s mahout-distribution-0.4 mahout
{noformat}

h4. Configure Hadoop

You'll want to increase the max heap size for the child JVMs (mapred.child.java.opts) and set the correct number of reduce tasks based on the size of your cluster.
{noformat}
vi $HADOOP_HOME/conf/hadoop-site.xml
{noformat}
(NOTE: if this file doesn't exist yet, then the cluster nodes are still starting up. Wait a few minutes and then try again.)

Add the following properties:
{noformat}
<!-- Change 6 to the correct number for your cluster -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
{noformat}
You can safely run 3 reducers per node on EC2 xlarge instances with 4GB of max heap each. If you are using large instances, then you may only be able to have 2 per node, or 1 if your jobs are CPU-intensive.

h4. Copy the vectors from S3 to HDFS

Use Hadoop's distcp command to copy the vectors from S3 to HDFS.
{noformat}
hadoop distcp -Dmapred.task.timeout=1800000 \
  s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
  /asf-mail-archives/mahout-0.4/tfidf-vectors
{noformat}
The files are stored in a US-Standard S3 bucket, so there is no charge for data transfer to your EC2 cluster, which is running in the US-East region.

h4. Launch the clustering job (from the master server)
{noformat}
cd /mnt/dev/mahout
bin/mahout kmeans -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
  -c /asf-mail-archives/mahout-0.4/initial-clusters/ \
  -o /asf-mail-archives/mahout-0.4/kmeans-clusters/ \
  --numClusters 100 \
  --maxIter 10 \
  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
  --convergenceDelta 0.01 &
{noformat}
You can monitor the job using the JobTracker Web UI through FoxyProxy.

h4. Dump Clusters

Once the job completes, you can view the results using Mahout's clusterdump utility:
{noformat}
bin/mahout clusterdump --seqFileDir /asf-mail-archives/mahout-0.4/kmeans-clusters/clusters-1/ \
  --numWords 20 \
  --dictionary s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/dictionary.file-0 \
  --dictionaryType sequencefile --output clusters.txt --substring 100
{noformat}
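When you are finished, remember to shut the cluster down so you stop incurring EC2 instance charges. A hedged sketch using the same contrib/ec2 scripts as above, run from your workstation (not the master); note that anything stored only in HDFS is lost on termination, so copy clusters.txt and any other results off the master first:

```shell
# Sketch: tear down the cluster when finished (assumes the same contrib/ec2
# scripts and the "mahout-clustering" cluster name used in this walkthrough).
cd $HADOOP_HOME/src/contrib/ec2

# Terminate all EC2 instances in the cluster:
bin/hadoop-ec2 terminate-cluster mahout-clustering

# Optionally delete the two EC2 security groups the launch scripts created:
bin/hadoop-ec2 delete-cluster mahout-clustering
```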
