[ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992857#comment-12992857 ]

Timothy Potter commented on MAHOUT-588:
---------------------------------------

Here are the steps I take to vectorize using Amazon's Elastic MapReduce.

1. Install the elastic-mapreduce-ruby tool:

On Debian-based Linux:

sudo apt-get install ruby1.8
sudo apt-get install libopenssl-ruby1.8
sudo apt-get install libruby1.8-extras

Once these dependencies are installed, download and extract the 
elastic-mapreduce-ruby app:

mkdir -p /mnt/dev/elastic-mapreduce /mnt/dev/downloads
cd /mnt/dev/downloads
wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
cd /mnt/dev/elastic-mapreduce
unzip /mnt/dev/downloads/elastic-mapreduce-ruby.zip

# create a file named credentials.json in /mnt/dev/elastic-mapreduce
# see: http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1
# credentials.json should contain the following, note the region is significant

{
  "access-id":     "ACCESS_KEY",
  "private-key":   "SECRET_KEY",
  "key-pair":      "gsg-keypair",
  "key-pair-file": "/mnt/dev/aws/gsg-keypair.pem",
  "region":        "us-east-1",
  "log-uri":       "s3n://BUCKET/asf-mail-archives/logs/"
}

Also, it's a good idea to add /mnt/dev/elastic-mapreduce to your PATH.
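For example, assuming a Bash shell, you can append the tool's directory to your PATH in ~/.bashrc (or your shell's equivalent):

```shell
# Make the elastic-mapreduce command available without a full path
export PATH="$PATH:/mnt/dev/elastic-mapreduce"
```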

2. Once elastic-mapreduce is installed, start a cluster with no jobflow steps 
yet:

elastic-mapreduce --create --alive \
  --log-uri s3n://BUCKET/asf-mail-archives/logs/ \
  --key-pair gsg-keypair \
  --slave-instance-type m1.xlarge \
  --master-instance-type m1.xlarge \
  --num-instances # \
  --name mahout-0.4-vectorize \
  --bootstrap-action \
  s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

This will create an EMR Job Flow named "mahout-0.4-vectorize" in the US-East 
region. Take note of
the Job ID returned as you will need it to add the "seq2sparse" step to the Job 
Flow.

I'll leave it to you to decide how many instances to allocate, but keep in mind 
that one will be
dedicated as the master. Also, it took about 75 minutes to run the seq2sparse 
job on 19 xlarge 
instances (~190 normalized instance hours -- not cheap). I think you'll be safe 
to use about 10-13
instances and still finish in under 2 hours.
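As a back-of-envelope check on that figure (assuming EMR's billing multiplier of 8 normalized instance hours per clock hour for an m1.xlarge instance):

```shell
# 19 m1.xlarge instances for ~75 minutes, where each m1.xlarge clock hour
# counts as 8 normalized instance hours (an assumption about EMR billing):
echo $(( 19 * 8 * 75 / 60 ))   # -> 190 normalized instance hours
```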

Also, notice I'm using Amazon's bootstrap-action for configuring the cluster to 
run memory intensive
jobs. For more information about this, see:
http://buyitnw.appspot.com/forums.aws.amazon.com/ann.jspa?annID=834

3. Mahout JAR

The Mahout 0.4 Jobs JAR with our TamingAnalyzer is available at:
s3://thelabdude/mahout-examples-0.4-job-tt.jar

If you need to change other Mahout code, then you'll need to post your own JAR 
to S3.
Remember to reference the JAR using the s3n Hadoop protocol.

4. Schedule a jobflow step to vectorize using Mahout's seq2sparse:

elastic-mapreduce --jar s3n://thelabdude/mahout-examples-0.4-job-tt.jar \
--main-class org.apache.mahout.driver.MahoutDriver \
--arg seq2sparse \
--arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
--arg -o --arg /asf-mail-archives/mahout-0.4/vectors/ \
--arg --weight --arg tfidf \
--arg --chunkSize --arg 100 \
--arg --minSupport --arg 400 \
--arg --minDF --arg 20 \
--arg --maxDFPercent --arg 80 \
--arg --norm --arg 2 \
--arg --numReducers --arg ## \
--arg --analyzerName --arg org.apache.mahout.text.TamingAnalyzer \
--arg --maxNGramSize --arg 2 \
--arg --minLLR --arg 50 \
--enable-debugging \
-j JOB_ID

These settings are pretty aggressive in order to reduce the vectors
to around 100,000 dimensions.

IMPORTANT: Set the number of reducers to 2 x (N-1), where N is the size of your cluster.
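For instance, on a hypothetical 13-instance cluster, one node is dedicated as the master, leaving 12 workers:

```shell
# Reducer count for a 13-node cluster: 2 x (N - 1)
N=13                          # total instances, including the master
REDUCERS=$(( 2 * (N - 1) ))
echo $REDUCERS                # -> 24
```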

The job will send output to HDFS instead of S3 (see MAHOUT-598). Once the job
completes, we'll copy the results to S3 from our cluster's HDFS using distcp.

NOTE: To monitor the status of the job, use:
elastic-mapreduce --logs -j JOB_ID

5. Save log after completion

Once the job completes, save the log output for further analysis:

elastic-mapreduce --logs -j JOB_ID > seq2sparse.log

6. SSH into the master node to run distcp:

elastic-mapreduce --ssh -j JOB_ID

hadoop fs -lsr /asf-mail-archives/mahout-0.4/vectors/
hadoop distcp /asf-mail-archives/mahout-0.4/vectors/ \
  s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/ &

Note: You will need all the output from the vectorize step in order to run 
Mahout's clusterdump.

7. Shut down your cluster

Once you've copied the seq2sparse output to S3, you can shut down your cluster.

elastic-mapreduce --terminate -j JOB_ID

Verify the cluster is terminated in your Amazon console.

8. Make the vectors public in S3 using the Amazon console or s3cmd:

s3cmd setacl --acl-public --recursive \
  s3://BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/

9. Dump out the size of the vectors

bin/mahout vectordump --seqFile \
  s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/tfidf-vectors/part-r-00000 \
  --sizeOnly | more


> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt, 
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, 
> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, 
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, 
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, 
> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords, 
> Uncompress.java, clusters1.txt, clusters_kMeans.txt, 
> distcp_large_to_s3_failed.log, ec2_setup_notes.txt, 
> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's 
> clustering algorithms.  I've asked the two doing the project to do all the 
> work in the open here.  The goal is to use a publicly reusable dataset (for 
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and 
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done 
> as a Vectorizer) and the publication of the results will be put up on the 
> Wiki as well as in the book.  This issue is to track the patches, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
