[ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated MAHOUT-588:
----------------------------------

    Attachment: distcp_large_to_s3_failed.log
                seq2sparse_small_failed.log
                seq2sparse_xlarge_ok.log

The vectorization process using seq2sparse is complete, and the vectors are 
available in my S3 bucket:

s3://thelabdude/asf-mail-archives/vectors/

(Note: I'll move these to Grant's asf-mail-archives bucket once we have some of 
the clustering algorithms working, as I didn't want to move all this data 
around if it's not correct.)

Here are the parameters I used to create the vectors:

org.apache.mahout.driver.MahoutDriver seq2sparse \
  -i s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
  -o /asf-mail-archives/mahout-0.4/vectors/ \
  --weight tfidf --chunkSize 100 --minSupport 2 \
  --minDF 1 --maxDFPercent 90 --norm 2 \
  --numReducers 31 --sequentialAccessVector

Vectorizing the sequence files took some serious horsepower: 52 minutes on a 
19-node cluster of extra-large instances in EMR with 31 reducers. The log from 
the successful run is attached (seq2sparse_xlarge_ok.log). Notice that I built 
sequential-access vectors, which I've heard may help with k-means performance.

A few lessons learned:

 - The resulting tf-vectors and tfidf-vectors files are large (~11.5GB), so you 
need at least 3 reducers if you intend to load the vectors into S3, since the 
max S3 file size is 5GB. I'm storing the vectors in S3 so that we can re-use 
them for multiple clustering job runs.

 - The MR job has 20 steps and benefits greatly from distributed processing; 
don't try to vectorize this much data on a single node, and multiple reducers 
are a must!

 - After failing to get this working on a development machine, I started with a 
cluster of 9 m1.small instances (in Amazon EMR) and the job crashed (see 
attached log: seq2sparse_small_failed.log). Then I used a cluster of 13 large 
instances; the process completed successfully after a couple of hours, but I 
wasn't able to "distcp" the results to S3 -- real bummer! (see attached log: 
distcp_large_to_s3_failed.log). This may be a configuration issue with Amazon's 
EMR large instances, since xlarge works as expected.
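The reducer-count lesson above is just back-of-the-envelope arithmetic; here is 
a quick sanity check, using the ~11.5GB output size mentioned above and the 
5GB single-object limit (both figures are approximate):

```python
import math

S3_MAX_FILE_GB = 5.0   # max single-object size S3 would accept at the time
OUTPUT_GB = 11.5       # approximate total size of the tfidf-vectors output

# Each reducer writes one part file, so the reducer count must be high enough
# that no single part file exceeds the S3 object-size limit.
min_reducers = math.ceil(OUTPUT_GB / S3_MAX_FILE_GB)
print(min_reducers)  # -> 3
```

Any reducer count at or above this minimum (we used 31) keeps each part file 
small enough to upload.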

Here is the ls output for the aforementioned bucket:

$ s3cmd ls s3://thelabdude/asf-mail-archives/vectors/

                       DIR   s3://thelabdude/asf-mail-archives/vectors/df-count/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/tf-vectors/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/tfidf-vectors/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/tokenized-documents/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/wordcount/
2011-01-30 15:29         0   s3://thelabdude/asf-mail-archives/vectors/df-count_$folder$
2011-01-30 15:30  70926210   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-0
2011-01-30 15:32  70863447   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-1
2011-01-30 15:33  70892506   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-2
2011-01-30 15:31  70877571   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-3
2011-01-30 15:32  70824816   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-4
2011-01-30 15:34  70895476   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-5
2011-01-30 15:35  40982506   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-6
2011-01-30 15:36  37160153   s3://thelabdude/asf-mail-archives/vectors/frequency.file-0
2011-01-30 15:36  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-1
2011-01-30 15:30  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-2
2011-01-30 15:31  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-3
2011-01-30 15:32  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-4
2011-01-30 15:33  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-5
2011-01-30 15:34  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-6
2011-01-30 15:34   1727033   s3://thelabdude/asf-mail-archives/vectors/frequency.file-7

Now on to running the clustering algorithms! 

Szymon plans to start on the algorithms in the following order:

canopy -> k-means -> fuzzy -> mean-shift -> dirichlet

I'll start at the other end, beginning with dirichlet and working backwards.
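For reference, the first k-means run will look roughly like the following. 
This is only a sketch: the flag names follow the Mahout 0.4-era drivers, and 
the initial-clusters path, -k value, iteration count, and distance measure are 
placeholders I've picked for illustration, not values settled in this issue. 
The command is assembled into a variable here rather than executed:

```shell
# Hypothetical k-means invocation over the tfidf-vectors produced above.
# All paths and numeric values below are placeholders, not final choices.
MAHOUT_CMD="bin/mahout kmeans \
  -i /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors \
  -c /asf-mail-archives/mahout-0.4/initial-clusters \
  -o /asf-mail-archives/mahout-0.4/kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -k 20 -cl"
echo "$MAHOUT_CMD"
```

We'll pin down the real cluster counts and distance measures as the benchmark 
runs proceed.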


> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: distcp_large_to_s3_failed.log, 
> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log, 
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, 
> Uncompress.java
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's 
> clustering algorithms.  I've asked the two doing the project to do all the 
> work in the open here.  The goal is to use a publicly reusable dataset (for 
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and 
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done 
> as a Vectorizer) and the publication of the results will be put up on the 
> Wiki as well as in the book.  This issue is to track the patches, etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
