[
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Potter updated MAHOUT-588:
----------------------------------
Attachment: distcp_large_to_s3_failed.log
seq2sparse_small_failed.log
seq2sparse_xlarge_ok.log
The vectorization process using seq2sparse is complete, and the resulting vectors
are available in my S3 bucket:
s3://thelabdude/asf-mail-archives/vectors/
(Note: I'll move them to Grant's asf-mail-archives bucket once we have some of the
clustering algorithms working, as I didn't want to move all this data around if
it's not correct.)
Here are the parameters I used to create the vectors:
org.apache.mahout.driver.MahoutDriver seq2sparse \
-i s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
-o /asf-mail-archives/mahout-0.4/vectors/ \
--weight tfidf --chunkSize 100 --minSupport 2 \
--minDF 1 --maxDFPercent 90 --norm 2 \
--numReducers 31 --sequentialAccessVector
Vectorizing the sequence files took some serious horsepower: 52 minutes on a
19-node cluster of extra-large instances in EMR with 31 reducers. The log from
the successful run is attached -- seq2sparse_xlarge_ok.log. Notice that I built
sequential-access vectors (which I've heard may help with k-means performance).
A few lessons learned:
- The resulting tf-vectors and tfidf-vectors files are large (~11.5GB), so you
need at least 3 reducers if you intend to load the vectors into S3, where the
maximum object size is 5GB. I'm storing the vectors in S3 so that we can re-use
them across multiple clustering runs.
- The MR job has 20 steps and benefits greatly from distributing the
processing; don't try to vectorize this much data on a single node, and
multiple reducers are a must!
- After failing to get this working on a development machine, I started with a
cluster of 9 m1.small instances (in Amazon EMR) and the job crashed (see
attached log: seq2sparse_small_failed.log). I then used a cluster of 13 large
instances; the process completed successfully after a couple of hours, but I
wasn't able to distcp the results to S3 -- a real bummer! (see attached log:
distcp_large_to_s3_failed.log). This may be a configuration issue with Amazon's
EMR large instances, since xlarge works as expected.
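A quick back-of-the-envelope check of the reducer count from the first lesson
above (sizes taken from this comment; 5GB was the single-object upload limit on
S3 at the time):

```shell
# The tfidf-vectors output is ~11.5GB and a single S3 object maxes out at
# 5GB, so at least ceil(11.5 / 5) = 3 reducers are needed to keep each
# part file under the limit.
total_mb=11776      # ~11.5GB expressed in MB
s3_limit_mb=5120    # 5GB in MB
min_reducers=$(( (total_mb + s3_limit_mb - 1) / s3_limit_mb ))
echo "$min_reducers"
```

With 31 reducers, as in the run above, each part file stays well under the
limit.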
Here is the ls output for the aforementioned bucket:
$ s3cmd ls s3://thelabdude/asf-mail-archives/vectors/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/df-count/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/tf-vectors/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/tfidf-vectors/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/tokenized-documents/
                       DIR   s3://thelabdude/asf-mail-archives/vectors/wordcount/
2011-01-30 15:29         0   s3://thelabdude/asf-mail-archives/vectors/df-count_$folder$
2011-01-30 15:30  70926210   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-0
2011-01-30 15:32  70863447   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-1
2011-01-30 15:33  70892506   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-2
2011-01-30 15:31  70877571   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-3
2011-01-30 15:32  70824816   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-4
2011-01-30 15:34  70895476   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-5
2011-01-30 15:35  40982506   s3://thelabdude/asf-mail-archives/vectors/dictionary.file-6
2011-01-30 15:36  37160153   s3://thelabdude/asf-mail-archives/vectors/frequency.file-0
2011-01-30 15:36  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-1
2011-01-30 15:30  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-2
2011-01-30 15:31  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-3
2011-01-30 15:32  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-4
2011-01-30 15:33  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-5
2011-01-30 15:34  37160173   s3://thelabdude/asf-mail-archives/vectors/frequency.file-6
2011-01-30 15:34   1727033   s3://thelabdude/asf-mail-archives/vectors/frequency.file-7
Now on to running the clustering algorithms!
Szymon plans to start on the algorithms in the following order:
canopy -> k-means -> fuzzy -> mean-shift -> dirichlet
I'll start at the other end, beginning with dirichlet and working backwards.
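For reference, a canopy -> k-means run over the tfidf vectors above might look
like the following. This is a hedged sketch, not a command from this issue: the
flags follow the Mahout 0.4-era driver options, but the t1/t2 thresholds,
convergence delta, and output paths are illustrative placeholders, not tested
values.

```shell
# Canopy pass to seed initial centroids (t1/t2 are placeholder thresholds).
org.apache.mahout.driver.MahoutDriver canopy \
  -i /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors \
  -o /asf-mail-archives/mahout-0.4/canopy-centroids \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 0.5 -t2 0.3

# K-means pass seeded from the canopy centroids; -cl emits the final
# point-to-cluster assignments.
org.apache.mahout.driver.MahoutDriver kmeans \
  -i /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors \
  -c /asf-mail-archives/mahout-0.4/canopy-centroids \
  -o /asf-mail-archives/mahout-0.4/kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.01 -x 20 -cl
```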
> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
> Key: MAHOUT-588
> URL: https://issues.apache.org/jira/browse/MAHOUT-588
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Attachments: distcp_large_to_s3_failed.log,
> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log,
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java,
> Uncompress.java
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.