[
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990207#comment-12990207
]
Szymon Chojnacki edited comment on MAHOUT-588 at 2/3/11 6:10 PM:
-----------------------------------------------------------------
One iteration was the most I could get after a few hours of struggling with both
Mahout and Hadoop. The problem is a tricky one. I am now running a 10-iteration
job. A short description of the problem:
0. I started a typical 10-iteration job
1. 60 'canopies' were initialized with RandomSeedGeneration successfully
2. The first iteration ran successfully over the canopies and wrote its output
to /clusters-1
3. The second iteration threw a heap-overflow error
- I suspected a memory leak in KMeansDriver,
- so I set up a single iteration of KMeansDriver with the canopies in /clusters-1
- the memory problem appeared again
- this surprised me, because there is virtually no difference between
iteration 1 and iteration 2 (or so I thought)
4. The problem turned out to be that the random seed centroids are very
sparse, while the centroids we get after the first iteration are very dense. The
60 random seeds take 114KB; the 60 centroids after the first iteration take
>400MB! I had mapred.tasktracker.map.tasks.maximum=40, so I ran out of
memory quickly during the setup of KMeansMapper.
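The densification in step 4 can be sketched outside of Mahout: a random seed is a single sparse document vector, but the iteration-1 centroid is the mean of every vector assigned to the cluster, so its non-zero positions accumulate. The dimensions and counts below are hypothetical, chosen only to illustrate the effect:

```java
import java.util.Random;

// Sketch (not Mahout code): why k-means centroids densify after one iteration.
// A random seed is one sparse document vector; the iteration-1 centroid sums
// every vector in the cluster, so its non-zeros pile up across dimensions.
public class DensityDemo {

    // Accumulate `docs` sparse vectors (nnz random non-zeros each) into one
    // dense centroid array of length `dim`.
    static double[] accumulate(int dim, int docs, int nnz, long seed) {
        Random rnd = new Random(seed);
        double[] centroid = new double[dim];
        for (int d = 0; d < docs; d++) {
            for (int k = 0; k < nnz; k++) {
                centroid[rnd.nextInt(dim)] += 1.0;
            }
        }
        return centroid;
    }

    static int nonZeroCount(double[] v) {
        int n = 0;
        for (double x : v) if (x != 0.0) n++;
        return n;
    }

    public static void main(String[] args) {
        // One seed document: 20 non-zeros out of 10,000 dimensions (0.2% dense).
        // The centroid of 1,000 such documents is mostly non-zero.
        double[] centroid = accumulate(10_000, 1_000, 20, 42L);
        System.out.println("centroid non-zeros: " + nonZeroCount(centroid));
    }
}
```

The same effect explains the 114KB-to->400MB jump: a serialized dense centroid stores every dimension, while a sparse seed stores only its few non-zeros.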
5. I played with various -Xmx vs. max-mappers configurations. I don't get an
error with:
-Xmx3500 and mapred.tasktracker.map.tasks.maximum=1
I get an error with:
-Xmx2000 and mapred.tasktracker.map.tasks.maximum=2
I think I cannot run more than 2 mappers with more than -Xmx2000, as I have 6GB
nodes :-(
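The constraint in step 5 is just the product of concurrent map slots and per-task heap staying under node RAM: 2 × 2000MB plus JVM and OS overhead already crowds a 6GB node, and the original 40 slots could never fit centroids of >400MB each. A minimal sketch of the Hadoop-0.20-era mapred-site.xml settings involved (property names for that era; the exact values here are the working configuration from step 5):

```xml
<!-- mapred-site.xml: slots-per-node × per-task heap must fit in node RAM -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value> <!-- one concurrent map slot per node -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3500m</value> <!-- per-task heap for map/reduce child JVMs -->
</property>
```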
> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
> Key: MAHOUT-588
> URL: https://issues.apache.org/jira/browse/MAHOUT-588
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Attachments: SequenceFilesFromMailArchives.java,
> SequenceFilesFromMailArchives2.java, TamingAnalyzer.java,
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java,
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords,
> Uncompress.java, clusters1.txt, clusters_kMeans.txt,
> distcp_large_to_s3_failed.log, seq2sparse_small_failed.log,
> seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira