[
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990207#comment-12990207
]
Szymon Chojnacki edited comment on MAHOUT-588 at 2/3/11 6:10 PM:
-----------------------------------------------------------------
One iteration was the most I could get after a few hours of struggling with both
Mahout and Hadoop. The problem is a tricky one. I am now running a 10-iteration
job. A short description of the problem:
0. I started a typical 10-iteration job
1. 60 'canopies' were initialized with RandomSeedGeneration successfully
2. The first iteration ran successfully over the canopies and wrote its output
to /clusters-1
3. The second iteration threw a heap-overflow error
- I suspected a memory leak in KMeansDriver,
- so I set up a single iteration of KMeansDriver with the canopies in /clusters-1
- the memory problem appeared again
- this surprised me, because there is virtually no difference between
iteration 1 and iteration 2 (or so I thought)
4. The problem turned out to be that the random seed centroids are very
sparse, while the centroids we get after the first iteration are very dense. The
60 random seeds take 114KB; the 60 centroids after the first iteration take
>400MB! I had mapred.tasktracker.map.tasks.maximum=40, so I ran out of
memory quickly during the setup of KMeansMapper.
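The densification in step 4 can be sketched outside of Mahout: a random seed is a single sparse document vector, but the iteration-1 centroid is the mean of every vector assigned to the cluster, so its non-zero positions accumulate. The dimensions and counts below are hypothetical, chosen only to illustrate the effect:

```java
import java.util.Random;

// Sketch (not Mahout code): why k-means centroids densify after one iteration.
// A random seed is one sparse document vector; the iteration-1 centroid sums
// every vector in the cluster, so its non-zeros pile up across dimensions.
public class DensityDemo {

    // Accumulate `docs` sparse vectors (nnz random non-zeros each) into one
    // dense centroid array of length `dim`.
    static double[] accumulate(int dim, int docs, int nnz, long seed) {
        Random rnd = new Random(seed);
        double[] centroid = new double[dim];
        for (int d = 0; d < docs; d++) {
            for (int k = 0; k < nnz; k++) {
                centroid[rnd.nextInt(dim)] += 1.0;
            }
        }
        return centroid;
    }

    static int nonZeroCount(double[] v) {
        int n = 0;
        for (double x : v) if (x != 0.0) n++;
        return n;
    }

    public static void main(String[] args) {
        // One seed document: 20 non-zeros out of 10,000 dimensions (0.2% dense).
        // The centroid of 1,000 such documents is mostly non-zero.
        double[] centroid = accumulate(10_000, 1_000, 20, 42L);
        System.out.println("centroid non-zeros: " + nonZeroCount(centroid));
    }
}
```

The same effect explains the 114KB-to->400MB jump: a serialized dense centroid stores every dimension, while a sparse seed stores only its few non-zeros.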
5. I played with various -Xmx vs. max-mappers configurations. I don't get an
error with:
-Xmx3500 and mapred.tasktracker.map.tasks.maximum=1
I get an error with:
-Xmx2000 and mapred.tasktracker.map.tasks.maximum=2
I think I cannot run more than 2 mappers with more than -Xmx2000, as I have 6GB
nodes :-(
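The constraint in step 5 is just the product of concurrent map slots and per-task heap staying under node RAM: 2 × 2000MB plus JVM and OS overhead already crowds a 6GB node, and the original 40 slots could never fit centroids of >400MB each. A minimal sketch of the Hadoop-0.20-era mapred-site.xml settings involved (property names for that era; the exact values here are the working configuration from step 5):

```xml
<!-- mapred-site.xml: slots-per-node × per-task heap must fit in node RAM -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value> <!-- one concurrent map slot per node -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3500m</value> <!-- per-task heap for map/reduce child JVMs -->
</property>
```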
> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
> Key: MAHOUT-588
> URL: https://issues.apache.org/jira/browse/MAHOUT-588
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Attachments: SequenceFilesFromMailArchives.java,
> SequenceFilesFromMailArchives2.java, TamingAnalyzer.java,
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java,
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords,
> Uncompress.java, clusters1.txt, clusters_kMeans.txt,
> distcp_large_to_s3_failed.log, seq2sparse_small_failed.log,
> seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira