[jira] Commented: (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results

Isabel Drost (JIRA) Thu, 24 Feb 2011 01:28:08 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998780#comment-12998780
 ]


Isabel Drost commented on MAHOUT-588:
-------------------------------------

I think there are three aspects of your implementation that should be 
documented:

Anything special you had to do to get Mahout up and running on EC2/EMR that 
is not yet covered in the respective wiki pages would be great to have added 
there.

I'd suggest adding your benchmarking findings (running times, experimental 
results, size of the corpus used for testing, any performance comparison 
graphs you generated) to the Benchmark Wiki page:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Benchmarks

As for the general benchmarking setup (design of your implementation, how to 
install and run it, limitations and constraints), I think that would be nice 
to have on a separate wiki page linked from the "Implementations" section:

https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground

It might make sense to link those pages to each other to make the 
information easier to discover.


> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt, 
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, 
> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, 
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, 
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, 
> TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java, 
> TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java, 
> clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log, 
> ec2_setup_notes.txt, seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's 
> clustering algorithms.  I've asked the two doing the project to do all the 
> work in the open here.  The goal is to use a publicly reusable dataset (for 
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and 
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done 
> as a Vectorizer) and the publication of the results will be put up on the 
> Wiki as well as in the book.  This issue is to track the patches, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
