[ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010173#comment-13010173 ]
Grant Ingersoll commented on MAHOUT-588:
----------------------------------------

See https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt,
> MailArchivesClusteringAnalyzer.java, MailArchivesClusteringAnalyzerTest.java,
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives.java,
> SequenceFilesFromMailArchives2.java, SequenceFilesFromMailArchivesTest.java,
> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java,
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java,
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
> TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java,
> TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java,
> clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log,
> ec2_setup_notes.txt, ec2_setup_notes_v2.txt, ec2_setup_notes_v2.txt,
> mahout-588_canopy.pdf, mahout-588_distribution.pdf,
> prep_asf_mail_archives.sh, prep_asf_mail_archives.sh,
> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.