[ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990202#comment-12990202 ]
Grant Ingersoll commented on MAHOUT-588:
----------------------------------------

I'm guessing he means one run of the clustering algorithm, not one iteration of the k-means algorithm, but I'll let him say for sure

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: SequenceFilesFromMailArchives.java,
> SequenceFilesFromMailArchives2.java, TamingAnalyzer.java,
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java,
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords,
> Uncompress.java, clusters1.txt, clusters_kMeans.txt,
> distcp_large_to_s3_failed.log, seq2sparse_small_failed.log,
> seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira