[ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126482#comment-13126482 ]
Grant Ingersoll commented on MAHOUT-588:
----------------------------------------

I've turned off access to mine. You should now use the Amazon Public Dataset: http://aws.amazon.com/datasets/7791434387204566

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
> Key: MAHOUT-588
> URL: https://issues.apache.org/jira/browse/MAHOUT-588
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.5
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Fix For: 0.5
>
> Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt,
> MAHOUT-588.patch, MailArchivesClusteringAnalyzer.java,
> MailArchivesClusteringAnalyzerTest.java, SequenceFilesFromMailArchives.java,
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java,
> SequenceFilesFromMailArchivesTest.java, TamingAnalyzer.java,
> TamingAnalyzer.java, TamingAnalyzerTest.java, TamingCollocDriver.java,
> TamingCollocMapper.java, TamingDictVect.java,
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
> TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java,
> TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java,
> clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log,
> ec2_setup_notes.txt, ec2_setup_notes_v2.txt, ec2_setup_notes_v2.txt,
> mahout-588_canopy.pdf, mahout-588_distribution.pdf,
> prep_asf_mail_archives.sh, prep_asf_mail_archives.sh,
> prep_asf_mail_archives.sh, seq2sparse_small_failed.log,
> seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.