[ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated MAHOUT-588:
----------------------------------

    Attachment: SequenceFilesFromMailArchivesTest.java
                MailArchivesClusteringAnalyzerTest.java
                SequenceFilesFromMailArchives.java
                MailArchivesClusteringAnalyzer.java

Updated the EMR page in the Mahout wiki with the steps we used to create
vectors for benchmarking. Also, as requested by Grant, I've renamed the text
analyzer we're using from TamingAnalyzer to MailArchivesClusteringAnalyzer.
Added test cases for the new code.

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt, 
> MailArchivesClusteringAnalyzer.java, MailArchivesClusteringAnalyzerTest.java, 
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives.java, 
> SequenceFilesFromMailArchives2.java, SequenceFilesFromMailArchivesTest.java, 
> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, 
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, 
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, 
> TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java, 
> TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java, 
> clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log, 
> ec2_setup_notes.txt, ec2_setup_notes_v2.txt, ec2_setup_notes_v2.txt, 
> mahout-588_canopy.pdf, mahout-588_distribution.pdf, 
> prep_asf_mail_archives.sh, prep_asf_mail_archives.sh, 
> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms.  I've asked the two people doing the project to do
> all the work in the open here.  The goal is to use a publicly reusable
> dataset (for now, the ASF mail archives, assuming they are big enough), run
> on EC2, and make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer), and the results will be published on the Wiki as well as
> in the book.  This issue is to track the patches, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
