[jira] Commented: (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results

Szymon Chojnacki (JIRA) Thu, 03 Feb 2011 13:07:54 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990302#comment-12990302
 ]


Szymon Chojnacki commented on MAHOUT-588:
-----------------------------------------

Thank you for the advice.
As far as I can see, CosineDistance is currently partially optimized in the 
context of kMeans, and centroidLengthSquare is computed only once:

public double distance(double centroidLengthSquare, Vector centroid, Vector v) 

which is significantly faster than the standard

public double distance(Vector v1, Vector v2) 

However, it is assumed that v1 and v2 are sparse, and that the time of 
dotProduct is proportional to the number of non-empty coordinates in both vectors:

double dotProduct = v2.dot(v1); 

Your suggestion to implement v2 (the centroid vector) by means of a hashmap 
would definitely improve the speed of calculating the distance between points 
and centroids, and as a result the speed of kMeans itself.
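
To illustrate the idea, here is a minimal sketch in plain Java. It does not 
use Mahout's Vector classes: the class name CosineSketch, the use of 
Map<Integer, Double> for sparse vectors, and the zero-vector convention are 
all assumptions made for this example only. The dot product iterates over 
the point's non-empty coordinates and looks each one up in the hash-backed 
centroid, so its cost depends on the point alone, and the centroid's squared 
length is passed in so it can be computed once.

```java
import java.util.Map;

// Hypothetical sketch, not Mahout code: sparse vectors are plain maps
// from coordinate index to value.
final class CosineSketch {

    // Sum of squared values. For a centroid this can be computed once
    // and reused for every point compared against it.
    static double lengthSquared(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return sum;
    }

    // Iterates only over the point's non-empty coordinates; each lookup
    // in the hash-backed centroid takes expected constant time.
    static double dot(Map<Integer, Double> centroid, Map<Integer, Double> point) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : point.entrySet()) {
            Double c = centroid.get(e.getKey());
            if (c != null) {
                sum += c * e.getValue();
            }
        }
        return sum;
    }

    // Cosine distance with the centroid's squared length supplied by the caller.
    static double distance(double centroidLengthSquare,
                           Map<Integer, Double> centroid,
                           Map<Integer, Double> point) {
        double denominator = Math.sqrt(centroidLengthSquare)
                * Math.sqrt(lengthSquared(point));
        if (denominator == 0.0) {
            // Convention chosen for this sketch only: a zero vector is
            // treated as maximally distant.
            return 1.0;
        }
        // Clamp at zero to absorb floating-point rounding.
        return Math.max(0.0, 1.0 - dot(centroid, point) / denominator);
    }
}
```

In a kMeans assignment step, lengthSquared(centroid) would be called once per 
centroid and the result passed to distance(...) for every point.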

Regards

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: SequenceFilesFromMailArchives.java, 
> SequenceFilesFromMailArchives2.java, TamingAnalyzer.java, 
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, 
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, 
> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords, 
> Uncompress.java, clusters1.txt, clusters_kMeans.txt, 
> distcp_large_to_s3_failed.log, seq2sparse_small_failed.log, 
> seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's 
> clustering algorithms.  I've asked the two doing the project to do all the 
> work in the open here.  The goal is to use a publicly reusable dataset (for 
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and 
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done 
> as a Vectorizer) and the publication of the results will be put up on the 
> Wiki as well as in the book.  This issue is to track the patches, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
