[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720839#action_12720839
 ] 

Jeff Eastman commented on MAHOUT-121:
-------------------------------------

The patch fails to install. More hand-editing?

jeff-eastmans-mac-pro:Mahout jeff$ patch -p0 < MAHOUT-121\(2\).patch 
patching file 
core/src/test/java/org/apache/mahout/matrix/TestOrderedIntDoubleMapping.java
patching file 
core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java
Hunk #1 FAILED at 242.
Hunk #2 FAILED at 259.
Hunk #3 FAILED at 274.
Hunk #4 FAILED at 294.
Hunk #5 FAILED at 416.
5 out of 5 hunks FAILED -- saving rejects to file 
core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java.rej
patching file core/src/test/java/org/apache/mahout/matrix/VectorTest.java
patching file core/src/test/java/org/apache/mahout/matrix/TestDenseVector.java
patching file 
core/src/main/java/org/apache/mahout/clustering/dirichlet/models/NormalModel.java
patching file core/src/main/java/org/apache/mahout/matrix/VectorView.java
patching file 
core/src/main/java/org/apache/mahout/clustering/dirichlet/models/AsymmetricSampledNormalModel.java
patching file 
core/src/main/java/org/apache/mahout/utils/ManhattanDistanceMeasure.java
patching file 
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
can't find file to patch at input line 598
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: 
trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.java
|===================================================================
|--- 
trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.java
    (revision 780473)
|+++ 
trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.java
    Tue Jun 16 20:58:30 BST 2009
--------------------------
File to patch: ^C


> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
>                 Key: MAHOUT-121
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>         Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
> MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has few tens of thousands of feature items but each 
> vector has only couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found the archives of 
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization.  The test setup had 2000 document  vectors 
> with few hundred items.  I ran canopy generation with Euclidean distance and 
> t1, t2 values as 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives 
> is computationally expensive. I changed the sparse vector  implementation to 
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to