[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740681#action_12740681
 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

Sean, we are definitely not following each other. Probably due to my lack of 
communication skills.

{quote}
Your point about Strings is undeniable, but is irrelevant to this question. See 
my counter-example for an example which would be relevant to the point I am 
trying to make.
{quote}

How could it be irrelevant when it's the exact same point I tried to make in my 
patch? A Vector is being instantiated and allocated in every iteration of the 
for loop, in every task, in every reduce job, on every node of the cluster. 
And it is not necessary. Every time. The very same thing goes for my String 
example... except with Strings.

{quote}
Declaring outside the loop only incurs an extra initialization.
{quote}

Extra? There's only ONE initialization. If the declaration is done inside the 
loop, thousands of initializations are going to be done. That's thousands minus 
one "extra" initializations.
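
To make the allocation count concrete, here is a minimal sketch. It is not the 
code from the patch: CountingVector is a hypothetical stand-in for a vector 
class, all names are made up for illustration, and it assumes every point has 
the same length so that one instance can be reused.

```java
public class HoistAllocation {
    // Hypothetical stand-in for a vector class; it counts how many instances get created.
    static final class CountingVector {
        static int instances = 0;
        final double[] values;

        CountingVector(int size) {
            values = new double[size];
            instances++;
        }

        // Overwrites the contents in place so the same instance can be reused.
        void assign(double[] src) {
            System.arraycopy(src, 0, values, 0, values.length);
        }

        double sum() {
            double s = 0;
            for (double v : values) {
                s += v;
            }
            return s;
        }
    }

    // Creates a new vector on every iteration: N points means N allocations.
    static double allocateInsideLoop(double[][] points) {
        double total = 0;
        for (double[] p : points) {
            CountingVector v = new CountingVector(p.length);
            v.assign(p);
            total += v.sum();
        }
        return total;
    }

    // Creates the vector once before the loop and reuses it: one allocation in total.
    // Assumes all points have the same length.
    static double allocateOnce(double[][] points) {
        double total = 0;
        if (points.length == 0) {
            return total;
        }
        CountingVector v = new CountingVector(points[0].length);
        for (double[] p : points) {
            v.assign(p);
            total += v.sum();
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 2}, {3, 4}, {5, 6}};

        CountingVector.instances = 0;
        allocateInsideLoop(points);
        System.out.println("inside loop: " + CountingVector.instances + " allocations");

        CountingVector.instances = 0;
        allocateOnce(points);
        System.out.println("hoisted: " + CountingVector.instances + " allocations");
    }
}
```

Both methods compute the same result; only the number of objects created 
differs. Note that what costs something here is the `new`, not where the 
variable is declared.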

Grant, maybe you should leave the size comparison for now. It won't impact 
speed noticeably and, as of now, KMeans is only using the optimized distance 
calculation for both computing convergence and emitting points. Is there 
anywhere else a size check is done between input vectors? I believe there isn't.
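
For anyone following along, the kind of optimized distance calculation under 
discussion can be sketched roughly as below. This assumes the optimization is 
the algebraic expansion from the blog post linked in the issue description, 
|a - b|^2 = |a|^2 + |b|^2 - 2(a . b), with the squared norms cached and the 
dot product walking only the non-zero entries. SparseVec and its fields are 
hypothetical names for illustration, not Mahout's API.

```java
public class SparseDistance {
    // Hypothetical minimal sparse vector: parallel primitive arrays, indices sorted ascending.
    static final class SparseVec {
        final int[] idx;
        final double[] val;
        final double normSq; // squared Euclidean norm, computed once at construction

        SparseVec(int[] idx, double[] val) {
            this.idx = idx;
            this.val = val;
            double s = 0;
            for (double v : val) {
                s += v * v;
            }
            this.normSq = s;
        }
    }

    // Dot product that only visits indices present in both vectors (merge of two sorted lists).
    static double dot(SparseVec a, SparseVec b) {
        int i = 0;
        int j = 0;
        double s = 0;
        while (i < a.idx.length && j < b.idx.length) {
            if (a.idx[i] == b.idx[j]) {
                s += a.val[i] * b.val[j];
                i++;
                j++;
            } else if (a.idx[i] < b.idx[j]) {
                i++;
            } else {
                j++;
            }
        }
        return s;
    }

    // Euclidean distance via |a|^2 + |b|^2 - 2(a . b), using the cached norms.
    static double distance(SparseVec a, SparseVec b) {
        double dSq = a.normSq + b.normSq - 2 * dot(a, b);
        // Guard against a tiny negative value caused by floating-point rounding.
        return Math.sqrt(Math.max(0.0, dSq));
    }

    public static void main(String[] args) {
        SparseVec a = new SparseVec(new int[] {0, 5}, new double[] {3, 4});
        SparseVec b = new SparseVec(new int[] {5, 9}, new double[] {4, 12});
        System.out.println(distance(a, b));
    }
}
```

The cost per distance depends on the number of non-zero entries rather than 
on the full dimensionality, which is why no size comparison between the two 
input vectors is needed in this sketch.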

> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
>                 Key: MAHOUT-121
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
> MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
> MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
> MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
> MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has a few tens of thousands of feature items but each 
> vector has only a couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found in the archives 
> of the Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors 
> with a few hundred items each. I ran canopy generation with Euclidean distance 
> and t1, t2 values of 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives 
> is computationally expensive. I changed the sparse vector implementation to 
> use primitive collections by Trove [ http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.