[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740587#action_12740587
 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

{quote}
The point illustrated by the String loop example has nothing to do with how 
variables are declared, and everything to do with the difference between String 
and StringBuilder. It doesn't seem to address the point previously raised.
{quote}

Not quite right. The difference between String and StringBuilder IS EXACTLY the 
difference between instantiating thousands of objects and re-using just one, 
which is, I believe, the matter at hand here.

{quote}
In fact the first is ever so slightly worse since it sets s to null, but the 
value is unused. But it is worse for another reason: s continues to point to 
"anotherString: 249999" after the loop terminates, which is also pointless.
{quote}

If you create new Strings in a loop, you end up with as many objects as 
iterations ("anotherString: 0", "anotherString: 1", ..., 
"anotherString: 121410", and so on), all waiting to be garbage-collected, which 
may not even happen in the short term. Even more pointless, following your logic.
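
To make the contrast concrete, here is a minimal sketch of the two loop shapes 
being argued about. It is a reconstruction for illustration, not the example 
originally posted in this thread; the class and method names are hypothetical, 
and the iteration count of 250000 is only inferred from the 
"anotherString: 249999" value quoted above.

```java
final class StringReuseSketch {

    // Builds a brand-new String on every iteration; each earlier value
    // becomes garbage as soon as s is reassigned.
    static String lastWithNewStrings(int iterations) {
        String s = null;
        for (int i = 0; i < iterations; i++) {
            s = "anotherString: " + i;
        }
        return s;
    }

    // Re-uses a single StringBuilder: the buffer is reset and refilled
    // on each iteration instead of being replaced by a new object.
    static String lastWithReusedBuilder(int iterations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < iterations; i++) {
            sb.setLength(0); // clear the contents, keep the allocated capacity
            sb.append("anotherString: ").append(i);
        }
        return sb.toString(); // one String is created here, after the loop
    }

    public static void main(String[] args) {
        System.out.println(lastWithNewStrings(250000));
        System.out.println(lastWithReusedBuilder(250000));
    }
}
```

Both methods end with the same text; they differ only in how many intermediate 
objects are left behind for the garbage collector. Whether that difference is 
measurable here is exactly what a profiler run would have to show.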

{quote}
Hence I would undo that part of the patch unless there is another purpose to it 
I missed.
{quote}

Perhaps someone could run a profiler with and without the latest patch? I tend 
to think the gain in execution speed would not be significant if any at all, as 
some of you have stated. However, unless code readability is a priority, I see 
no harm in changing something that can only help performance.

{quote}
This isn't an example of unrolling is it?
{quote}

That's right. It is not. It is about the cost of instantiation vs. reusability 
of short-lived objects.

> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
>                 Key: MAHOUT-121
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
> MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
> MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
> MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
> MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has a few tens of thousands of feature items but each 
> vector has only a couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found in the archives 
> of the Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors 
> with a few hundred items each. I ran canopy generation with Euclidean distance 
> and t1, t2 values of 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives 
> is computationally expensive. I changed the sparse vector implementation to 
> use primitive collections from Trove [ http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.
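
For readers following the issue description above, the distance optimization 
can be sketched roughly as below. This is an assumption about the general 
technique (expanding the squared Euclidean distance so that only the non-zero 
entries of the sparse vector are visited, with the centroid's squared norm 
computed once), not the code from any of the attached patches. The class, 
method, and parameter names are hypothetical, and plain primitive arrays stand 
in for the Trove collections mentioned in the description.

```java
final class SparseDistanceSketch {

    // Squared Euclidean norm of a dense vector. Intended to be computed
    // once per centroid and reused for every distance call.
    static double squaredNorm(double[] dense) {
        double sum = 0.0;
        for (double v : dense) {
            sum += v * v;
        }
        return sum;
    }

    // Squared distance between a sparse vector (parallel arrays of
    // non-zero indices and values) and a dense centroid, using
    //   |x - c|^2 = |c|^2 + sum over non-zero x_i of (x_i^2 - 2 * x_i * c_i)
    // so the loop length is the number of non-zeros, not the dimension.
    static double squaredDistance(int[] indices, double[] values,
                                  double[] centroid, double centroidSquaredNorm) {
        double sum = centroidSquaredNorm;
        for (int k = 0; k < indices.length; k++) {
            double x = values[k];
            sum += x * (x - 2.0 * centroid[indices[k]]);
        }
        // Rounding can push a tiny true distance slightly below zero.
        return Math.max(sum, 0.0);
    }

    public static void main(String[] args) {
        int[] indices = {1, 4};              // positions of the non-zero entries
        double[] values = {3.0, 2.0};        // the non-zero entries themselves
        double[] centroid = {1.0, 1.0, 0.0, 0.0, 4.0, 2.0};
        double cachedNorm = squaredNorm(centroid);
        double d2 = squaredDistance(indices, values, centroid, cachedNorm);
        System.out.println("Euclidean distance: " + Math.sqrt(d2));
    }
}
```

Using `int[]` and `double[]` rather than collections of `Integer` and `Double` 
avoids boxing in the inner loop, which is the second optimization the 
description refers to.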

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
