[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

JIRA Fri, 07 Aug 2009 13:03:46 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740694#action_12740694
 ]


Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

{quote}
Where in the patch do you move an object allocation from inside the loop to 
outside? I do not see this. 
{quote}

Here:
{code:title=getDistanceSquared() in SparseVector.java - Original 
|borderStyle=solid}
    double result = 0.0;
    Iterator<Vector.Element> iter = iterateNonZero();
    while (iter.hasNext()) {
      Vector.Element elt = iter.next();
      double centroidValue = v.getQuick(elt.index());
      double delta = elt.get() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue);
    }
    return result;
{code}

{code:title=getDistanceSquared() in SparseVector.java - Patched 
|borderStyle=solid}
  ... 
    double result = 0.0;
    Iterator<Vector.Element> iter = iterateNonZero();
    Vector.Element elt = null;
    double centroidValue = 0.0;
    double delta = 0.0;
    while (iter.hasNext()) {
      elt = iter.next();
      centroidValue = v.getQuick(elt.index());
      delta = elt.get() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue);
    }
...
{code}

And here:

{code:title=emitPointToNearestCluster() in Cluster.java - Patched 
|borderStyle=solid}
 ...
    double distance = 0.0;
    Vector clusterCenter = null;
     for (Cluster cluster : clusters) {
      clusterCenter = cluster.getCenter();
      distance = measure.distance(clusterCenter.getLengthSquared(), 
clusterCenter, point);
      if (distance < nearestDistance) {
         nearestCluster = cluster;
         nearestDistance = distance;
       }
...
{code}

{code:title=emitPointToNearestCluster() in Cluster.java - From Grant patch 
|borderStyle=solid}
     Cluster nearestCluster = null;
     double nearestDistance = Double.MAX_VALUE;
     for (Cluster cluster : clusters) {
      Vector clusterCenter = cluster.getCenter();
      double distance = measure.distance(clusterCenter.getLengthSquared(), 
clusterCenter, point);
       if (nearestCluster == null || distance < nearestDistance) {
         nearestCluster = cluster;
         nearestDistance = distance;
{code}

And in outputPointWithClusterInfo(), which is actually the same bit of code as 
this last one.

> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
>                 Key: MAHOUT-121
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
> MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
> MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
> MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
> MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has few tens of thousands of feature items but each 
> vector has only couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found the archives of 
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization.  The test setup had 2000 document  vectors 
> with few hundred items.  I ran canopy generation with Euclidean distance and 
> t1, t2 values as 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives 
> is computationally expensive. I changed the sparse vector  implementation to 
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Reply via email to