[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740589#action_12740589 ]
Sean Owen edited comment on MAHOUT-121 at 8/7/09 8:38 AM: ---------------------------------------------------------- Are we talking about the same patch? sections like this? {noformat} + double distance = 0.0; + Vector clusterCenter = null; for (Cluster cluster : clusters) { - double distance = measure.distance(point, cluster.getCenter()); - if (nearestCluster == null || distance < nearestDistance) { + clusterCenter = cluster.getCenter(); + distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point); + if (distance < nearestDistance) { nearestCluster = cluster; nearestDistance = distance; } {noformat} The question here is indeed whether to declare distance and clusterCenter inside or outside the loop. While I completely agree with your statements, they do not seem to pertain to this. I think we're not talking about the same thing, since you pick up on the point of creating Strings in a loop -- which both my examples did. That is not a difference and not germane to the point I have in mind. I am further making the point that there is not a tradeoff between readability and speed here -- putting the declaration outside the loop actually makes both worse. I am quite sure this should be reversed. was (Author: srowen): Are we talking about the same patch? sections like this? + double distance = 0.0; + Vector clusterCenter = null; for (Cluster cluster : clusters) { - double distance = measure.distance(point, cluster.getCenter()); - if (nearestCluster == null || distance < nearestDistance) { + clusterCenter = cluster.getCenter(); + distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point); + if (distance < nearestDistance) { nearestCluster = cluster; nearestDistance = distance; } The question here is indeed whether to declare distance and clusterCenter inside or outside the loop. While I completely agree with your statements, they do not seem to pertain to this. I think we're not talking about the same thing, since you pick up on the point of creating Strings in a loop -- which both my examples did. That is not a difference and not germane to the point I have in mind. I am further making the point that there is not a tradeoff between readability and speed here -- putting the declaration outside the loop actually makes both worse. I am quite sure this should be reversed. > Speed up distance calculations for sparse vectors > ------------------------------------------------- > > Key: MAHOUT-121 > URL: https://issues.apache.org/jira/browse/MAHOUT-121 > Project: Mahout > Issue Type: Improvement > Components: Matrix > Reporter: Shashikant Kore > Assignee: Grant Ingersoll > Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, > MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, > MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, > MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, > MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch > > > From my mail to the Mahout mailing list. > I am working on clustering a dataset which has thousands of sparse vectors. > The complete dataset has few tens of thousands of feature items but each > vector has only couple of hundred feature items. For this, there is an > optimization in distance calculation, a link to which I found the archives of > Mahout mailing list. > http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ > I tried out this optimization. The test setup had 2000 document vectors > with few hundred items. I ran canopy generation with Euclidean distance and > t1, t2 values as 250 and 200. > > Current Canopy Generation: 28 min 15 sec. > Canopy Generation with distance optimization: 1 min 38 sec. > I know by experience that using Integer, Double objects instead of primitives > is computationally expensive. I changed the sparse vector implementation to > used primitive collections by Trove [ > http://trove4j.sourceforge.net/ ]. > Distance optimization with Trove: 59 sec > Current canopy generation with Trove: 21 min 55 sec > To sum, these two optimizations reduced cluster generation time by a 97%. > Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. > > Licensing of Trove seems to be an issue which needs to be addressed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.