Hi Nicolas, We took the idea of optimized distance calculation from LingPipe. I suggest you to read this as I won't be able to communicate the idea as crisply. After reading this post, you will be able to relate the code.
http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ Yes, some of the method names are misnomers (or downright misleading) as there is no good term to describe the intermediate values. A minor correrction in the code you quoted. > SquaredEuclideanDistanceMeasure.java, currently, does the following: > > if (centroid.size() != v.size()) { > throw new CardinalityException(); > } > > double result = centroidLengthSquare; > result += v.getDistanceSquared(centroid); > return centroidLengthSquare + v.getDistanceSquared(centroid); Here, we don't call v.getDistanceSquared(centroid) again (which is redundant). It simply returns result calculated in the prvious step. Now coming to your implementation. > SparseVector.java > > @Override > public double getDistanceSquared(Vector v) { > double result = 0.0; > Iterator<Vector.Element> iter = iterateNonZero(); > Vector.Element elt; > double delta; > > while (iter.hasNext()) { > elt = iter.next(); > delta = elt.get() - v.getQuick(elt.index()); > result += (delta * delta); //- (centroidValue * centroidValue); > } > return result; > } The distance calculation here is not correct. In this code, the unique features of Vector v are ignored. To put differently, only the features of "this" vector are considered for calculation. For example, let's say, current vector v1 is [0:1, 3:1, 4:1, 6:1, 8:1] and the incoming parameter "v" is [0:1, 3:1, 5:1, 7:1, 8:1]. The code given above ignores the unique features of "v" namely 5 & 7. To get this correct, you have to consider the features of "v" as well. That, essentially, is SquaredEuclideanDistanceMeasure.distance(Vector v1, Vector v2). With the optimization of LingPipe, you will incur the calculations equal to the non-zero features in only one vector. You don't need to iterate on cetroid vector for every distance calculation. Let me know if I am off the mark here. --shashi On Tue, Jul 28, 2009 at 7:46 PM, nfantone<[email protected]> wrote: > Ok, this post is going to be a long one, and so it deserves its own > thread. My apologies beforehand. > > Here's what I have tried to ease the distance calculation problem. I > know it's quite nasty, but I wanted to ensure our bottleneck was > there, in the euclidean distance computation. > > But first, a little excursus: > > /* EXCURSUS */ > The "optimized" version for SparseVectors of distance() in > SquaredEuclideanDistanceMeasure.java, currently, does the following: > > if (centroid.size() != v.size()) { > throw new CardinalityException(); > } > > double result = centroidLengthSquare; > result += v.getDistanceSquared(centroid); > return centroidLengthSquare + v.getDistanceSquared(centroid); > > It expects a vector squared length, which is finally added to what > getDistanceSquared() returns. However, that method computes this > inside a while loop (for the comments, assume two 2D vectors [X0, X1] > and [Y0, Y1]): > > elt = iter.next(); > centroidValue = v.getQuick(elt.index()); > delta = elt.get() - centroidValue; > result += (delta * delta) - (centroidValue * centroidValue); // > This is (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1) - Y1*Y1; I > don't have the slightest idea what to call this value. > > I certainly couldn't understand the reason behind that (centroidValue > * centroidValue) subtraction. Being called getDistanceSquared, the > method should return just that... and that is the value one gets by > just ignoring that substraction, that is: > > ... > result += (delta * delta); //This is (X0 - Y0)*(X0 - Y0) + (X1 - > Y1)*(X1 - Y1); the ACTUAL squared distance. > ... > > Moreover, the sum of every (centroidValue * centroidValue) in each > iteration (Y0*Y0 + Y1*Y1, in the examples) corresponds to > centroidLengthSquare, which was previously calculated before calling > distance(centroidLengthSquare, centroid, v) and then added to the > result. So, to sum up: first, a certain length is calculated (the > squared norm of what the signature from distance() calls "centroid"); > second, that SAME value gets calculated again in an another method and > subtracted from a certain total; last, those values cancel each other > by summing both totals, leaving just the wanted squared distance, > here: > > return centroidLengthSquare + v.getDistanceSquared(centroid); > > Which could be re-written as: > > return centroidLengthSquare + squaredDistanceBetweenCentroidAndV - > centroidLengthSquare; > > Maybe this has been done on purpose, though that purpose eludes me for now. > > /* END EXCURSUS */ > > And now, the fun part: code disrupting. Here are the changes I applied > (remember: just for testing performance's sake). I left the changed > bits commented. > > EuclideanDistanceMeasure.java > Renamed distance(double centroidLengthSquare, Vector centroid, Vector > v) to sparseDistance and eliminate the first param. We don't need the > sqrt to compare real distances for emitting points to clusters (but we > do need them to compute a cluster's convergence), so I took that out, > as well (I know the whole point of this class is to sqrt the results > from SquaredEuclideanDistance, but I needed the distance function > that's compromising performance to do a minimal set of calculations) . > > �...@override > public double sparseDistance(/*double centroidLengthSquare,*/ Vector > centroid, Vector v) { > return /*Math.sqrt(*/super.sparseDistance(/*centroidLengthSquare,*/ > centroid, v);//*); > } > > > SquaredEuclideanDistanceMeasure.java > Rewrote and renamed the implementation of distance(double > centroidLengthSquare, Vector centroid, Vector v). Ignored size check > (again: just for the benefit of performance). Just return the result > of getDistanceSquared(). > > @Override > public double sparseDistance(/*double centroidLengthSquare,*/ Vector > centroid, Vector v) { > /* if (centroid.size() != v.size()) { > throw new CardinalityException(); > } > > double result = centroidLengthSquare; > result += v.getDistanceSquared(centroid); > */ return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid); > } > > SparseVector.java > Refactor of getDistanceSquared(). Variables should not be created in a > loop, if there's no need to do so. Eliminate the centroidValue^2 > calculation in each iteration. > > @Override > public double getDistanceSquared(Vector v) { > //TODO: Check sizes? > > double result = 0.0; > Iterator<Vector.Element> iter = iterateNonZero(); > Vector.Element elt; > double delta; > > while (iter.hasNext()) { > elt = iter.next(); > delta = elt.get() - v.getQuick(elt.index()); > result += (delta * delta); //- (centroidValue * centroidValue); > } > return result; > } > > Cluster.java > Refactor of emitPointToNearestCluster(). null comparison eliminated > (not necessary anymore). sparseDistance() is called now, instead. > > ... > Cluster nearestCluster = null; > double nearestDistance = Double.MAX_VALUE; > double distance = 0; > for (Cluster cluster : clusters) { > distance = measure.sparseDistance(cluster.getCenter(), point); > if (distance < nearestDistance) { > nearestCluster = cluster; > nearestDistance = distance; > } > } > ... > > Add a Math.sqrt call to computeConvergence(), which now uses sparseDistance(). > > public boolean computeConvergence() { > Vector centroid = computeCentroid(); > converged = > Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/ > centroid, center)) <= convergenceDelta; > return converged; > } > > > After all those modifications, kMeans now runs noticeably faster. > Running the whole thing locally -in a pseudo-distributed cluster-, > every iteration is taking me about ~7 minutes to complete using the > very same dataset I uploaded and posted some days ago, which used to > last 3 hours. Again, this is not meant to be a final, closed solution > at all. But, I think it could very well serve as a first step towards > that. > -- http://www.bandhan.com/
