I'll check things tomorrow at work, make the patch, and let you know. Thanks, Grant.
On Tue, Jul 28, 2009 at 5:29 PM, Grant Ingersoll <[email protected]> wrote:
> Funny, I did this same thing on the plane by revisiting Shashi's patch on
> MAHOUT-121 and properly applying it to K-Means (I missed one critical line
> that uses the centroid-based distance method instead of the (Vector, Vector)
> one). I think your data set ran, for 10 iterations, in just over 2 minutes,
> and that was with the profiler hooked up, too.
>
> I will post a patch on MAHOUT-121; can you take a look? Also, if you can
> post your changes as a patch, then we can likely compare/merge, etc. and
> come up with the best way of doing all of this.
>
> Thanks for the insight/analysis.
>
> -Grant
>
>
> On Jul 28, 2009, at 10:16 AM, nfantone wrote:
>
>> OK, this post is going to be a long one, so it deserves its own
>> thread. My apologies beforehand.
>>
>> Here's what I have tried to ease the distance-calculation problem. I
>> know it's quite nasty, but I wanted to make sure our bottleneck really
>> was there, in the Euclidean distance computation.
>>
>> But first, a little excursus:
>>
>> /* EXCURSUS */
>> The "optimized" version for SparseVectors of distance() in
>> SquaredEuclideanDistanceMeasure.java currently does the following:
>>
>> if (centroid.size() != v.size()) {
>>   throw new CardinalityException();
>> }
>>
>> double result = centroidLengthSquare;
>> result += v.getDistanceSquared(centroid);
>> return centroidLengthSquare + v.getDistanceSquared(centroid);
>>
>> It expects the centroid's squared length, which is finally added to what
>> getDistanceSquared() returns. However, that method computes its result
>> inside a while loop (for the comments, assume two 2D vectors [X0, X1]
>> and [Y0, Y1]):
>>
>> elt = iter.next();
>> centroidValue = v.getQuick(elt.index());
>> delta = elt.get() - centroidValue;
>> result += (delta * delta) - (centroidValue * centroidValue);
>> // This is (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1) - Y1*Y1;
>> // I don't have the slightest idea what to call this value.
>>
>> I certainly couldn't understand the reason behind that (centroidValue
>> * centroidValue) subtraction. Being called getDistanceSquared, the
>> method should return just that... and that is the value one gets by
>> simply ignoring the subtraction, that is:
>>
>> ...
>> result += (delta * delta);
>> // This is (X0 - Y0)*(X0 - Y0) + (X1 - Y1)*(X1 - Y1);
>> // the ACTUAL squared distance.
>> ...
>>
>> Moreover, the sum of every (centroidValue * centroidValue) term across
>> the iterations (Y0*Y0 + Y1*Y1 in the example) equals
>> centroidLengthSquare, which was calculated before calling
>> distance(centroidLengthSquare, centroid, v) and is then added to the
>> result. So, to sum up: first, a certain length is calculated (the
>> squared norm of what the signature of distance() calls "centroid");
>> second, that SAME value is calculated again in another method and
>> subtracted from a running total; last, the two values cancel each
>> other out when the totals are summed, leaving just the desired squared
>> distance, here:
>>
>> return centroidLengthSquare + v.getDistanceSquared(centroid);
>>
>> Which could be rewritten as:
>>
>> return centroidLengthSquare + squaredDistanceBetweenCentroidAndV -
>>   centroidLengthSquare;
>>
>> Maybe this has been done on purpose, though that purpose eludes me for
>> now.
>>
>> /* END EXCURSUS */
>>
>> And now, the fun part: code disrupting. Here are the changes I applied
>> (remember: just for testing performance's sake). I left the changed
>> bits commented out.
>>
>> EuclideanDistanceMeasure.java
>> Renamed distance(double centroidLengthSquare, Vector centroid, Vector
>> v) to sparseDistance() and eliminated the first parameter.
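The cancellation described in the excursus can be checked with a small standalone sketch (dense arrays, all indices iterated; the class and method names here are mine for illustration, not Mahout's):

```java
// Standalone sketch (not Mahout code): checks that the centroidLengthSquare
// term added in distance() exactly cancels the per-element
// (centroidValue * centroidValue) subtractions in getDistanceSquared(),
// leaving the plain squared Euclidean distance.
public class CancellationCheck {

    // Plain squared Euclidean distance: sum of (x_i - y_i)^2.
    static double plainSquaredDistance(double[] x, double[] y) {
        double result = 0.0;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            result += delta * delta;
        }
        return result;
    }

    // Mahout-style computation: the loop subtracts y_i^2 per element,
    // and the caller adds ||y||^2 back afterwards.
    static double mahoutStyleDistance(double[] x, double[] y) {
        double loopResult = 0.0;
        double centroidLengthSquare = 0.0;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            loopResult += (delta * delta) - (y[i] * y[i]);
            centroidLengthSquare += y[i] * y[i];
        }
        // The two ||y||^2 contributions cancel here.
        return centroidLengthSquare + loopResult;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {3.0, 5.0};
        System.out.println(plainSquaredDistance(x, y));  // 13.0
        System.out.println(mahoutStyleDistance(x, y));   // 13.0
    }
}
```

Note the cancellation is exact when every index is visited, as in this dense sketch.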
>> We don't need the sqrt to compare distances when emitting points to
>> clusters (though we do need it to compute a cluster's convergence), so
>> I took that out as well (I know the whole point of this class is to
>> sqrt the results from SquaredEuclideanDistanceMeasure, but I needed
>> the distance function that's compromising performance to do a minimal
>> set of calculations).
>>
>> @Override
>> public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) {
>>   return /*Math.sqrt(*/super.sparseDistance(/*centroidLengthSquare,*/ centroid, v); //*);
>> }
>>
>> SquaredEuclideanDistanceMeasure.java
>> Rewrote and renamed the implementation of distance(double
>> centroidLengthSquare, Vector centroid, Vector v). Skipped the size
>> check (again: just for the benefit of performance) and just returned
>> the result of getDistanceSquared().
>>
>> @Override
>> public double sparseDistance(/*double centroidLengthSquare,*/ Vector centroid, Vector v) {
>>   /* if (centroid.size() != v.size()) {
>>        throw new CardinalityException();
>>      }
>>
>>      double result = centroidLengthSquare;
>>      result += v.getDistanceSquared(centroid);
>>   */
>>   return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid);
>> }
>>
>> SparseVector.java
>> Refactored getDistanceSquared(). Variables should not be created
>> inside a loop if there's no need to do so. Eliminated the
>> centroidValue^2 calculation in each iteration.
>>
>> @Override
>> public double getDistanceSquared(Vector v) {
>>   // TODO: check sizes?
>>   double result = 0.0;
>>   Iterator<Vector.Element> iter = iterateNonZero();
>>   Vector.Element elt;
>>   double delta;
>>
>>   while (iter.hasNext()) {
>>     elt = iter.next();
>>     delta = elt.get() - v.getQuick(elt.index());
>>     result += (delta * delta); // - (centroidValue * centroidValue);
>>   }
>>   return result;
>> }
>>
>> Cluster.java
>> Refactored emitPointToNearestCluster(). The null comparison was
>> eliminated (no longer necessary), and sparseDistance() is called now
>> instead.
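The simplified loop can be sketched outside Mahout with a hypothetical Map-based sparse vector (class and representation are mine, not Mahout's SparseVector; like the refactor above, it iterates only the point's non-zero entries, so components where the point is zero but the other vector is not are not counted):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the post's simplified squared-distance loop over a hypothetical
// Map-based sparse vector (index -> value); not Mahout's SparseVector.
public class SparseDistanceSketch {

    static double getDistanceSquared(Map<Integer, Double> point,
                                     Map<Integer, Double> other) {
        double result = 0.0;
        double delta; // hoisted out of the loop, as in the post
        for (Map.Entry<Integer, Double> elt : point.entrySet()) {
            // Missing entries are treated as zero.
            double otherValue = other.getOrDefault(elt.getKey(), 0.0);
            delta = elt.getValue() - otherValue;
            result += delta * delta; // no centroidValue^2 subtraction
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> p = new HashMap<>();
        p.put(0, 1.0);
        p.put(3, 4.0);
        Map<Integer, Double> c = new HashMap<>();
        c.put(0, 3.0);
        c.put(3, 2.0);
        // (1-3)^2 + (4-2)^2 = 8.0
        System.out.println(getDistanceSquared(p, c));
    }
}
```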
>>
>> ...
>> Cluster nearestCluster = null;
>> double nearestDistance = Double.MAX_VALUE;
>> double distance = 0;
>> for (Cluster cluster : clusters) {
>>   distance = measure.sparseDistance(cluster.getCenter(), point);
>>   if (distance < nearestDistance) {
>>     nearestCluster = cluster;
>>     nearestDistance = distance;
>>   }
>> }
>> ...
>>
>> Added a Math.sqrt() call to computeConvergence(), which now uses
>> sparseDistance().
>>
>> public boolean computeConvergence() {
>>   Vector centroid = computeCentroid();
>>   converged = Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/ centroid, center)) <= convergenceDelta;
>>   return converged;
>> }
>>
>> After all those modifications, kMeans now runs noticeably faster.
>> Running the whole thing locally (in a pseudo-distributed cluster),
>> every iteration takes about 7 minutes to complete using the very same
>> dataset I uploaded and posted some days ago, where it used to take 3
>> hours. Again, this is not meant to be a final, closed solution at all.
>> But I think it could very well serve as a first step towards one.
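The reasoning behind dropping the sqrt can be illustrated with a standalone sketch (hypothetical names, plain arrays instead of Mahout's Vector and Cluster): since sqrt is monotonically increasing, comparing squared distances selects the same nearest cluster as comparing true distances, and sqrt only matters where an absolute threshold is applied, as in the convergence check.

```java
// Sketch (hypothetical names, not Mahout classes): nearest-cluster selection
// works on squared distances because sqrt preserves ordering; the convergence
// check needs the real distance, so sqrt is applied only there.
public class NearestBySquaredDistance {

    static double squaredDistance(double[] a, double[] b) {
        double result = 0.0;
        for (int i = 0; i < a.length; i++) {
            double delta = a[i] - b[i];
            result += delta * delta;
        }
        return result;
    }

    // Index of the nearest centroid, compared by squared distance only.
    static int nearest(double[] point, double[][] centroids) {
        int nearestIndex = -1;
        double nearestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double distance = squaredDistance(point, centroids[i]);
            if (distance < nearestDistance) {
                nearestIndex = i;
                nearestDistance = distance;
            }
        }
        return nearestIndex;
    }

    // Convergence-style check: the absolute distance matters here, so sqrt.
    static boolean converged(double[] centroid, double[] center, double delta) {
        return Math.sqrt(squaredDistance(centroid, center)) <= delta;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        System.out.println(nearest(new double[]{1.0, 1.0}, centroids)); // 0
        // sqrt(9) = 3 > 0.5, so not converged
        System.out.println(converged(new double[]{0.0, 3.0},
                                     new double[]{0.0, 0.0}, 0.5));
    }
}
```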
