2011/11/4 Grant Ingersoll <[email protected]> > > Hi All I'm reading the code of SquaredEuclideanDistanceMeasure, the > "distance(double centroidLengthSquare, Vector centroid, Vector v)" method > confused me a lot, i don't know why we choose this expression > "centroidLengthSquare - 2 * v.dot(centroid) + v.getLengthSquared()" to > calculate the distance? > > Math is as Sebastian said, the purpose is that it can save some steps in > the computation, for instance, see the KMeans implementation where you are > often calculating the distance between some point and a centroid. We have > the centroidLengthSquare handy, so it saves a fair amount of work.
Grant is right, but he is soft-pedaling the savings. c^2 is known ahead. v is very, very sparse. Thus c v is very sparse and has the same sparsity pattern as v^2 (which can even be cached). On the other hand c-v is much less sparse. The difference looks minor when you first write it down, but it can easily be a factor of 1000 or more.
