Actually, I realize I was being a bit misleading. There are several concepts swirling around here that fall into two groups:

distance, metric, norm:

A *metric* is a real-valued, non-negative function of pairs of things that gives 0 only for identical points. A *distance* is a metric that satisfies the triangle inequality. A *norm* is a function that assigns lengths to vectors. If you have subtraction, then norm(x - y) is a distance. The term norm is often used loosely for functions where f(x - y) gives only a metric rather than a distance.

centroid, medoid, mean, median, average:

A *centroid* is a value z that minimizes sum_i norm(x_i - z), where norm is almost always L^2. A *medoid* is a single point x_m from a set of points that minimizes sum_i norm(x_i - x_m) for some norm (usually L^2). A *mean* is a one-dimensional centroid using L^2. A *median* is a one-dimensional centroid using L^1 instead of L^2; this value is not generally unique, so there are various methods for tie-breaking. If you have a continuous space (like we have been talking about), you can generalize centroids, but for L^1 the corresponding centroid can be costly to compute except in one dimension.

The place where I ran things off the rails was in conflating the centroid with the norm that defines the centroid. The L^2 norm is indeed computed as the root sum of squares for our purposes, but the centroid based on this norm is computed by averaging all of the points. The original algorithm was correct, except that numPoints must be the number of points involved in the centroid computation rather than the number of non-zero elements in a single point.
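To make that concrete, here is a minimal sketch of the corrected computation; the class and method names are mine for illustration, not what is in Mahout:

    import java.util.List;

    class CentroidSketch {
      // Average every coordinate over all points in the cluster.
      static double[] computeCentroid(List<double[]> points) {
        int dim = points.get(0).length;
        double[] centroid = new double[dim];
        for (double[] p : points) {
          for (int i = 0; i < dim; i++) {
            centroid[i] += p[i];
          }
        }
        // numPoints is the number of points in the cluster,
        // *not* the number of non-zero elements in any single point.
        int numPoints = points.size();
        for (int i = 0; i < dim; i++) {
          centroid[i] /= numPoints;
        }
        return centroid;
      }
    }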
It would indeed be a good idea to think about alternative clustering. The move from L^2 to L^1 is generally a good thing if you want to decrease the sometimes overwhelming impact of outliers, but it can make things much harder to compute. The move from L^2 to L^1 converts k-means into k-medoids, and I expect that Wikipedia's article on same will be pretty good.

There is also a probabilistic interpretation for all of this. For L^1 and L^2, you can interpret minimizing the distance in the centroid computation as finding the probability distribution under which the points have maximum likelihood. For L^2, the form of the distribution is the normal distribution: exp(-(x - mu)^2 / (2 s^2)). For L^1, the distribution is the so-called Laplace distribution: exp(-abs(x - mu) / s). In general, the Laplace distribution is fat-tailed, which is what gives it its outlier-forgiving nature.
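As a toy illustration of that outlier behavior (my example, not anything in our code): the mean is the one-dimensional L^2 centroid and the median is the one-dimensional L^1 centroid, and a single outlier drags the mean much further than the median:

    import java.util.Arrays;

    public class MeanVsMedian {
      public static void main(String[] args) {
        double[] data = {1.0, 2.0, 2.5, 3.0, 100.0};  // 100.0 is an outlier

        double sum = 0;
        for (double x : data) {
          sum += x;
        }
        double mean = sum / data.length;  // minimizes sum_i (x_i - z)^2

        double[] sorted = data.clone();
        Arrays.sort(sorted);
        // Lower-middle element; for even counts the median is not unique
        // and this is one common tie-break.
        double median = sorted[(sorted.length - 1) / 2];  // minimizes sum_i abs(x_i - z)

        System.out.println("mean   = " + mean);    // 21.7, pulled toward the outlier
        System.out.println("median = " + median);  // 2.5, barely moved
      }
    }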
Hope this clarification helps. Sorry for adding to confusion.

On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <[email protected]> wrote:

> I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
> search of "L2 norm" which turned up a number of other citations too. Since
> we already use L1 (Manhattan) and L2 (Euclidean) norms in the respective
> DistanceMeasures it might also be appropriate to allow the cluster centroid
> normalization to be pluggable. I had never considered that. Another mental
> lightbulb has just turned on. Thanks Ted.

-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)