Thanks for the clarifications; they make the nomenclature much clearer and expose its composition. This is a great thread, and I'd like to incorporate this information into the wiki so it's easier for readers to find going forward.

Jeff


Ted Dunning wrote:
Actually, I realize I was being a bit misleading.  There are several
concepts swirling around here that fall into two groups:

distance, metric, norm:

A *metric* is some real-valued, non-negative function of pairs of things.  A
metric should only give 0 for identical points.

A *distance* is a metric that satisfies the triangle inequality.

A *norm* is a function that assigns lengths to vectors.  If you have
subtraction, then norm(x-y) is a distance.  Often, the term norm is used
loosely for functions f where f(x-y) only gives a metric rather than a
distance.
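
To make that relationship concrete, here is a minimal sketch in plain Java
of the L1 and L2 norms and the distance a norm induces via norm(x - y).
This is illustrative only, not Mahout code, and the names are made up.

    // Illustrative only, not Mahout code: the L1 and L2 norms on double[]
    // vectors, and the distance a norm induces via norm(x - y).
    public final class NormSketch {

      // L1 (Manhattan) norm: sum of the absolute values of the components.
      static double l1Norm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
          sum += Math.abs(x);
        }
        return sum;
      }

      // L2 (Euclidean) norm: square root of the sum of squared components.
      static double l2Norm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
          sum += x * x;
        }
        return Math.sqrt(sum);
      }

      // A norm gives a distance between two points: d(x, y) = norm(x - y).
      static double l2Distance(double[] x, double[] y) {
        double[] diff = new double[x.length];
        for (int i = 0; i < x.length; i++) {
          diff[i] = x[i] - y[i];
        }
        return l2Norm(diff);
      }
    }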

centroid, medoid, mean, median, average:

A *centroid* is a value z that minimizes sum_i norm(x_i - z), where the
norm is almost always L^2.

A *medoid* is a single point x_m from a set of points that minimizes
sum_i norm(x_i - x_m) for some norm (usually L^2).

A *mean* is a one-dimensional centroid using L^2.

A *median* is a one-dimensional centroid using L^1 instead of L^2.  This
value is not normally unique, so there are various methods for
tie-breaking.

If you have a continuous space (like we have been talking about), you can
generalize centroids, but for L^1, the corresponding centroid can be costly
to compute except in one dimension.
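
For a concrete picture of centroid versus medoid, here is a small sketch
(plain Java again, not Mahout code) that computes both for a set of points
under the L^2 norm, reusing l2Distance from the sketch above.

    // Illustrative only: the L2 centroid (per-coordinate average of all
    // points) versus the medoid (the member point with the smallest summed
    // distance to all other members).
    static double[] centroid(double[][] points) {
      int dim = points[0].length;
      double[] center = new double[dim];
      for (double[] p : points) {
        for (int i = 0; i < dim; i++) {
          center[i] += p[i];
        }
      }
      for (int i = 0; i < dim; i++) {
        center[i] /= points.length;          // divide by the number of points
      }
      return center;
    }

    static double[] medoid(double[][] points) {
      double[] best = points[0];
      double bestCost = Double.POSITIVE_INFINITY;
      for (double[] candidate : points) {
        double cost = 0.0;
        for (double[] p : points) {
          cost += l2Distance(candidate, p);  // helper from the sketch above
        }
        if (cost < bestCost) {
          bestCost = cost;
          best = candidate;
        }
      }
      return best;
    }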

The place where I ran things off the rails was in conflating the centroid
with the norm that defines it.  The L^2 norm is indeed computed as the
root sum of squares for our purposes.  But the centroid based on this norm
is computed by averaging all of the points, and the original algorithm was
correct except that numPoints must be the number of points involved in the
centroid computation rather than the number of non-zero elements in a
single point.
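
A hedged sketch of the fix being described, with made-up names rather than
the actual Mahout code: accumulate coordinate totals over every point
assigned to the cluster and divide by the count of those points, not by
the number of non-zero elements in any single point.

    // Illustrative only: recompute a cluster center by averaging the
    // assigned points; numPoints counts points, not non-zero elements.
    static double[] recomputeCentroid(java.util.List<double[]> assignedPoints,
                                      int dim) {
      double[] total = new double[dim];
      int numPoints = 0;
      for (double[] p : assignedPoints) {
        for (int i = 0; i < dim; i++) {
          total[i] += p[i];
        }
        numPoints++;                     // count every point, dense or sparse
      }
      double[] center = new double[dim];
      for (int i = 0; i < dim; i++) {
        center[i] = total[i] / numPoints;
      }
      return center;
    }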

It would indeed be a good idea to think about alternative clustering
approaches.  The move from L^2 to L^1 is generally a good thing if you want
to decrease the sometimes overwhelming impact of outliers, but it can make
things much harder to compute.  The move from L^2 to L^1 converts k-means
to k-medoids, and I expect that Wikipedia's article on it will be pretty
good.
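
As a toy illustration of the outlier point (my example, not from the
thread): for the one-dimensional values {1, 2, 3, 100}, the L^2 centroid
(mean) is 26.5, dragged far toward the outlier, while the L^1 centroid
(median) is 2.5.

    // Toy example: mean versus median of {1, 2, 3, 100}.
    public static void main(String[] args) {
      double[] values = {1, 2, 3, 100};
      double mean = 0.0;
      for (double v : values) {
        mean += v;
      }
      mean /= values.length;                        // 26.5
      double[] sorted = values.clone();
      java.util.Arrays.sort(sorted);
      double median = (sorted[1] + sorted[2]) / 2;  // 2.5 for four values
      System.out.println("mean=" + mean + " median=" + median);
    }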

There is also a probabilistic interpretation for all of this.  For L^1 and
L^2, you can interpret minimizing the distance in the centroid computation
as finding a probability distribution that has maximum likelihood.  For
L^2, the form of the distribution is the normal distribution:
exp(-(x - mu)^2/(2s^2)).  For L^1, the distribution is the so-called
Laplace distribution: exp(-abs(x - mu)/s).  In general, the Laplace
distribution is fat-tailed, which is what gives it its outlier-forgiving
nature.
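
Spelling out that link (a standard derivation, not in the original mail):
for n points, the negative log of the normal likelihood is
sum_i (x_i - mu)^2/(2s^2) plus a constant, so the mu with maximum
likelihood is exactly the one that minimizes the summed squared distances,
i.e. the mean.  For the Laplace density the negative log-likelihood is
sum_i abs(x_i - mu)/s plus a constant, so the maximizer is the minimizer
of the summed absolute distances, i.e. the median.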

Hope this clarification helps.  Sorry for adding to the confusion.

On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <[email protected]> wrote:

I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
search for "L2 norm", which turned up a number of other citations too. Since
we already use the L1 (Manhattan) and L2 (Euclidean) norms in the respective
DistanceMeasures, it might also be appropriate to allow the cluster centroid
normalization to be pluggable. I had never considered that. Another mental
lightbulb has just turned on. Thanks, Ted.
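
One way that pluggability might be sketched, with purely hypothetical names
(this is not Mahout's API, just an illustration of the idea):

    // Hypothetical sketch only; the interface and method names are made up.
    public interface CentroidComputer {
      // Compute the cluster center for the given points under some norm.
      double[] computeCentroid(java.util.List<double[]> points);
    }

    // An L2 implementation would average the points; an L1 implementation
    // could use a per-coordinate median (one common choice).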




