Clustering in the k-means style can be viewed as fitting a mixture model to
the data.  A mixture model is a generative model of random data in which you
first pick one of n sub-models and then generate a point from that
sub-model.  The parameters are the probabilities of picking each sub-model
(the mixing parameters) plus the parameters of each sub-model.

With k-means, you don't explicitly compute the mixing parameters, and you
assume that the sub-models are symmetric Gaussians with a fixed, equal
variance.

One of the simplest algorithms for fitting such a model is the so-called
E-M (expectation-maximization) algorithm.  For mixture distributions, this
algorithm breaks into two passes: the first assigns data points to
sub-models, and the second re-estimates each sub-model from the data
assigned to it.  This algorithm is easy to implement and works well in some
cases, particularly for k-means.
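For the k-means special case, the two passes reduce to the familiar
assign-to-nearest-mean and recompute-the-mean steps.  A minimal sketch,
assuming the same kind of 1-D data list as above:

    def kmeans_step(data, means):
        # pass 1: assign each point to the closest sub-model (cluster)
        assignments = [min(range(len(means)), key=lambda i: (x - means[i]) ** 2)
                       for x in data]
        # pass 2: re-estimate each sub-model from the points assigned to it
        new_means = []
        for i in range(len(means)):
            members = [x for x, a in zip(data, assignments) if a == i]
            new_means.append(sum(members) / len(members) if members else means[i])
        return new_means

    means = [-1.0, 0.0, 1.0]
    for _ in range(10):            # iterate the two passes until stable
        means = kmeans_step(data, means)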

Unfortunately, the E-M algorithm as it stands performs what is called a
maximum likelihood fit.  If you take k-means and try to use a more elaborate
model than identical Gaussians, maximum likelihood instantly causes horrible
problems: if you allow the variance to vary, the algorithm will position
some clusters on a single point and drive the variance to zero.  The result
is some point-like clusters that each describe a single observation and a
few diffuse clusters to handle the rest of the data.  Generalization goes
out the window because the point-like clusters won't describe new data at
all.
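You can see the degeneracy in a toy computation: a Gaussian centered
exactly on one observation has a log-density at that point that grows
without bound as its variance shrinks, so maximum likelihood happily drives
the variance of such a point-like cluster to zero.

    import math

    def gaussian_log_density(x, mean, sigma):
        return (-0.5 * math.log(2 * math.pi * sigma ** 2)
                - (x - mean) ** 2 / (2 * sigma ** 2))

    for sigma in [1.0, 0.1, 0.01, 0.001]:
        # log-density of a point sitting exactly at the cluster mean
        print(sigma, gaussian_log_density(0.0, 0.0, sigma))
    # the printed log-likelihood keeps increasing as sigma shrinks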

So... back to your question.  k-means, LDA and the Dirichlet clustering are
all implementations of E-M algorithms for fitting mixture distributions (or
first cousins of E-M that avoid the max-likelihood traps).  This allows very
nice mathematical frameworks to be brought to bear on the problem of how to
make these algorithms work well.  For the second question, the distinction
is based on the fact that k-means is pretty inflexible with regard to the
sub-models because of the maximum likelihood problems.  The Dirichlet
clustering tackles this problem head-on and thus allows much more general
models.

This transition from purely distance-based metaphors to probabilistic
mixture models as a fundamental metaphor for clustering is a difficult one
to make.  You can work to interpret the mixture models in terms of the
distance-based metaphor, but I find that counter-productive because the
analogies used lead to some very dangerous algorithms without obvious
repair.  Viewing the problem as a statistical learning problem not only
makes the extension of the algorithms to new domains easier, but it also
provides nice fixes to the problems encountered.  If you want to stick with
the distance (similarity) based metaphor, you can view the probability
p(X | m_i) of an observed data point X given the i-th model m_i as the
similarity of the data X to model m_i.  Then the estimation of the new
version of each model from a number of data points can be considered the
analog of the centroid computation in k-means.
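In code, that reading looks roughly like the following sketch: the
per-model densities act as (unnormalized) similarities, and the new mean of
each model is a similarity-weighted average, which plays the role of the
k-means centroid.  Again, 1-D Gaussians and made-up structure, just to show
the shape of the computation:

    import math

    def gaussian_density(x, mean, sigma):
        # density of a 1-D symmetric Gaussian sub-model
        return (math.exp(-(x - mean) ** 2 / (2 * sigma ** 2))
                / math.sqrt(2 * math.pi * sigma ** 2))

    def responsibilities(x, means, sigmas, mixing):
        # p(X | m_i) scaled by the mixing weight, normalized to sum to one
        weights = [p * gaussian_density(x, m, s)
                   for p, m, s in zip(mixing, means, sigmas)]
        total = sum(weights)
        return [w / total for w in weights]

    def update_means(data, means, sigmas, mixing):
        # similarity-weighted average: the analog of the centroid step
        k = len(means)
        num, den = [0.0] * k, [0.0] * k
        for x in data:
            r = responsibilities(x, means, sigmas, mixing)
            for i in range(k):
                num[i] += r[i] * x
                den[i] += r[i]
        return [num[i] / den[i] if den[i] > 0 else means[i] for i in range(k)]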

On Sun, Jan 3, 2010 at 8:10 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote:
>
>
> > - can the clustering algorithm be viewed in a probabilistic framework
> > (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> > neighbors = not so much)
> >
> > - is the definition of a cluster abstract enough to be flexible with
> regard
> > to whether a cluster is a model or does it require stronger limits.
> > (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
> > probabilistic model)
>
> Can you elaborate a bit more on these two?  I can see a bit on the
> probability side, as those approaches play a factor in how similarity is
> determined, but I don't get the significance of "cluster as a model".  Is it
> just a simplification that then makes it easier to ask: does this document
> fit into the model?
>
> -Grant

-- 
Ted Dunning, CTO
DeepDyve