On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Okay, understood. For a lot of clusters (which I don't necessarily
> attribute to big data problems, but definitely to nearest-neighbour-like
> usage of clusters), the "every cluster sees every point" approach doesn't scale well.
>


Nearest neighbor sorts of problems.  Or market segmentation kinds of
problems.  Or feature generation kinds of problems.  Or volume quantization
kinds of problems.

The point is that with more data, you can derive a more detailed model.
 That means more clusters.

The ideal case is that we have an infinite mixture model to describe the
data distribution, but we haven't seen most of the mixture components yet.
 As we get more data, we can justify saying we have more components.

Even if you think that there are *really* 10 clusters in the data, I can
make the case that you will get a better unsupervised description of those
clusters by using 100 or more clusters in the k-means algorithm.  The idea
is that each of your real clusters will be composed of several of the
k-means clusters, and working out which k-means clusters attach to which
real cluster is much easier than the original problem because 100 clusters
is far fewer than the original number of samples.

As a small-data example, suppose you are looking at Anderson's Iris data,
which is built into R.  If we plot the 150 data points against all pairs of
the four features, we can see various patterns and see that a non-linear
classifier should do quite well in separating the classes (I hope the image
makes it through the emailer):

[image: pairs plot of the four iris measurements]
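In case the image does not survive the trip, a plot along these lines can
be reproduced in R with something like the following (the colouring by
species is my addition, just to make the class structure visible):

# scatterplot matrix of the four measurements, coloured by species
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19)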

But if we try to do k-means on these data with only 3 clusters, we get very
poor assignment to the three species:


> k = kmeans(iris[,1:4], centers=3, nstart=10)
> table(iris$Species, k$cluster)

              cluster
              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

Cluster 1 captures the isolated setosa species perfectly, but versicolor
and virginica are not well separated: cluster 2 is a mix of roughly 80%
versicolor and 20% virginica.

On the other hand, if we use 7 clusters,


> k = kmeans(iris[,1:4], centers=7, nstart=10)
> table(iris$Species, k$cluster)

              cluster
              1  2  3  4  5  6  7
  setosa      0  0 28  0 22  0  0
  versicolor  0  7  0 20  0  0 23
  virginica  12  0  0  1  0 24 13

Most of the clusters are now composed of exactly one species.  Cluster 4
is 95% versicolor, and only cluster 7 still mixes versicolor and virginica
to any substantial degree.
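(A quick way to check this sort of purity, i.e. the fraction of each
cluster made up of its most common species, is something along these lines:)

> apply(prop.table(table(iris$Species, k$cluster), margin = 2), 2, max)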

What this means is that we can use the 7-cluster k-means result to build a
classifier whose input is a single 1-of-7 categorical feature (the cluster
id) instead of 4 real values.  That is, we have compressed the 4 original
continuous features down to about 2.7 bits on average, and this compressed
representation actually makes building a classifier nearly trivial.

> # entropy of the cluster sizes, in bits per point
> -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
2.670288

This is pretty cool compression given that the original continuous data had
about 22 bits of raw information capacity based on the range and precision
given.
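To make the "nearly trivial" part concrete, here is one way such a
classifier could look in R.  This is my sketch, not part of the original
analysis; it just maps each cluster id to the majority species within that
cluster:

tab <- table(iris$Species, k$cluster)
majority <- rownames(tab)[apply(tab, 2, which.max)]  # species label per cluster

# classify every point by the majority label of its cluster
predicted <- majority[k$cluster]
mean(predicted == iris$Species)                      # training accuracy

On the 7-cluster table above, that lookup already gets about 90% of the
training points right, with all of the errors coming from the
versicolor/virginica overlap.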

Now, in the real world, we would need to hold out some data to
cross-validate the clustering, but the fact remains that if you want to use
the k-means clustering for some machine-oriented purpose, it pays to have
as many clusters as you can, subject to the held-out data agreeing with the
training data on the distance distributions and the per-cluster counts.
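As a rough illustration of that check (my sketch; the hold-out size, the
seed and k = 7 are arbitrary choices), you can fit on part of the data,
assign the held-out points to their nearest centres, and compare counts and
distances:

set.seed(1)
hold <- sample(nrow(iris), 50)                 # hold out a third of the points
k2 <- kmeans(iris[-hold, 1:4], centers = 7, nstart = 10)

# distances from each held-out point to each of the 7 centres
d <- as.matrix(dist(rbind(as.data.frame(k2$centers), iris[hold, 1:4])))[-(1:7), 1:7]

# do the held-out points fill the clusters in roughly the training proportions?
rbind(train    = k2$size / sum(k2$size),
      held_out = tabulate(apply(d, 1, which.min), nbins = 7) / length(hold))

# and are they about as close to their nearest centre as we would expect?
summary(apply(d, 1, min))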
