On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte < johannes.schu...@gmail.com> wrote:
> Okay, understood. For a lot of clusters (which i don't necessarily
> attribute to big data problems but definetly to nearest neighbour like
> usage of clusters), the every "cluster sees every point" doesnt scale well.

Nearest neighbor sorts of problems. Or market segmentation kinds of problems. Or feature generation kinds of problems. Or vector quantization kinds of problems.

The point is that with more data, you can derive a more detailed model. That means more clusters. The ideal case is that we have an infinite mixture model describing the data distribution, but we haven't seen most of the mixture components yet. As we get more data, we can justify saying we have more components.

Even if you think that there are *really* 10 clusters in the data, I can make the case that you are going to get a better unsupervised description of those clusters by using 100 or more clusters in the k-means algorithm. The idea is that each of your real clusters will be composed of several of the k-means clusters, and finding those attachments is much easier because 100 clusters is a much smaller set than the original number of samples.

As a small-data example, suppose you are looking at Anderson's iris data, which is available built into R. If we plot the 150 data points against the features in pairs, we can see various patterns and see that a non-linear classifier should do quite well in separating the classes (I hope the image makes it through the emailer):

[image: Inline image 1]

But if we try to do k-means on these data with only 3 clusters, we get very poor assignment to the three species:

    > k = kmeans(iris[,1:4], centers=3, nstart=10)
    > table(iris$Species, k$cluster)

                 cluster
                  1  2  3
      setosa     50  0  0
      versicolor  0 48  2
      virginica   0 14 36

Cluster 1 captured the isolated setosa species rather well, but versicolor and virginica are not well separated: cluster 2 is roughly 80% versicolor and 20% virginica.
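To make the "many small clusters, then attach them to real clusters" idea concrete, here is a minimal sketch in Python (standard library only, since it can be run standalone). It uses made-up 1-D Gaussian data with 2 real classes rather than the iris set, deliberately over-clusters with 8 centers, and then attaches each small cluster to a class by majority vote:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical data: two "real" classes as 1-D Gaussians (not the iris set).
data = [(random.gauss(0.0, 1.0), "a") for _ in range(100)] + \
       [(random.gauss(6.0, 1.0), "b") for _ in range(100)]
points = [x for x, _ in data]

def kmeans_1d(xs, k, iters=50):
    """Plain Lloyd's algorithm in one dimension."""
    centers = random.sample(xs, k)
    assign = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign

# Deliberately use far more clusters (8) than real classes (2).
centers, assign = kmeans_1d(points, k=8)

# Attach each k-means cluster to a real class by majority vote --
# an easy job, since there are only 8 clusters to look at.
attach = {}
for j in range(8):
    labels = [lab for (x, lab), a in zip(data, assign) if a == j]
    if labels:
        attach[j] = Counter(labels).most_common(1)[0][0]

# Classify every point via its cluster's attachment and measure accuracy.
correct = sum(1 for (x, lab), a in zip(data, assign) if attach.get(a) == lab)
accuracy = correct / len(data)
```

On well-separated data like this, the cluster-id plus the 8-entry attachment table recovers the class labels almost perfectly, which is the point: the small clusters act as a cheap intermediate representation between raw points and real classes.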
On the other hand, if we use 7 clusters,

    > k = kmeans(iris[,1:4], centers=7, nstart=10)
    > table(iris$Species, k$cluster)

                 cluster
                  1  2  3  4  5  6  7
      setosa      0  0 28  0 22  0  0
      versicolor  0  7  0 20  0  0 23
      virginica  12  0  0  1  0 24 13

each cluster is now composed of almost exactly one species. Only cluster 4 has any impurity, and it is 95% composed of just versicolor samples.

What this means is that we can use the 7-cluster k-means results to build a classifier that has a 1-of-7 input feature (the cluster id) instead of 4 real values. That is, we have compressed the 4 original continuous features down to about 2.7 bits on average, and this compressed representation actually makes building a classifier nearly trivial:

    > -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
    2.670288

This is pretty cool compression given that the original continuous data had about 22 bits of raw information capacity based on the range and precision given.

Now, in the real world, we would need to hold out some data for cross-validation of the clustering, but the fact remains that if you want to use the k-means clustering for some machine-oriented purpose, it pays to have as many clusters as you can have, subject to the held-out data agreeing with the original data on distance distributions and counts for the clusters.
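The R one-liner above is just the Shannon entropy of the cluster-id distribution. Here is the same arithmetic as a sketch in Python (standard library only), assuming the cluster sizes are the column sums of the 7-cluster table above:

```python
import math

# Points per cluster, read off the 7-cluster table (column sums); 150 total.
sizes = [12, 7, 28, 21, 22, 24, 36]
total = sum(sizes)

# Shannon entropy of the cluster-id distribution, in bits:
#   H = -sum_j p_j * log2(p_j),  p_j = n_j / 150
H = -sum(n / total * math.log2(n / total) for n in sizes)
# H comes out to about 2.67 bits, matching the R result.
```

Note this is the average code length for the cluster id, which is why it can be fractional even though any single id takes 1 of 7 values.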