Ted, thanks for the long response! I agree with you on the benefit of a lot of clusters. The reason I chose 10 is not because I think there are truly 10 clusters in the data (there are probably thousands; it's e-commerce site behavioural data), but for technical reasons:
- the cluster distances are used in a probability prediction model with rare-events data, so I want every cluster to contain at least some positive examples
- the cluster centroids need to be kept in memory online for real-time feature vector generation and OLR scoring, and the vectors are quite big with ~100 clients to handle simultaneously (100 clients * 10 clusters * ~10000 non-zero features per cluster centroid * 8 bytes per vector element ≈ 80 MB, growing linearly with the number of clusters)
- when "selling" the clusters, visualized, it's easier to plot something with only 10 clusters

I'll give it another try with more clusters in the sketch phase and see what I can achieve. Thanks for your help!

On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <[email protected]> wrote:

> On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> [email protected]> wrote:
>
>> Okay, understood. For a lot of clusters (which I don't necessarily
>> attribute to big data problems but definitely to nearest-neighbour-like
>> usage of clusters), the "every cluster sees every point" approach
>> doesn't scale well.
>
> Nearest neighbor sorts of problems. Or market segmentation kinds of
> problems. Or feature generation kinds of problems. Or volume quantization
> kinds of problems.
>
> The point is that with more data, you can derive a more detailed model.
> That means more clusters.
>
> The ideal case is that we have an infinite mixture model to describe the
> data distribution, but we haven't seen most of the mixture components yet.
> As we get more data, we can justify saying we have more components.
>
> Even if you think that there are *really* 10 clusters in the data, I can
> make the case that you are going to get a better unsupervised description
> of those clusters by using 100 or more clusters in the k-means algorithm.
> The idea is that each of your real clusters will be composed of several of
> the k-means clusters, and finding the attachments is much easier because
> 100 clusters is much smaller than the original number of samples.
>
> As a small data example, suppose you are looking at Anderson's Iris data,
> which is available built into R. If we plot the 150 data points against
> the features we have in pairs, we can see various patterns and see that a
> non-linear classifier should do quite well in separating the classes (I
> hope the image makes it through the emailer):
>
> [image: Inline image 1 - pairs plot of the four iris measurements]
>
> But if we try to do k-means on these data with only 3 clusters, we get
> very poor assignment to the three species:
>
> > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > table(iris$Species, k$cluster)
>
>              cluster
>               1  2  3
>   setosa     50  0  0
>   versicolor  0 48  2
>   virginica   0 14 36
>
> Cluster 1 captured the isolated setosa species rather well, but versicolor
> and virginica are not well separated because cluster 2 has 80% of
> versicolor and 20% of virginica.
>
> On the other hand, if we use 7 clusters,
>
> > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > table(iris$Species, k$cluster)
>
>              cluster
>               1  2  3  4  5  6  7
>   setosa      0  0 28  0 22  0  0
>   versicolor  0  7  0 20  0  0 23
>   virginica  12  0  0  1  0 24 13
>
> Each cluster is now composed of almost exactly one species. Only cluster 4
> has any impurity, and it is 95% composed of just versicolor samples.
>
> What this means is that we can use the 7-cluster k-means results to build
> a classifier that has a 1-of-7 input feature (cluster id) instead of 4
> real values.
> That is, we have compressed the 4 original continuous features down to
> about 2.7 bits on average, and this compressed representation actually
> makes building a classifier nearly trivial.
>
> > -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
> 2.670288
>
> This is pretty cool compression given that the original continuous data
> had about 22 bits of raw information capacity based on the range and
> precision given.
>
> Now, in the real world, we would need to hold out some data for cross
> validation of the clustering, but the fact remains that if you want to use
> the k-means clustering for some machine-oriented purpose, it pays to have
> as many clusters as you can have, subject to the held-out data agreeing
> with the original data on distance distributions and counts for the
> clusters.
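P.S. To make sure I follow the 1-of-7 feature step, here is roughly how I would wire the cluster id into a classifier in R (my own sketch, not code from your mail; nnet::multinom is just one convenient model choice, and the hold-out question is ignored here):

# my sketch of the cluster-id-as-feature idea (not from the mail above)
library(nnet)                                 # multinom() gives a simple multinomial classifier

set.seed(1)
k <- kmeans(iris[, 1:4], centers = 7, nstart = 10)

d <- data.frame(Species = iris$Species,
                cluster = factor(k$cluster))  # the 1-of-7 feature, nothing else

fit <- multinom(Species ~ cluster, data = d, trace = FALSE)
table(iris$Species, predict(fit, d))          # confusion matrix on the training data

If the clusters come out as pure as in your 7-cluster run, this tiny model should already separate the species almost perfectly.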

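And the held-out check at the end I read like this (again my own sketch, with an arbitrary 100/50 split): fit k-means on a training split, assign the held-out points to the nearest learned centroid, and compare cluster proportions and distance-to-centroid distributions between the two splits.

set.seed(2)
idx   <- sample(nrow(iris), 100)              # 100 training rows, 50 held out
train <- as.matrix(iris[idx, 1:4])
held  <- as.matrix(iris[-idx, 1:4])

k <- kmeans(train, centers = 7, nstart = 10)

# distance from every held-out point to every centroid; the nearest one wins
d2 <- as.matrix(dist(rbind(k$centers, held)))[-(1:7), 1:7]
nearest <- apply(d2, 1, which.min)

# cluster proportions should roughly agree between the two splits
rbind(train = as.numeric(table(factor(k$cluster, levels = 1:7))) / nrow(train),
      held  = as.numeric(table(factor(nearest,   levels = 1:7))) / nrow(held))

# and so should the distance distributions (RMS within-cluster distance on the
# training side vs. nearest-centroid distances of the held-out points)
summary(sqrt(k$withinss / as.numeric(table(k$cluster))))
summary(apply(d2, 1, min))

If the two rows or the two distance summaries disagree badly, that would be the sign to back off on the number of clusters, as I understand it.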