Johannes,

One thing that might be of real interest is something that I haven't tried,
nor read about. It should still be interesting.
The idea is L_1 regularized clustering. This will give you (hopefully)
sparse centroids that are still useful. In your example, where you want
defensible (aka salable) clusters, it would make the clusters much easier
to understand. It would also make them take less memory. (A rough sketch of
what this might look like is appended at the bottom of this message.)

On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte
<[email protected]> wrote:

> Ted, thanks for the long response! I agree with you on the benefit of a
> lot of clusters. The reason I chose 10 is not because I think there are
> truly 10 clusters in the data (there are probably thousands; this is
> e-commerce site behavioural data), but for technical reasons:
>
> - The cluster distances are used in a probability prediction model with
>   rare-events data, so I want every cluster to contain at least some
>   positive examples.
> - The cluster centroids need to be kept in memory online for real-time
>   feature vector generation and OLR scoring. The vectors are quite big
>   and there are ~100 clients to handle simultaneously (100 clients * 10
>   clusters * ~10000 non-zero features per cluster centroid * 8 bytes per
>   vector element).
> - When "selling" the clusters with visualizations, it's easier to plot
>   something with only 10 clusters.
>
> I'll give it another try with more clusters in the sketch phase and see
> what I can achieve. Thanks for your help!
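
As a quick back-of-the-envelope on the figures quoted above (nothing
measured, just the numbers as given): 100 clients * 10 clusters * ~10,000
non-zero features * 8 bytes works out to roughly 80 MB of dense centroid
data held in memory. Going to ~100 clusters per client, as suggested in the
quoted message below, would push that toward 800 MB unless the centroids
get sparser, which is exactly where the L_1 penalty above should help with
memory as well as interpretability.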

> On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <[email protected]> wrote:
>
> > On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte
> > <[email protected]> wrote:
> >
> >> Okay, understood. For a lot of clusters (which I don't necessarily
> >> attribute to big data problems, but definitely to nearest-neighbour-like
> >> usage of clusters), the "every cluster sees every point" approach
> >> doesn't scale well.
> >
> > Nearest neighbor sorts of problems. Or market segmentation kinds of
> > problems. Or feature generation kinds of problems. Or volume quantization
> > kinds of problems.
> >
> > The point is that with more data, you can derive a more detailed model.
> > That means more clusters.
> >
> > The ideal case is that we have an infinite mixture model to describe the
> > data distribution, but we haven't seen most of the mixture components
> > yet. As we get more data, we can justify saying we have more components.
> >
> > Even if you think that there are *really* 10 clusters in the data, I can
> > make the case that you are going to get a better unsupervised description
> > of those clusters by using 100 or more clusters in the k-means algorithm.
> > The idea is that each of your real clusters will be composed of several
> > of the k-means clusters, and finding the attachments is much easier
> > because 100 clusters is much smaller than the original number of samples.
> >
> > As a small data example, suppose you are looking at Anderson's Iris data,
> > which is built into R. If we plot the 150 data points against the
> > features we have in pairs, we can see various patterns and see that a
> > non-linear classifier should do quite well in separating the classes (I
> > hope the image makes it through the emailer):
> >
> > [image: Inline image 1]
> >
> > But if we try to do k-means on these data with only 3 clusters, we get
> > very poor assignment to the three species:
> >
> > > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > > table(iris$Species, k$cluster)
> >
> >              cluster
> >               1  2  3
> >   setosa     50  0  0
> >   versicolor  0 48  2
> >   virginica   0 14 36
> >
> > Cluster 1 captured the isolated setosa species rather well, but
> > versicolor and virginica are not well separated, because cluster 2 is
> > about 80% versicolor and 20% virginica.
> >
> > On the other hand, if we use 7 clusters:
> >
> > > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > > table(iris$Species, k$cluster)
> >
> >              cluster
> >               1  2  3  4  5  6  7
> >   setosa      0  0 28  0 22  0  0
> >   versicolor  0  7  0 20  0  0 23
> >   virginica  12  0  0  1  0 24 13
> >
> > Each cluster is now composed of almost exactly one species. Only cluster
> > 4 has any impurity, and it is 95% composed of just versicolor samples.
> >
> > What this means is that we can use the 7-cluster k-means results to build
> > a classifier that has a 1-of-7 input feature (the cluster id) instead of
> > 4 real values. That is, we have compressed the 4 original continuous
> > features down to about 2.7 bits on average, and this compressed
> > representation actually makes building a classifier nearly trivial.
> >
> > > -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
> > 2.670288
> >
> > This is pretty cool compression given that the original continuous data
> > had about 22 bits of raw information capacity based on the range and
> > precision given.
> >
> > Now, in the real world, we would need to hold out some data for
> > cross-validation of the clustering, but the fact remains that if you want
> > to use the k-means clustering for some machine-oriented purpose, it pays
> > to have as many clusters as you can have, subject to the held-out data
> > agreeing with the original data on distance distributions and counts for
> > the clusters.
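
To make the quoted point about the 1-of-7 cluster-id feature concrete, here
is a minimal sketch in R of about the simplest classifier you could build on
top of it. It assumes the k object from the 7-cluster run above and simply
maps each cluster id to the majority species seen in that cluster; in
practice you would learn this lookup on training data and score it on
held-out data. The names lookup and pred are made up for this sketch.

    # majority species per cluster: a 1-of-7 lookup table used as classifier
    lookup <- apply(table(k$cluster, iris$Species), 1,
                    function(counts) names(which.max(counts)))

    # predict from the cluster id alone, ignoring the 4 original features
    pred <- lookup[as.character(k$cluster)]

    # fraction of points whose cluster's majority label matches their species
    mean(pred == iris$Species)

Since k-means starts from random centers, the exact table varies from run to
run, but with an assignment like the one shown above most clusters map
cleanly onto a single species, so this lookup already classifies most of the
data correctly.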

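Since I said above that I haven't actually tried L_1 regularized clustering,
treat the following as a rough sketch rather than a recipe. It is ordinary
Lloyd-style k-means with one extra step: after each centroid is recomputed
as the mean of its points, the coordinates are soft-thresholded so that
small entries become exactly zero, which is what yields sparse (and smaller,
and easier to explain) centroids. The names l1_kmeans and soft_threshold are
made up for this sketch, it assumes a dense numeric matrix X, and it is not
anything that exists in Mahout; a production version for ~10,000-feature
sparse vectors would need sparse data structures and a smarter distance
computation.

    # soft-thresholding operator: shrinks values toward zero and sets
    # anything with magnitude below lambda to exactly zero
    soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

    # Lloyd-style k-means with an L1 penalty applied to the centroid updates
    l1_kmeans <- function(X, k, lambda, iters = 25) {
      centers <- X[sample(nrow(X), k), , drop = FALSE]
      cluster <- rep(1L, nrow(X))
      for (i in 1:iters) {
        # assignment step: nearest centroid by Euclidean distance
        d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
        cluster <- max.col(-d)
        # update step: cluster mean, then soft-threshold for sparsity
        for (j in 1:k) {
          pts <- X[cluster == j, , drop = FALSE]
          if (nrow(pts) > 0) {
            centers[j, ] <- soft_threshold(colMeans(pts), lambda)
          }
        }
      }
      list(centers = centers, cluster = cluster)
    }

For example, on the Iris data from the quoted example:

    fit <- l1_kmeans(scale(as.matrix(iris[, 1:4])), k = 7, lambda = 0.2)

Larger lambda values push more centroid coordinates to exactly zero, trading
a little fidelity for centroids that are cheaper to keep in memory and
easier to defend in front of a customer.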