Right. Up until now I've been helping myself with some min-df truncation, since I'm using tf-idf weighted vectors anyway and have the idf counts at hand. But a true loss-driven sparsification might be better.
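For illustration, here is a minimal sketch in R of the two options (hypothetical helper names, not Mahout code): min-df truncation based on document-frequency counts, versus soft-thresholding of the centroid weights, i.e. the proximal step that an L_1 penalty on the centroids would add to each update:

  # 1) min-df truncation: zero out centroid weights for terms seen in too few documents
  prune_min_df <- function(centroid, df, min_df = 5) {
    centroid[df < min_df] <- 0
    centroid
  }

  # 2) loss-driven sparsification: soft-thresholding (proximal operator of an L_1 penalty)
  soft_threshold <- function(centroid, lambda) {
    sign(centroid) * pmax(abs(centroid) - lambda, 0)
  }

  # toy example with a tiny dense centroid and per-term document frequencies
  centroid <- c(0.8, 0.01, -0.3, 0.002, 0.5)
  df       <- c(120, 3, 40, 2, 15)
  prune_min_df(centroid, df)      # drops the rare terms regardless of weight
  soft_threshold(centroid, 0.05)  # drops the small weights regardless of rarity

The soft-thresholding version is closer to the loss-driven idea: a weight survives only if it is large enough to pay for the penalty, independent of how rare the term is.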
On Sun, Dec 29, 2013 at 11:43 PM, Ted Dunning <[email protected]> wrote:

> Johannes,
>
> One thing that might be of real interest is something that I haven't tried, nor read about. It should still be interesting.
>
> The idea is L_1 regularized clustering. This will give you (hopefully) sparse centroids that still are useful. In your example, where you want defensible (aka salable) clusters, it would make the clusters much easier to understand. It would also make them take less memory.
>
>
> On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte <[email protected]> wrote:
>
> > Ted, thanks for the long response! I agree with you on the benefit of a lot of clusters. The reason I chose 10 is not because I think there are truly 10 clusters in the data (there are probably thousands; e-commerce site behavioural data), but for technical reasons:
> >
> > - the cluster distances are used in a probability prediction model with rare-events data, so I want every cluster to contain at least some positive examples
> > - the cluster centroids need to be kept in memory online for real-time feature vector generation and OLR scoring. The vectors are quite big and there are ~100 clients to handle simultaneously (100 clients * 10 clusters * ~10000 non-zero features per cluster centroid * 8 bytes per vector element)
> > - when "selling" the clusters, visualized, it's easier to plot something with only 10 clusters
> >
> > I'll give it another try with more clusters in the sketch phase and see what I can achieve. Thanks for your help!
> >
> >
> > On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <[email protected]> wrote:
> >
> > > On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <[email protected]> wrote:
> > >
> > >> Okay, understood. For a lot of clusters (which I don't necessarily attribute to big-data problems but definitely to nearest-neighbour-like usage of clusters), the "every cluster sees every point" approach doesn't scale well.
> > >
> > > Nearest neighbor sorts of problems. Or market segmentation kinds of problems. Or feature generation kinds of problems. Or volume quantization kinds of problems.
> > >
> > > The point is that with more data, you can derive a more detailed model. That means more clusters.
> > >
> > > The ideal case is that we have an infinite mixture model to describe the data distribution, but we haven't seen most of the mixture components yet. As we get more data, we can justify saying we have more components.
> > >
> > > Even if you think that there are *really* 10 clusters in the data, I can make the case that you are going to get a better unsupervised description of those clusters by using 100 or more clusters in the k-means algorithm. The idea is that each of your real clusters will be composed of several of the k-means clusters, and finding the attachments is much easier because 100 clusters is much smaller than the original number of samples.
> > >
> > > As a small data example, suppose you are looking at Anderson's Iris data, which is available built into R.
> > > If we plot the 150 data points against the features we have in pairs, we can see various patterns and see that a non-linear classifier should do quite well in separating the classes (I hope the image makes it through the emailer):
> > >
> > > [image: Inline image 1]
> > >
> > > But if we try to do k-means on these data with only 3 clusters, we get very poor assignment to the three species:
> > >
> > > > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > > > table(iris$Species, k$cluster)
> > >
> > >             cluster
> > >                1  2  3
> > >   setosa      50  0  0
> > >   versicolor   0 48  2
> > >   virginica    0 14 36
> > >
> > > Cluster 1 captured the isolated setosa species rather well, but versicolor and virginica are not well separated because cluster 2 has 80% of versicolor and 20% of virginica.
> > >
> > > On the other hand, if we use 7 clusters,
> > >
> > > > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > > > table(iris$Species, k$cluster)
> > >
> > >             cluster
> > >                1  2  3  4  5  6  7
> > >   setosa       0  0 28  0 22  0  0
> > >   versicolor   0  7  0 20  0  0 23
> > >   virginica   12  0  0  1  0 24 13
> > >
> > > Each cluster is now composed of almost exactly one species. Only cluster 4 has any impurity and it is 95% composed of just versicolor samples.
> > >
> > > What this means is that we can use the 7 cluster k-means results to build a classifier that has a 1 of 7 input feature (cluster id) instead of 4 real values. That is, we have compressed the 4 original continuous features down to about 2.7 bits on average and this compressed representation actually makes building a classifier nearly trivial.
> > >
> > > > -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)
> > > 2.670288
> > >
> > > This is pretty cool compression given that the original continuous data had about 22 bits of raw information capacity based on the range and precision given.
> > >
> > > Now, in the real world, we would need to hold out some data for cross validation of the clustering, but the fact remains that if you want to use the k-means clustering for some machine oriented purpose, it pays to have as many clusters as you can have, subject to the held-out data agreeing with the original data on distance distributions and counts for the clusters.
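A quick back-of-the-envelope check of the memory figure quoted above (assuming dense 8-byte values and ignoring sparse-vector index overhead and JVM object headers), and how it shrinks if sparser centroids cut the number of non-zeros:

  clients  <- 100
  clusters <- 10
  nnz      <- 10000                          # non-zero features per centroid
  clients * clusters * nnz * 8 / 2^20        # ~76 MB for the values alone
  clients * clusters * 1000 * 8 / 2^20       # ~7.6 MB if sparsification gets nnz down to ~1000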
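The inline image did not survive the archive; presumably it was the usual pairwise scatter plot of the four measurements, which can be reproduced with something like:

  # pairwise scatter plots of the four Iris measurements, colored by species
  pairs(iris[, 1:4], col = iris$Species, pch = 19)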
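To make the "nearly trivial classifier" point concrete, here is a small sketch (not from the thread) that uses the cluster id as the only feature, labelling each cluster with its majority species, plus the same entropy calculation written with log2:

  k <- kmeans(iris[, 1:4], centers = 7, nstart = 10)

  # majority species per cluster acts as the classifier
  counts        <- table(k$cluster, iris$Species)
  cluster_label <- apply(counts, 1, function(row) names(which.max(row)))
  predicted     <- cluster_label[as.character(k$cluster)]
  mean(predicted == iris$Species)   # training accuracy; exact value depends on the initialization

  # average code length of the cluster-id feature, ~2.7 bits as quoted above
  p <- table(k$cluster) / nrow(iris)
  -sum(p * log2(p))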
