Right. Up until now I've been helping myself with some minDf truncation, since I
am using tf-idf weighted vectors anyway and have the idf counts at hand.
But a true loss-driven sparsification might be better.
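
A minimal sketch of the difference in R, assuming a dense tf-idf centroid with
its term document frequencies at hand; the toy data and the min_df and lambda
knobs below are purely illustrative, not Mahout code:

# Toy comparison of df-based truncation vs. L1-style (soft-threshold)
# sparsification of a tf-idf weighted centroid.
set.seed(1)
terms    <- paste0("t", 1:20)
centroid <- setNames(abs(rnorm(20, sd = 0.3)), terms)   # toy tf-idf centroid weights
df       <- setNames(sample(1:500, 20), terms)          # toy document frequencies

# (a) df-based truncation as described above: drop coordinates whose document
#     frequency is below minDf, regardless of how much weight they carry
min_df  <- 50
kept_df <- centroid[df >= min_df]

# (b) loss-driven alternative: soft-threshold the weights, so a coordinate is
#     zeroed only when keeping it improves the L1-penalized loss by less than lambda
lambda  <- 0.2
shrunk  <- sign(centroid) * pmax(abs(centroid) - lambda, 0)
kept_l1 <- shrunk[shrunk != 0]

length(kept_df); length(kept_l1)   # how many coordinates survive each scheme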


On Sun, Dec 29, 2013 at 11:43 PM, Ted Dunning <[email protected]> wrote:

> Johannes,
>
> One thing that might be of real interest is something that I haven't tried,
> nor read about.  It should still be interesting.
>
> The idea is L_1 regularized clustering.  This will (hopefully) give you
> sparse centroids that are still useful.  In your example, where you want
> defensible (aka salable) clusters, it would make the clusters much easier
> to understand.  It would also make them take less memory.
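
A minimal sketch of how that L_1 penalty could enter a plain Lloyd-style
k-means, assuming the usual squared-error objective plus lambda * ||c||_1 per
centroid; the code and the lambda knob are illustrative only, not anything
from Mahout:

# Toy L1-regularized k-means: the centroid update minimizes
#   sum_i ||x_i - c||^2 + lambda * ||c||_1
# per cluster, which is just the cluster mean pushed through a soft threshold.
soft <- function(v, t) sign(v) * pmax(abs(v) - t, 0)

l1_kmeans <- function(X, k, lambda, iters = 25) {
  C <- X[sample(nrow(X), k), , drop = FALSE]          # random initial centroids
  for (it in 1:iters) {
    # assignment step: nearest centroid by Euclidean distance
    d <- as.matrix(dist(rbind(C, X)))[-(1:k), 1:k, drop = FALSE]
    z <- apply(d, 1, which.min)
    # update step: per-cluster mean, shrunk so small coordinates become exact zeros
    for (j in 1:k) {
      if (any(z == j)) {
        m <- colMeans(X[z == j, , drop = FALSE])
        C[j, ] <- soft(m, lambda / (2 * sum(z == j)))
      }
    }
  }
  list(centers = C, cluster = z)
}

# e.g. fit <- l1_kmeans(as.matrix(iris[, 1:4]), k = 7, lambda = 5)
# larger lambda zeroes more centroid coordinates, i.e. sparser centroids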
>
>
>
> On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte <
> [email protected]> wrote:
>
> > Ted, thanks for the long response! I agree with you on the benefit of a lot
> > of clusters. The reason I chose 10 is not because I think there are truly
> > 10 clusters in the data (there are probably thousands; it's e-commerce site
> > behavioural data), but for technical reasons:
> >
> > - the cluster distances are used in a probability prediction model with
> > rare-events data, so I want every cluster to contain at least some positive
> > examples
> > - the cluster centroids need to be kept in memory online for real-time
> > feature vector generation and OLR scoring, and the vectors are quite big
> > with ~100 clients to handle simultaneously (100 clients * 10 clusters
> > * ~10000 non-zero features per cluster centroid * 8 bytes per vector
> > element; a rough calculation follows below)
> > - when "selling" the clusters, visualized, it's easier to plot something
> > with only 10 clusters
> >
> > I'll give it another try with more clusters in the sketch phase and see
> > what I can achieve. Thanks for your help!
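
For reference, a back-of-the-envelope version of the memory figure in the
second bullet above, assuming dense 8-byte values and ignoring the index
overhead of sparse vectors:

clients <- 100; clusters <- 10; nnz <- 10000; bytes <- 8
clients * clusters * nnz * bytes / 2^20   # ~76 MiB of centroid values at 10 clusters
clients * 100      * nnz * bytes / 2^20   # ~763 MiB if every client had 100 clusters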
> >
> >
> > On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <[email protected]>
> > wrote:
> >
> > >
> > > On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> > > [email protected]> wrote:
> > >
> > >> Okay, understood. For a lot of clusters (which I don't necessarily
> > >> attribute to big-data problems, but definitely to nearest-neighbour-like
> > >> usage of clusters), the "every cluster sees every point" approach doesn't
> > >> scale well.
> > >>
> > >
> > >
> > > Nearest neighbor sorts of problems.  Or market segmentation kinds of
> > > problems.  Or feature generation kinds of problems.  Or volume
> > > quantization kinds of problems.
> > >
> > > The point is that with more data, you can derive a more detailed model.
> > >  That means more clusters.
> > >
> > > The ideal case is that we have an infinite mixture model to describe the
> > > data distribution, but we haven't seen most of the mixture components yet.
> > >  As we get more data, we can justify saying we have more components.
> > >
> > > Even if you think that there are *really* 10 clusters in the data, I can
> > > make the case that you are going to get a better unsupervised description
> > > of those clusters by using 100 or more clusters in the k-means algorithm.
> > >  The idea is that each of your real clusters will be composed of several of
> > > the k-means clusters and finding the attachments is much easier because 100
> > > clusters is much smaller than the original number of samples.
> > >
> > > As a small data example, suppose you are looking at Anderson's Iris data,
> > > which comes built into R.  If we plot the 150 data points against the
> > > features we have, in pairs, we can see various patterns and see that a
> > > non-linear classifier should do quite well in separating the classes (I
> > > hope the image makes it through the emailer):
> > >
> > > [image: Inline image 1]
> > >
> > > But if we try to do k-means on these data with only 3 clusters, we get
> > > very poor assignment to the three species:
> > >
> > >
> > > > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > > > table(iris$Species, k$cluster)
> > >
> > >               cluster
> > >               1  2  3
> > >   setosa     50  0  0
> > >   versicolor  0 48  2
> > >   virginica   0 14 36
> > >
> > > Cluster 1 captured the isolated setosa species rather well, but versicolor
> > > and virginica are not well separated because cluster 2 is roughly 80%
> > > versicolor and 20% virginica.
> > >
> > > On the other hand, if we use 7 clusters,
> > >
> > >
> > > > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > >
> > > > table(iris$Species, k$cluster)
> > >
> > >                    cluster
> > >               1  2  3  4  5  6  7
> > >   setosa      0  0 28  0 22  0  0
> > >   versicolor  0  7  0 20  0  0 23
> > >   virginica  12  0  0  1  0 24 13
> > >
> > > Most clusters are now composed of almost exactly one species.  Cluster 4 is
> > > 95% versicolor with only a single virginica sample, and only cluster 7 still
> > > mixes versicolor and virginica to any real degree.
> > >
> > > What this means is that we can use the 7 cluster k-means results to build
> > > a classifier that has a 1 of 7 input feature (cluster id) instead of 4 real
> > > values.  That is, we have compressed the 4 original continuous features
> > > down to about 2.7 bits on average and this compressed representation
> > > actually makes building a classifier nearly trivial.
> > >
> > > > -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)
> > > 2.670288
> > >
> > > This is pretty cool compression given that the original continuous data
> > > had about 22 bits of raw information capacity based on the range and
> > > precision given.
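
To make the "nearly trivial" classifier above concrete, a minimal sketch in R,
assuming a simple majority-vote rule per cluster id (the rule and the maj
helper below are illustrative, not something prescribed in the thread):

# Predict each point's species as the majority species of its k-means cluster.
k    <- kmeans(iris[, 1:4], centers = 7, nstart = 10)
maj  <- apply(table(k$cluster, iris$Species), 1, function(r) names(which.max(r)))
pred <- maj[as.character(k$cluster)]
mean(pred == iris$Species)    # in-sample accuracy of the cluster-id classifier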
> > >
> > > Now, in the real world, we would need to hold out some data for cross
> > > validation of the clustering, but the fact remains that if you want to use
> > > the k-means clustering for some machine oriented purpose, it pays to have
> > > as many clusters as you can have, subject to the held-out data agreeing
> > > with the original data on distance distributions and counts for the
> > > clusters.
> > >
> >
>
