Ted, thanks for the long response! I agree with you on the benefit of a lot of clusters. The reason I chose 10 is not because I think there are truly 10 clusters in the data (there are probably thousands; it's e-commerce site behavioural data), but for technical reasons:
- the cluster distances are used in a probability prediction model with rare-events data, so I want every cluster to contain at least some positive examples
- the cluster centroids need to be kept in memory online for real-time feature vector generation and OLR scoring, and the vectors are quite big with ~100 clients to handle simultaneously (100 clients * 10 clusters * ~10000 non-zero features per cluster centroid * 8 bytes per vector element ≈ 80 MB, growing linearly with the number of clusters)
- when "selling" the clusters, visualized, it's easier to plot something with only 10 clusters

I'll give it another try with more clusters in the sketch phase and see what I can achieve. Thanks for your help!

On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <[email protected]> wrote:

> On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> [email protected]> wrote:
>
>> Okay, understood. For a lot of clusters (which I don't necessarily
>> attribute to big data problems but definitely to nearest-neighbour-like
>> usage of clusters), the "every cluster sees every point" approach
>> doesn't scale well.
>
> Nearest neighbor sorts of problems. Or market segmentation kinds of
> problems. Or feature generation kinds of problems. Or volume quantization
> kinds of problems.
>
> The point is that with more data, you can derive a more detailed model.
> That means more clusters.
>
> The ideal case is that we have an infinite mixture model to describe the
> data distribution, but we haven't seen most of the mixture components yet.
> As we get more data, we can justify saying we have more components.
>
> Even if you think that there are *really* 10 clusters in the data, I can
> make the case that you are going to get a better unsupervised description
> of those clusters by using 100 or more clusters in the k-means algorithm.
> The idea is that each of your real clusters will be composed of several of
> the k-means clusters, and finding the attachments is much easier because
> 100 clusters is much smaller than the original number of samples.
>
> As a small data example, suppose you are looking at Anderson's Iris data,
> which is available built into R. If we plot the 150 data points against
> the features we have in pairs, we can see various patterns and see that a
> non-linear classifier should do quite well in separating the classes (I
> hope the image makes it through the emailer):
>
> [image: Inline image 1 - pairs plot of the four iris measurements]
>
> But if we try to do k-means on these data with only 3 clusters, we get
> very poor assignment to the three species:
>
> > k = kmeans(iris[,1:4], centers=3, nstart=10)
> > table(iris$Species, k$cluster)
>
>              cluster
>               1  2  3
>   setosa     50  0  0
>   versicolor  0 48  2
>   virginica   0 14 36
>
> Cluster 1 captured the isolated setosa species rather well, but versicolor
> and virginica are not well separated because cluster 2 has 80% of
> versicolor and 20% of virginica.
>
> On the other hand, if we use 7 clusters,
>
> > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > table(iris$Species, k$cluster)
>
>              cluster
>               1  2  3  4  5  6  7
>   setosa      0  0 28  0 22  0  0
>   versicolor  0  7  0 20  0  0 23
>   virginica  12  0  0  1  0 24 13
>
> Each cluster is now composed of almost exactly one species. Only cluster 4
> has any impurity, and it is 95% composed of just versicolor samples.
>
> What this means is that we can use the 7-cluster k-means results to build
> a classifier that has a 1-of-7 input feature (cluster id) instead of 4
> real values.
> That is, we have compressed the 4 original continuous features down to
> about 2.7 bits on average, and this compressed representation actually
> makes building a classifier nearly trivial.
>
> > -sum(table(k$cluster) * log(table(k$cluster)/150)/150)/log(2)
> 2.670288
>
> This is pretty cool compression given that the original continuous data
> had about 22 bits of raw information capacity based on the range and
> precision given.
>
> Now, in the real world, we would need to hold out some data for cross
> validation of the clustering, but the fact remains that if you want to use
> the k-means clustering for some machine-oriented purpose, it pays to have
> as many clusters as you can have, subject to the held-out data agreeing
> with the original data on distance distributions and counts for the
> clusters.
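P.S. To make sure I follow the 1-of-7 feature step, here is roughly how I would wire the cluster id into a classifier in R (my own sketch, not code from your mail; nnet::multinom is just one convenient model choice, and the hold-out question is ignored here):

# my sketch of the cluster-id-as-feature idea (not from the mail above)
library(nnet)                                 # multinom() gives a simple multinomial classifier

set.seed(1)
k <- kmeans(iris[, 1:4], centers = 7, nstart = 10)

d <- data.frame(Species = iris$Species,
                cluster = factor(k$cluster))  # the 1-of-7 feature, nothing else

fit <- multinom(Species ~ cluster, data = d, trace = FALSE)
table(iris$Species, predict(fit, d))          # confusion matrix on the training data

If the clusters come out as pure as in your 7-cluster run, this tiny model should already separate the species almost perfectly.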

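And the held-out check at the end I read like this (again my own sketch, with an arbitrary 100/50 split): fit k-means on a training split, assign the held-out points to the nearest learned centroid, and compare cluster proportions and distance-to-centroid distributions between the two splits.

set.seed(2)
idx   <- sample(nrow(iris), 100)              # 100 training rows, 50 held out
train <- as.matrix(iris[idx, 1:4])
held  <- as.matrix(iris[-idx, 1:4])

k <- kmeans(train, centers = 7, nstart = 10)

# distance from every held-out point to every centroid; the nearest one wins
d2 <- as.matrix(dist(rbind(k$centers, held)))[-(1:7), 1:7]
nearest <- apply(d2, 1, which.min)

# cluster proportions should roughly agree between the two splits
rbind(train = as.numeric(table(factor(k$cluster, levels = 1:7))) / nrow(train),
      held  = as.numeric(table(factor(nearest,   levels = 1:7))) / nrow(held))

# and so should the distance distributions (RMS within-cluster distance on the
# training side vs. nearest-centroid distances of the held-out points)
summary(sqrt(k$withinss / as.numeric(table(k$cluster))))
summary(apply(d2, 1, min))

If the two rows or the two distance summaries disagree badly, that would be the sign to back off on the number of clusters, as I understand it.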