Hi everyone,

Ted and I noticed some issues with the BallKMeans implementation.
We split the 20 newsgroups dataset into training and test sets as per
the "official" split, trained the clusters on the training set, fixed
them, and then measured the average cluster distance on each set.
This gives two errors, e_in and e_out, and we made a scatterplot of
the results expecting to find a correlation between e_in and e_out.
It turns out that this is not the case for BallKMeans, although it
holds for StreamingKMeans and the default Mahout KMeans.
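
For reference, here's roughly what the evaluation loop looks like (a
minimal Java sketch, not the actual code; avgClusterDistance and
euclidean are made-up names):

import java.util.List;

public class ClusterEval {
  // Average distance from each point to its nearest (fixed) centroid.
  static double avgClusterDistance(List<double[]> points,
                                   List<double[]> centroids) {
    double total = 0.0;
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        best = Math.min(best, euclidean(p, c));
      }
      total += best;
    }
    return total / points.size();
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // e_in  = avgClusterDistance(trainPoints, centroids);
  // e_out = avgClusterDistance(testPoints, centroids);
}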

The vectors used for clustering are TF-IDF vectors generated with
seq2sparse and projected down to 100 dimensions using random
projections.
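
(For anyone unfamiliar with the projection step, it amounts to
multiplying by a random matrix; a minimal dense sketch using a
Gaussian matrix follows. The actual projection scheme may differ, and
the real TF-IDF vectors are sparse:)

import java.util.Random;

public class RandomProjection {
  // Build a (targetDim x inputDim) Gaussian matrix, scaled so that
  // norms are roughly preserved in expectation.
  static double[][] randomMatrix(int targetDim, int inputDim, long seed) {
    Random rng = new Random(seed);
    double[][] m = new double[targetDim][inputDim];
    for (int i = 0; i < targetDim; i++) {
      for (int j = 0; j < inputDim; j++) {
        m[i][j] = rng.nextGaussian() / Math.sqrt(targetDim);
      }
    }
    return m;
  }

  // Project a vector down to targetDim dimensions (100 in our runs).
  static double[] project(double[] v, double[][] m) {
    double[] out = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < v.length; j++) {
        out[i] += m[i][j] * v[j];
      }
    }
    return out;
  }
}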

Now, I'm trying to pinpoint the issues with BallKMeans.

Based on what I've seen there doesn't seem to be anything necessarily
wrong with the algorithm. I think the data itself might be the actual
issue.

I tried (see the sketch after this list for 2 and 4):
1. a basic run;
2. fixing the seed initialization: k-means++ always used
SquaredEuclideanDistance, as in the paper, which seems wrong when the
distance used to get the nearest points is different, so I made the
seeding distance the same as in the nearest-neighbor searcher (1 and 2
were performed using EuclideanDistance as the metric);
3. switching to SquaredEuclideanDistance as the distance metric;
4. abandoning kmeans++ and reverting to random sampling of the initial
centroids.
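
To make 2 and 4 concrete, here's a minimal sketch of both seeding
strategies (plain Java on double[] arrays, not the actual Mahout
classes; kmeansPP and randomSeeds are made-up names):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

public class Seeding {
  // k-means++ with a pluggable distance: points are weighted by the
  // same distance the nearest-neighbor searcher uses, instead of
  // always squared Euclidean as in the paper.
  static List<double[]> kmeansPP(List<double[]> points, int k,
      BiFunction<double[], double[], Double> dist, Random rng) {
    List<double[]> seeds = new ArrayList<>();
    seeds.add(points.get(rng.nextInt(points.size())));
    while (seeds.size() < k) {
      double[] weights = new double[points.size()];
      double total = 0.0;
      for (int i = 0; i < points.size(); i++) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] s : seeds) {
          best = Math.min(best, dist.apply(points.get(i), s));
        }
        weights[i] = best;
        total += best;
      }
      // Sample the next seed proportionally to its distance weight.
      double r = rng.nextDouble() * total;
      int next = 0;
      double acc = weights[0];
      while (acc < r && next < points.size() - 1) {
        acc += weights[++next];
      }
      seeds.add(points.get(next));
    }
    return seeds;
  }

  // Variant 4: uniform random sampling of the initial centroids
  // (possible duplicate picks ignored for brevity).
  static List<double[]> randomSeeds(List<double[]> points, int k, Random rng) {
    List<double[]> seeds = new ArrayList<>();
    for (int i = 0; i < k; i++) {
      seeds.add(points.get(rng.nextInt(points.size())));
    }
    return seeds;
  }
}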

I have some plots and data, but the bottom line is that 1, 2 and 3
are very close (I don't think anything was gained from clustering with
Euclidean rather than SquaredEuclidean).

The interesting bit is that doing 4 "fixes" BallKMeans. It looks like
the principled seeding is "hurting" the clusters (at least according
to the correlation signal).

Here are two box plots of all the types of clustering:

[1] is using kmeans++ for the initial centroids in all BallKMeans runs.
[2] is using random centroids in all BallKMeans runs.

Here's what the cryptic 3-letter abbreviations mean:
km = Mahout's KMeans
skm = just StreamingKMeans with ~200 clusters
skm0 = just StreamingKMeans with 20 clusters
oskm = just StreamingKMeans giving it one point at a time ("online")
bkm = just BallKMeans
boskm = BallKMeans on the centroids from StreamingKMeans with one
point at a time (oskm)
bskm = BallKMeans on the centroids from StreamingKMeans with all
points at the same time (skm)
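
To be explicit about the two-stage runs, bskm/boskm are shaped like
this (streamingKMeans and ballKMeans are hypothetical stand-ins, not
the real classes):

import java.util.List;

public class TwoStagePipeline {
  // StreamingKMeans first reduces the data to a larger sketch of
  // centroids (~200 here), then BallKMeans clusters that sketch down
  // to the final k = 20.
  static List<double[]> cluster(List<double[]> points) {
    List<double[]> sketch = streamingKMeans(points, 200); // skm, or oskm if fed one point at a time
    return ballKMeans(sketch, 20);                        // bskm or boskm
  }

  // Stand-ins for the streaming and batch passes.
  static List<double[]> streamingKMeans(List<double[]> points, int sketchSize) {
    throw new UnsupportedOperationException("stand-in");
  }

  static List<double[]> ballKMeans(List<double[]> sketch, int k) {
    throw new UnsupportedOperationException("stand-in");
  }
}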

My conclusions from this are:
- it's unclear what counts as better: a good mean distance, a good
median distance, fewer outliers... ?
- kmeans++ is doing worse at initializing the centroids, since bkm is
a lot more compact (lower mean and median) when using random
centroids; however, there are many more outliers
- the effect is less dramatic for boskm and bskm: they have a smaller
median, but the mean is a bit larger and there are a lot more outliers
too

I'm pretty sure the data and k are not optimal and the assumptions in
the paper don't really hold for this data set.
There don't really seem to be any clear clusters and the random
projections probably mixed them even further.

So, I think this is why kmeans++ is 'worse' than random centroids
(depending on how you define worse...).

[1] https://docs.google.com/file/d/0B4chdzPoBmWoWmpLOHNzOXB0c1k/edit?usp=sharing
[2] https://docs.google.com/file/d/0B4chdzPoBmWoLUtNVVN3UzU3a1k/edit?usp=sharing
