You may get better results if you use a learnt embedding space instead of
a random one.
On Mar 22, 2013 3:43 AM, "Dan Filimon" <dangeorge.fili...@gmail.com> wrote:

> Hi everyone,
>
> Ted and I noticed some issues with the BallKMeans implementation.
> We split the 20 newsgroups dataset into training and test sets as per
> the "official" split and measured average cluster distances in the
> training and test sets (the clusters were fixed after training),
> giving two errors, e_in and e_out. We made a scatterplot of the
> results, expecting to find a correlation between e_in and e_out.
> It turns out that this is not the case for BallKMeans despite being
> true for StreamingKMeans and the default Mahout KMeans.
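[For anyone wanting to reproduce the measurement: the protocol above can be sketched roughly as below. This is a minimal NumPy stand-in, not the actual Mahout code; the function name and the random stand-in data are illustrative only.]

```python
import numpy as np

def average_cluster_distance(points, centroids):
    """Mean distance from each point to its nearest (fixed) centroid."""
    # pairwise distances: shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(42)
train = rng.normal(size=(200, 5))  # stand-in for the training vectors
test = rng.normal(size=(100, 5))   # stand-in for the test vectors
# stand-in for centroids produced by training; they are then held fixed
centroids = train[rng.choice(len(train), 20, replace=False)]

e_in = average_cluster_distance(train, centroids)
e_out = average_cluster_distance(test, centroids)
# across many runs, a healthy clustering should show e_in and e_out correlated
```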
>
> The vectors used for clustering are TFIDF generated using seq2sparse
> and projected down to 100 dimensions with random projections.
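[The projection step is a standard Johnson-Lindenstrauss-style random projection; a rough NumPy sketch is below. It does not reproduce seq2sparse's TF-IDF pipeline, and the function name and data are made up for illustration.]

```python
import numpy as np

def random_projection(X, target_dim=100, seed=0):
    """Project rows of X down to target_dim with a random Gaussian matrix;
    the 1/sqrt(target_dim) scaling approximately preserves distances."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R

# stand-in for the (dense view of the) TFIDF vectors
X = np.random.default_rng(1).normal(size=(50, 1000))
Xp = random_projection(X, target_dim=100)
# pairwise distances in Xp approximate those in X, up to JL distortion
```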
>
> Now, I'm trying to pinpoint the issues with BallKMeans.
>
> Based on what I've seen there doesn't seem to be anything necessarily
> wrong with the algorithm. I think the data itself might be the actual
> issue.
>
> I tried:
> 1. a basic run
> 2. when initializing the seeds with k-means++, the distance used was
> always SquaredEuclideanDistance, as in the paper; this seems wrong
> when a different distance is used to find the nearest points. I made
> the distance metric the same as the one in the nearest-neighbor
> searcher. Runs 1 and 2 used EuclideanDistance as the metric.
> 3. I switched to SquaredEuclideanDistance as the distance metric.
> 4. I abandoned kmeans++ and reverted to a random sampling of the
> initial centroids.
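[The fix in item 2 amounts to making the seeding distance pluggable. A minimal sketch of k-means++ seeding with a caller-supplied distance is below; it is not Mahout's BallKMeans code (which works through an UpdatableSearcher), and all names here are illustrative.]

```python
import numpy as np

def kmeanspp_seeds(points, k, distance, seed=0):
    """k-means++ seeding with a pluggable distance function, so the
    seeding metric can match the one used by the nearest-neighbor search."""
    rng = np.random.default_rng(seed)
    seeds = [points[rng.integers(len(points))]]
    while len(seeds) < k:
        # distance from each point to its nearest seed chosen so far
        d = np.array([min(distance(p, s) for s in seeds) for p in points])
        # sample the next seed proportionally to d (the paper uses d^2,
        # i.e. squared Euclidean, which is exactly the discrepancy above)
        seeds.append(points[rng.choice(len(points), p=d / d.sum())])
    return np.array(seeds)

euclidean = lambda a, b: float(np.linalg.norm(a - b))
squared_euclidean = lambda a, b: float(np.sum((a - b) ** 2))

pts = np.random.default_rng(2).normal(size=(100, 3))
seeds = kmeanspp_seeds(pts, 5, euclidean)  # or squared_euclidean
```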
>
> I have some plots and data, but the bottom line is that 1, 2 and 3 are
> very close (I don't really think anything was gained from clustering
> using Euclidean or SquaredEuclidean).
>
> The interesting bit is that doing 4 "fixes" BallKMeans. It looks like
> the principled seeding is "hurting" the clusters (at least according
> to the correlation signal).
>
> Here are 2 box plots of all the types of clustering:
>
> [1] is using kmeans++ for the initial centroids in all BallKMeans runs.
> [2] is using random centroids in all BallKMeans runs.
>
> Here's what the cryptic 3-letter abbreviations mean:
> km = Mahout's KMeans
> skm = just StreamingKMeans with ~200 clusters
> skm0 = just StreamingKMeans with 20 clusters
> oskm = just StreamingKMeans giving it one point at a time ("online")
> bkm = just BallKMeans
> boskm = BallKMeans on the centroids from StreamingKMeans with one
> point at a time (oskm)
> bskm = BallKMeans on the centroids from StreamingKMeans with all
> points at the same time (skm)
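[The two-stage combinations (boskm, bskm) run a one-pass streaming sketch and then refine its weighted centroids. A heavily simplified sketch of that pipeline is below; the cutoff handling, cutoff growth, and ball-trimming of the real StreamingKMeans/BallKMeans are omitted, and all names and parameters are illustrative.]

```python
import numpy as np

def streaming_sketch(points, distance_cutoff):
    """One-pass sketch: fold each point into its nearest centroid if it is
    close enough, otherwise open a new weighted centroid."""
    centroids, weights = [points[0].copy()], [1.0]
    for p in points[1:]:
        d = [np.linalg.norm(p - c) for c in centroids]
        i = int(np.argmin(d))
        if d[i] < distance_cutoff:
            weights[i] += 1.0
            centroids[i] += (p - centroids[i]) / weights[i]  # running mean
        else:
            centroids.append(p.copy())
            weights.append(1.0)
    return np.array(centroids), np.array(weights)

def weighted_kmeans(centroids, weights, k, iters=10, seed=0):
    """Refine the sketch into k clusters with weighted Lloyd iterations
    (the role BallKMeans plays in the bskm/boskm combinations)."""
    rng = np.random.default_rng(seed)
    means = centroids[rng.choice(len(centroids), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(centroids[:, None] - means[None, :], axis=2), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                means[j] = np.average(centroids[mask], axis=0, weights=weights[mask])
    return means

pts = np.random.default_rng(3).normal(size=(300, 4))  # stand-in data
sketch, w = streaming_sketch(pts, distance_cutoff=2.5)
final = weighted_kmeans(sketch, w, k=5)
```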
>
> My conclusions from this are:
> - it's unclear what is better: a good mean distance, a good median
> distance, fewer outliers... ?
> - kmeans++ does worse at initializing the centroids, since bkm is a
> lot more compact (lower mean and median) when using random centroids;
> however, there are many more outliers
> - the effect is less dramatic for boskm and bskm, but they have a
> smaller median; the mean, however, is a bit larger; there are a lot
> more outliers too
>
> I'm pretty sure the data and k are not optimal and the assumptions in
> the paper don't really hold for this data set.
> There don't really seem to be any clear clusters and the random
> projections probably mixed them even further.
>
> So, I think this is why kmeans++ is 'worse' than random centroids
> (depending on how you define worse...).
>
> [1]
> https://docs.google.com/file/d/0B4chdzPoBmWoWmpLOHNzOXB0c1k/edit?usp=sharing
> [2]
> https://docs.google.com/file/d/0B4chdzPoBmWoLUtNVVN3UzU3a1k/edit?usp=sharing
>
