You may have better results if you used a learnt embedding space instead of random projections.

On Mar 22, 2013 3:43 AM, "Dan Filimon" <dangeorge.fili...@gmail.com> wrote:
> Hi everyone,
>
> Ted and I noticed some issues with the BallKMeans implementation.
> When splitting the 20 newsgroups dataset into training and test sets as
> per the "official" split and measuring average cluster distances in
> the training and test set (I fixed the clusters after training), we
> got two errors, e_in and e_out, and made a scatterplot of the results,
> expecting to find a correlation between e_in and e_out.
> It turns out that this is not the case for BallKMeans, despite being
> true for StreamingKMeans and the default Mahout KMeans.
>
> The vectors used for clustering are TFIDF vectors generated using
> seq2sparse and projected down to 100 dimensions with random projections.
>
> Now, I'm trying to pinpoint the issues with BallKMeans.
>
> Based on what I've seen, there doesn't seem to be anything necessarily
> wrong with the algorithm. I think the data itself might be the actual
> issue.
>
> I tried:
> 1. a basic run;
> 2. when initializing the seeds using k-means++, the distance used was
> always SquaredEuclideanDistance, as in the paper; this seems wrong when
> the distance used to find the nearest points is different, so I made
> the distance metric the same as in the nearest-neighbor searcher.
> Runs 1 and 2 were performed using EuclideanDistance as the metric;
> 3. I switched to SquaredEuclideanDistance as the distance metric;
> 4. I abandoned k-means++ and reverted to random sampling of the
> initial centroids.
>
> I have some plots and data, but the bottom line is that 1, 2 and 3 are
> very close (I don't think anything was really gained by clustering with
> Euclidean rather than SquaredEuclidean distance).
>
> The interesting bit is that doing 4 "fixes" BallKMeans. It looks like
> the principled seeding is "hurting" the clusters (at least according
> to the correlation signal).
>
> Here are two box plots of all the types of clustering:
>
> [1] uses k-means++ for the initial centroids in all BallKMeans runs.
> [2] uses random centroids in all BallKMeans runs.
>
> Here's what the cryptic abbreviations mean:
> km = Mahout's KMeans
> skm = just StreamingKMeans with ~200 clusters
> skm0 = just StreamingKMeans with 20 clusters
> oskm = just StreamingKMeans, given one point at a time ("online")
> bkm = just BallKMeans
> boskm = BallKMeans on the centroids from StreamingKMeans run one point
> at a time (oskm)
> bskm = BallKMeans on the centroids from StreamingKMeans run with all
> points at the same time (skm)
>
> My conclusions from this are:
> - It's unclear what is better: a good mean distance, a good median
> distance, fewer outliers...?
> - k-means++ is doing worse at initializing the centroids, as bkm is a
> lot more compact (lower mean and median) when using random centroids;
> however, there are many more outliers.
> - The effect is less dramatic for boskm and bskm, but they have a
> smaller median; the mean, however, is a bit larger, and there are a
> lot more outliers too.
>
> I'm pretty sure the data and k are not optimal and the assumptions in
> the paper don't really hold for this data set.
> There don't really seem to be any clear clusters, and the random
> projections probably mixed them even further.
>
> So, I think this is why k-means++ is 'worse' than random centroids
> (depending on how you define worse...).
>
> [1] https://docs.google.com/file/d/0B4chdzPoBmWoWmpLOHNzOXB0c1k/edit?usp=sharing
> [2] https://docs.google.com/file/d/0B4chdzPoBmWoLUtNVVN3UzU3a1k/edit?usp=sharing
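For anyone following along, here is a minimal standalone sketch of the e_in / e_out check described above. It uses plain double[] vectors and Euclidean distance rather than Mahout's Vector and DistanceMeasure classes, and the class and method names are just illustrative: cluster the training split, freeze the resulting centroids, then compute the average nearest-centroid distance on both splits.

import java.util.List;

// Average distance from each point to its nearest centroid, with the
// centroids held fixed. Calling this on the training split gives e_in,
// and on the test split gives e_out; across repeated runs the two
// numbers should be correlated if the clustering generalizes.
public class InOutError {

  static double averageClusterDistance(List<double[]> points, List<double[]> centroids) {
    double total = 0;
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        best = Math.min(best, euclidean(p, c));
      }
      total += best;
    }
    return total / points.size();
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

The scatterplot of (e_in, e_out) pairs collected over many runs is then what shows, or fails to show, the correlation.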
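The random projection step mentioned above is the usual dense Gaussian projection. The sketch below is a generic illustration, not the actual seq2sparse / Mahout code path that was used, and the class name is made up.

import java.util.Random;

// Generic random projection: multiply each input vector by a fixed random
// Gaussian matrix to reduce it to reducedDim dimensions. Pairwise Euclidean
// distances are approximately preserved, which is what makes the projected
// TF-IDF vectors usable for clustering.
public class RandomProjection {

  private final double[][] basis; // reducedDim rows x originalDim columns

  RandomProjection(int originalDim, int reducedDim, long seed) {
    Random random = new Random(seed);
    basis = new double[reducedDim][originalDim];
    for (int i = 0; i < reducedDim; i++) {
      for (int j = 0; j < originalDim; j++) {
        // N(0, 1) entries scaled so that distances are roughly preserved.
        basis[i][j] = random.nextGaussian() / Math.sqrt(reducedDim);
      }
    }
  }

  double[] project(double[] v) {
    double[] result = new double[basis.length];
    for (int i = 0; i < basis.length; i++) {
      double sum = 0;
      for (int j = 0; j < v.length; j++) {
        sum += basis[i][j] * v[j];
      }
      result[i] = sum;
    }
    return result;
  }
}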
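The change described in point 2 above (making the seeding distance match the searcher's distance) is easier to see in code. Below is a standalone sketch of k-means++ seeding with the distance measure passed in as a parameter; the Distance interface is a stand-in rather than Mahout's DistanceMeasure, and this is not the actual BallKMeans implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSeeding {

  // Stand-in for a pluggable distance measure (not Mahout's DistanceMeasure).
  interface Distance {
    double distance(double[] a, double[] b);
  }

  // k-means++ seeding: each new seed is sampled with probability proportional
  // to its distance, under the given measure, to the closest seed chosen so far.
  static List<double[]> seed(List<double[]> points, int k, Distance measure, Random random) {
    List<double[]> centroids = new ArrayList<>();
    // First seed: a uniformly random point.
    centroids.add(points.get(random.nextInt(points.size())));
    while (centroids.size() < k) {
      // Weight each point by its distance to the closest centroid chosen so far.
      double[] weights = new double[points.size()];
      double total = 0;
      for (int i = 0; i < points.size(); i++) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] c : centroids) {
          best = Math.min(best, measure.distance(points.get(i), c));
        }
        weights[i] = best;
        total += best;
      }
      // Sample the next seed proportionally to those weights.
      double r = random.nextDouble() * total;
      int next = 0;
      double cumulative = weights[0];
      while (cumulative < r && next < points.size() - 1) {
        next++;
        cumulative += weights[next];
      }
      centroids.add(points.get(next));
    }
    return centroids;
  }
}

Passing a squared Euclidean measure here reproduces the weighting from the paper, while passing the searcher's measure reproduces the change described in point 2, so both seeding and nearest-centroid assignment use the same distance.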