Indeed. Dan and I have discussed this. The space that he starts in is TF-IDF weighted and the projection is random, so it should preserve much of the metric structure of the original space. Based on our experience with SSVD, a properly learned projection would definitely give modest improvements, but the random projection should still leave lots of information for the clustering to use.
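
To make the projection step concrete, here is a toy sketch of roughly what it does (my own illustration with plain arrays and made-up dimensions, not the actual code Dan used): multiply each TF-IDF vector by a fixed random Gaussian matrix. By the Johnson-Lindenstrauss lemma this approximately preserves pairwise Euclidean distances, which is why the clustering still has a usable metric to work with.

import java.util.Random;

public class RandomProjectionSketch {
  public static void main(String[] args) {
    int originalDim = 50000;   // stand-in for the vocabulary size
    int projectedDim = 100;

    // A fixed random Gaussian projection matrix R, scaled so that
    // squared lengths are roughly preserved in expectation.
    Random rng = new Random(42);
    double[][] r = new double[projectedDim][originalDim];
    double scale = 1.0 / Math.sqrt(projectedDim);
    for (int i = 0; i < projectedDim; i++) {
      for (int j = 0; j < originalDim; j++) {
        r[i][j] = rng.nextGaussian() * scale;
      }
    }

    // Project one vector: y = R x. Here x is random; in reality it is
    // a sparse TF-IDF vector and you would only touch the non-zeros.
    double[] x = new double[originalDim];
    for (int n = 0; n < 100; n++) {
      x[rng.nextInt(originalDim)] = rng.nextDouble();
    }
    double[] y = new double[projectedDim];
    for (int i = 0; i < projectedDim; i++) {
      double sum = 0;
      for (int j = 0; j < originalDim; j++) {
        sum += r[i][j] * x[j];
      }
      y[i] = sum;
    }
    System.out.println("projected to " + y.length + " dimensions");
  }
}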
For now the exigent question is to determine just why the different k-means variants produce noticeably different results. The streaming k-means results seem consistently good, but the original ball k-means with fancy seed finding doesn't seem quite so good. The best result that ball k-means produces appears considerably better than standard Mahout k-means, but performance on training data does not predict performance on new data, so finding that best result is hard. Random starts do somewhat better than Mahout k-means but have lots of variance. Happily, for random starts the quality on training data seems to predict quality on test data very well. I have played a bit with best-of-10 starts on synthetic data and the results seem to bear that out (a rough sketch of that selection loop is at the bottom of this message). Also happily, all tests so far show that any clustering algorithm produces essentially identical results whether it is run on the original data or on the output of streaming k-means. That means that the basic approach is sound. The next step is to test with larger and different data sets.

On Fri, Mar 22, 2013 at 5:17 PM, Hector Yee <hector....@gmail.com> wrote:

> You may have better results if you used a learnt embedding space instead of
> random
>
> On Mar 22, 2013 3:43 AM, "Dan Filimon" <dangeorge.fili...@gmail.com>
> wrote:
>
> > Hi everyone,
> >
> > Ted and I noticed some issues with the BallKMeans implementation.
> > When splitting the 20 newsgroups dataset into training and testing as
> > per the "official" split and measuring average cluster distances in
> > the training and test sets (I fixed the clusters after training), we
> > got two errors, e_in and e_out, and made a scatterplot of the results,
> > expecting to find a correlation between e_in and e_out.
> > It turns out that this is not the case for BallKMeans, despite being
> > true for StreamingKMeans and the default Mahout KMeans.
> >
> > The vectors used for clustering are TF-IDF vectors generated using
> > seq2sparse and projected down to 100 dimensions with random projections.
> >
> > Now, I'm trying to pinpoint the issues with BallKMeans.
> >
> > Based on what I've seen, there doesn't seem to be anything necessarily
> > wrong with the algorithm. I think the data itself might be the actual
> > issue.
> >
> > I tried:
> > 1. a basic run;
> > 2. when initializing the seeds using k-means++, the distance being used
> > was always SquaredEuclideanDistance as in the paper; this seems wrong
> > when the distance being used to get the nearest points is different, so
> > I made the distance metric the same as in the nearest-neighbor
> > searcher. Runs 1 and 2 used EuclideanDistance as the metric;
> > 3. I switched to SquaredEuclideanDistance as the distance metric;
> > 4. I abandoned k-means++ and reverted to random sampling of the
> > initial centroids.
> >
> > I have some plots and data, but the bottom line is that 1, 2 and 3 are
> > very close (I don't think anything was gained by clustering with
> > Euclidean rather than SquaredEuclidean distance).
> >
> > The interesting bit is that doing 4 "fixes" BallKMeans. It looks like
> > the principled seeding is "hurting" the clusters (at least according
> > to the correlation signal).
> >
> > Here are two box plots of all the types of clustering:
> >
> > [1] is using k-means++ for the initial centroids in all BallKMeans runs.
> > [2] is using random centroids in all BallKMeans runs.
> >
> > Here's what the cryptic 3-letter abbreviations mean:
> > km = Mahout's KMeans
> > skm = just StreamingKMeans with ~200 clusters
> > skm0 = just StreamingKMeans with 20 clusters
> > oskm = just StreamingKMeans fed one point at a time ("online")
> > bkm = just BallKMeans
> > boskm = BallKMeans on the centroids from StreamingKMeans with one
> > point at a time (oskm)
> > bskm = BallKMeans on the centroids from StreamingKMeans with all
> > points at the same time (skm)
> >
> > My conclusions from this are:
> > - it's unclear what is better: having a good mean distance, having a
> > good median distance, having fewer outliers... ?
> > - k-means++ is doing worse at initializing the centroids, as bkm is a
> > lot more compact (lower mean and median) when using random centroids;
> > however, there are many more outliers
> > - the effect is less dramatic for boskm and bskm: they have a smaller
> > median but a somewhat larger mean, and a lot more outliers too
> >
> > I'm pretty sure the data and k are not optimal, and the assumptions in
> > the paper don't really hold for this data set.
> > There don't really seem to be any clear clusters, and the random
> > projections probably mixed them even further.
> >
> > So, I think this is why k-means++ is 'worse' than random centroids
> > (depending on how you define worse...).
> >
> > [1]
> > https://docs.google.com/file/d/0B4chdzPoBmWoWmpLOHNzOXB0c1k/edit?usp=sharing
> > [2]
> > https://docs.google.com/file/d/0B4chdzPoBmWoLUtNVVN3UzU3a1k/edit?usp=sharing
> >
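
P.S. Here is the rough sketch I promised above of the best-of-N random starts selection (my own stand-in code with plain arrays, not the actual Mahout classes): run Lloyd's k-means from several uniformly random seedings, score each run by the average point-to-centroid distance on the training data (e_in), and keep the best run. The point being tested is that for random starts a low e_in also predicts a low e_out on held-out data with the centroids held fixed.

import java.util.List;
import java.util.Random;

public class BestOfNStarts {

  // Index of the nearest centroid under squared Euclidean distance.
  static int nearest(double[] p, double[][] centroids) {
    int best = 0;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centroids.length; c++) {
      double d = 0;
      for (int i = 0; i < p.length; i++) {
        double diff = p[i] - centroids[c][i];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  // Average Euclidean distance from each point to its nearest centroid;
  // this is e_in or e_out depending on which set is passed in.
  static double avgDistance(List<double[]> points, double[][] centroids) {
    double total = 0;
    for (double[] p : points) {
      double[] c = centroids[nearest(p, centroids)];
      double d = 0;
      for (int i = 0; i < p.length; i++) {
        double diff = p[i] - c[i];
        d += diff * diff;
      }
      total += Math.sqrt(d);
    }
    return total / points.size();
  }

  // One run of Lloyd's k-means seeded with k training points chosen
  // uniformly at random (the "random start").
  static double[][] lloyd(List<double[]> train, int k, int iterations, Random rng) {
    int dim = train.get(0).length;
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = train.get(rng.nextInt(train.size())).clone();
    }
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] p : train) {
        int c = nearest(p, centroids);
        counts[c]++;
        for (int i = 0; i < dim; i++) {
          sums[c][i] += p[i];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int i = 0; i < dim; i++) {
            centroids[c][i] = sums[c][i] / counts[c];
          }
        }
      }
    }
    return centroids;
  }

  // Run n random starts and keep the one with the lowest e_in.
  static double[][] bestOfN(List<double[]> train, int k, int n, Random rng) {
    double[][] best = null;
    double bestEIn = Double.POSITIVE_INFINITY;
    for (int run = 0; run < n; run++) {
      double[][] centroids = lloyd(train, k, 10, rng);
      double eIn = avgDistance(train, centroids);
      if (eIn < bestEIn) {
        bestEIn = eIn;
        best = centroids;
      }
    }
    return best;
  }
}

To check the correlation signal, compare avgDistance(train, c) against avgDistance(test, c) for the chosen centroids c = bestOfN(train, 20, 10, new Random()).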