(Cross-posted at http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center)
I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through TF-IDF and then through k-means. After a couple of issues to start (you can see my previous questions), I'm finally getting a KMeansModel... but when I try to predict, I always get the same cluster center.

Is this because of the small dataset, or because I'm comparing multi-word documents against a much shorter query (1-20 words)? Or is there something else I'm doing wrong? See the code below:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD

    // Preprocessing of data includes splitting into words
    // and removing words with only 1 or 2 characters
    val corpus: RDD[Seq[String]]

    // Hash each document's terms into a 100000-dimensional term-frequency vector
    val hashingTF = new HashingTF(100000)
    val tf = hashingTF.transform(corpus)

    // Fit IDF weights on the corpus and rescale the term frequencies
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf).cache()

    // 3 clusters, 10 iterations
    val kMeansModel = KMeans.train(tfidf, 3, 10)

    // Vectorize the query with the same hashing and IDF weights
    val queryTf = hashingTF.transform(List("music"))
    val queryTfidf = idf.transform(queryTf)
    kMeansModel.predict(queryTfidf) // Always the same, no matter the term supplied
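To make the second possibility (a very short query compared against long documents) concrete, here is a minimal sketch in plain Scala, with no Spark involved. Everything in it is an assumption for illustration: the object name `NearestCenterSketch`, the 4-term vocabulary, and the three made-up centers are hypothetical and are not taken from my actual model. It only shows that, with squared Euclidean distance, a one-term query can land on whichever center has the smallest norm, regardless of which term is supplied.

```scala
// Hypothetical toy example: all vectors and numbers are made up for illustration.
object NearestCenterSketch {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors of equal length
  def sqDist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to q by squared Euclidean distance
  def nearest(centers: Seq[Vec], q: Vec): Int =
    centers.indices.minBy(i => sqDist(centers(i), q))

  // Three made-up centers over a 4-term vocabulary.
  // Center 0 has a much smaller norm than the other two.
  val centers: Seq[Vec] = Seq(
    Array(0.2, 0.1, 0.0, 0.1),
    Array(3.0, 0.0, 2.0, 0.0),
    Array(0.0, 4.0, 0.0, 3.0)
  )

  // One-term queries, one per vocabulary term
  val queries: Seq[Vec] = Seq(
    Array(1.0, 0.0, 0.0, 0.0),
    Array(0.0, 1.0, 0.0, 0.0),
    Array(0.0, 0.0, 1.0, 0.0),
    Array(0.0, 0.0, 0.0, 1.0)
  )

  def main(args: Array[String]): Unit = {
    // In this toy data every query is assigned to center 0, even the
    // third one, whose only term is characteristic of center 1.
    queries.zipWithIndex.foreach { case (q, term) =>
      println(s"term $term -> center ${nearest(centers, q)}")
    }
  }
}
```

If something similar is happening in the real model, comparing the norms of `kMeansModel.clusterCenters` and counting how many training documents fall in each cluster might show one center sitting much closer to the origin than the others.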