(Cross post with
http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center)


I have a corpus of wiki pages (baseball, hockey, music, football) which I'm
running through tfidf and then through kmeans. After a couple of issues to
start (you can see my previous questions), I'm finally getting a
KMeansModel, but when I try to predict, I keep getting the same center.
Is this because of the small dataset, or because I'm comparing multi-word
documents against a much shorter query (1-20 words)? Or is there something
else I'm doing wrong? See the code below:

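import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// Preprocessing of data includes splitting into words
// and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]] = ???  // loading and preprocessing omitted here
val hashingTF = new HashingTF(100000)  // 100000 hash buckets (features)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)            // fitted IDFModel
val tfidf = idf.transform(tf).cache()
val kMeansModel = KMeans.train(tfidf, 3, 10)  // k = 3, maxIterations = 10

// Run a single-term query through the same TF-IDF pipeline
val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) // Always the same, no matter the term supplied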
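
One way to narrow this down might be to look at how the training documents themselves are spread over the clusters, and how far the query vector is from each center. The following is only a rough sketch under assumptions: `diagnose` is a hypothetical helper (not part of the code above or of MLlib), and it assumes a Spark version where `Vectors.sqdist` is available.

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical diagnostic helper, for illustration only.
def diagnose(model: KMeansModel, docs: RDD[Vector], query: Vector): Unit = {
  // Count how many training documents are assigned to each cluster.
  val sizes = model.predict(docs).countByValue()
  sizes.toSeq.sortBy(_._1).foreach { case (k, n) =>
    println(s"cluster $k: $n documents")
  }

  // Squared Euclidean distance from the query vector to every center.
  model.clusterCenters.zipWithIndex.foreach { case (center, k) =>
    println(s"center $k: sqdist = ${Vectors.sqdist(center, query)}")
  }
}
```

With the names from the code above it would be called as `diagnose(kMeansModel, tfidf, queryTfidf)`. If nearly all documents fall into one cluster, or the distances to all centers are almost equal, that would point at the clustering itself rather than at the query.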
