As I said in my previous reply, I don't think k-means is the right tool to
start with. Try LDA with k (the number of latent topics) set to 3 and go up
to, say, 20. The problem likely lies in the feature vectors, on which you
provided almost no information. Text is not taken from a continuous space,
so
Hi,
Do you mean that you're running k-means directly on tf-idf bag-of-words
vectors? I think your results are expected because of the general lack of
significant overlap between one-hot encoded vectors. The similarity between
most vectors is expected to be very close to zero. Those that do end up in the
vectors is expected to be very close to zero. Those that do end up in the
Hi,
I am using Spark k-means to cluster records that consist of news
documents; the vectors are created by applying tf-idf. The dataset I am
using for testing right now is the gold-truth classified
http://qwone.com/~jason/20Newsgroups/
The issue is that all the documents are getting assigned to the same
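
As a rough illustration of the overlap point raised in the reply above (this is not code from anyone in the thread), here is a minimal pure-Python sketch. The helper names tfidf_vectors and cosine, the three toy sentences, and the smoothed idf formula log((N + 1) / (df + 1)) are all assumptions chosen for the example; it does not use Spark.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts of term -> weight) for a list of strings.

    Hypothetical helper: whitespace tokenization and a smoothed idf,
    log((N + 1) / (df + 1)), are assumptions made for this sketch.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: c * math.log((n_docs + 1) / (df[t] + 1)) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy documents; the only word all three share is "the".
docs = [
    "the rocket launch reached orbit",
    "the hockey team won the game",
    "the orbit of the moon",
]
vectors = tfidf_vectors(docs)

# Documents 0 and 1 share only a term that occurs everywhere (idf of zero).
print(cosine(vectors[0], vectors[1]))
# Documents 0 and 2 share one informative term ("orbit").
print(cosine(vectors[0], vectors[2]))
```

With real news text the vocabulary is far larger than in this toy case, so pairs of documents with little or no shared informative vocabulary are the norm, which is consistent with the near-zero similarities described above.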