Re: KMean clustering resulting Skewed Issue

2017-03-29 Thread Asher Krim
As I said in my previous reply, I don't think k-means is the right tool to start with. Try LDA with k (the number of latent topics) set to 3 and go up to, say, 20. The problem likely lies in the feature vectors, about which you provided almost no information. Text is not taken from a continuous space, so

Re: KMean clustering resulting Skewed Issue

2017-03-26 Thread Asher Krim
Hi, do you mean that you're running K-Means directly on tf-idf bag-of-words vectors? I think your results are expected because of the general lack of overlap between one-hot encoded vectors. The similarity between most vectors is expected to be very close to zero. Those that do end up in the
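
To illustrate the point about near-zero similarity, here is a minimal sketch in plain Python rather than Spark. The four toy documents, the helper names `tfidf_vectors` and `cosine`, and the smoothed idf formula log((n + 1) / (df + 1)) are all assumptions made for illustration; they are not taken from the original poster's pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dict: term -> weight) from tokenized docs.

    Uses a smoothed idf, log((n + 1) / (df + 1)), chosen here as an assumption.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: c * math.log((n + 1) / (df[t] + 1)) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two "hockey" documents and two "space" documents.
docs = [
    "the hockey team won the game".split(),
    "the hockey player scored in the game".split(),
    "nasa launched the space shuttle".split(),
    "the shuttle orbit was adjusted by nasa".split(),
]
vecs = tfidf_vectors(docs)

same_topic = cosine(vecs[0], vecs[1])   # share "hockey" and "game"
cross_topic = cosine(vecs[0], vecs[2])  # share only "the", whose idf is 0
print(same_topic, cross_topic)
```

In this sketch the cross-topic pair shares only a term that occurs in every document, so its idf weight is zero and the similarity vanishes. With a realistic vocabulary of tens of thousands of terms, most document pairs look like the cross-topic pair, which gives a distance-based method such as k-means very little signal to separate them.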

KMean clustering resulting Skewed Issue

2017-03-24 Thread Reth RM
Hi, I am using Spark k-means to cluster records that consist of news documents; the vectors are created by applying tf-idf. The dataset I am using for testing right now is the ground-truth classified http://qwone.com/~jason/20Newsgroups/ The issue is that all the documents are getting assigned to the same