Hi,

Do you mean that you're running K-Means directly on tf-idf bag-of-words
vectors? I think your results are expected, because sparse, high-dimensional
vectors like these generally have very little overlap in their non-zero
terms. The similarity between most pairs of vectors is expected to be very
close to zero. Those that do end up in the same cluster likely have a lot of
similar boilerplate text (assuming the training data comes from crawled news
articles, they likely have similar menus and header/footer text).
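
To make the overlap argument concrete, here is a minimal pure-Python sketch
(not Spark code). The three toy documents, the helper names tfidf_vectors and
cosine, and the smoothed idf formula log((n + 1) / (df + 1)) are all
assumptions chosen for illustration. Documents 0 and 1 share only menu-style
boilerplate, and document 2 shares only a stop word with the others.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (dict: term -> weight) per tokenized doc."""
    n = len(docs)
    # Document frequency: number of docs each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption for this sketch): log((n + 1) / (df + 1)).
        vectors.append({t: c * math.log((n + 1) / (df[t] + 1)) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus: docs 0 and 1 share only boilerplate tokens.
docs = [
    "home contact login the rocket launch reached orbit".split(),
    "home contact login the goalie stopped every shot".split(),
    "the graphics card renders polygons quickly".split(),
]
vecs = tfidf_vectors(docs)
sim_boilerplate = cosine(vecs[0], vecs[1])  # overlap comes only from the menu words
sim_unrelated = cosine(vecs[0], vecs[2])    # only "the" is shared, and its idf is 0
print(sim_boilerplate, sim_unrelated)
```

In this toy setup the pair that shares boilerplate gets only a small positive
similarity, and the pair with different topic words gets none at all, which
leaves K-Means very little signal to separate topics.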

I would suggest you try some dimensionality reduction on the tf-idf vectors
first. You have many options to choose from (LSA, LDA, doc2vec, etc.).
Other than that, this isn't a Spark question.
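
As a rough illustration of why a reduction step such as LSA can help, here is
a toy pure-Python sketch. It is not Spark code and not a production
implementation: it uses raw term counts instead of tf-idf weights, and it
finds the top right singular vectors with plain power iteration rather than a
library SVD. The corpus and the helper names (count_vectors,
top_right_singular_vectors) are hypothetical.

```python
import math
import random


def count_vectors(docs):
    """Dense term-count matrix: one row per document, one column per term."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    X = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for t in doc:
            row[index[t]] += 1.0
        X.append(row)
    return X


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def top_right_singular_vectors(X, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of X by power iteration."""
    rng = random.Random(seed)
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(len(X[0]))]
        for _ in range(iters):
            # Multiply by X^T X without forming it: first X v, then X^T (X v).
            xv = [dot(row, v) for row in X]
            v = [sum(row[j] * s for row, s in zip(X, xv)) for j in range(len(v))]
            # Keep v orthogonal to the directions already found.
            for b in basis:
                p = dot(v, b)
                v = [vi - p * bi for vi, bi in zip(v, b)]
            norm = math.sqrt(dot(v, v))
            v = [vi / norm for vi in v]
        basis.append(v)
    return basis


# Hypothetical corpus: two topics, each linked only through chains of shared terms.
docs = [
    ["rocket", "orbit"], ["orbit", "launch"], ["launch", "shuttle"],
    ["goalie", "puck"], ["puck", "hockey"], ["hockey", "rink"],
]
X = count_vectors(docs)
basis = top_right_singular_vectors(X, k=2)
latent = [[dot(row, b) for b in basis] for row in X]  # 2-d document vectors

raw_same = cosine(X[0], X[2])            # same topic, but no shared terms
lsa_same = cosine(latent[0], latent[2])  # same pair after projection
lsa_cross = cosine(latent[0], latent[5]) # different topics after projection
print(raw_same, lsa_same, lsa_cross)
```

The first and third documents share no terms, so their raw cosine similarity
is zero, but in the reduced space they point in nearly the same direction
while staying nearly orthogonal to the hockey documents. Clustering the
reduced vectors should therefore be much easier than clustering the raw ones.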

Asher Krim
Senior Software Engineer

On Fri, Mar 24, 2017 at 9:37 PM, Reth RM <reth.ik...@gmail.com> wrote:

> Hi,
>
> I am using Spark k-means to cluster records that consist of news
> documents; the vectors are created by applying tf-idf. The dataset that I am using
> for testing right now is the gold-truth classified
> http://qwone.com/~jason/20Newsgroups/
>
> The issue is that all the documents are assigned to the same cluster, while
> the other clusters contain only the vector (doc) picked as their cluster
> center (skewed clustering). What could be the possible reasons for this
> issue? Any suggestions? Should I be retuning the epsilon?
>
>
>
>
>
