As I said in my previous reply, I don't think k-means is the right tool to start with. Try LDA with k (the number of latent topics) set to 3, and go up to, say, 20. The problem likely lies in the feature vectors, about which you provided almost no information. Text is not drawn from a continuous space, so any bag-of-words approach to clustering will likely fail unless you first convert the features to a smaller, denser space.
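Something like this should get you started. It's only a minimal sketch against the Spark 2.x spark.ml API; `docs` stands in for a DataFrame with a "features" column of term-count vectors (note that LDA expects raw counts rather than tf-idf weights):

import org.apache.spark.ml.clustering.LDA

// Sweep the number of topics and compare model fit at each k.
for (k <- Seq(3, 5, 10, 20)) {
  val lda = new LDA()
    .setK(k)
    .setMaxIter(50)
    .setSeed(1L)  // fix the seed so runs are comparable
  val model = lda.fit(docs)

  // Lower perplexity (ideally measured on held-out data) indicates a better fit.
  println(s"k=$k logLikelihood=${model.logLikelihood(docs)} " +
    s"logPerplexity=${model.logPerplexity(docs)}")
}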
Asher Krim
Senior Software Engineer

On Wed, Mar 29, 2017 at 5:49 PM, Reth RM <reth.ik...@gmail.com> wrote:

> Hi Krim,
>
> The dataset that I am experimenting with is gold-truth and it has 3
> types of docs: one with terms relevant to topic1 (sports), another with
> topic2 (technology), and a third with topic3 (biology), so k is set to
> 3, and the features are distinct in each topic (total features close to
> 1230). I think the issue is with centroid convergence. I have been
> testing with different iteration counts, assuming that with a higher
> iteration count the centroids would converge at one point and stop
> shifting, so that 'computeCost' would stay roughly constant. However,
> when I test with incremental iteration counts and obtain the cost at
> each count (or in windows of 5 iterations), the cost keeps shifting.
> The table below shows iteration count vs cost. I also passed different
> epsilon values, thinking that might lead to consistent convergence, but
> no luck. Screenshot [1] shows the different iteration counts and
> epsilon values vs cost.
>
> Any thoughts on what I am doing wrong here?
>
> Iterations  Cost
>  3          1.841406859
>  4          1.750348983
>  5          1.514564993
>  6          1.514564993
>  7          1.514564993
>  8          1.514564993
>  9          1.514564993
> 10          1.514564993
> 11          1.514564993
> 12          1.514564993
> 13          1.750348983
> 14          1.750348983
> 15          1.514564993
> 16          1.514564993
> 17          1.514564993
> 18          1.514564993
> 19          1.514564993
> 20          1.750348983
>
> [1] https://s04.justpaste.it/files/justpaste/d417/a15312908/screen_shot_2017-03-29_at_2_46_42_pm.png
>
> On Sun, Mar 26, 2017 at 4:46 AM, Asher Krim <ak...@hubspot.com> wrote:
>
>> Hi,
>>
>> Do you mean that you're running K-Means directly on tf-idf bag-of-words
>> vectors? I think your results are expected because of the general lack
>> of overlap between one-hot encoded vectors. The similarity between most
>> vectors is expected to be very close to zero. Those that do end up in
>> the same cluster likely share a lot of boilerplate text (assuming the
>> training data comes from crawled news articles, they likely have
>> similar menus and header/footer text).
>>
>> I would suggest you try some dimensionality reduction on the tf-idf
>> vectors first. You have many options to choose from (LSA, LDA,
>> doc2vec, etc.). Other than that, this isn't a Spark question.
>>
>> Asher Krim
>> Senior Software Engineer
>>
>> On Fri, Mar 24, 2017 at 9:37 PM, Reth RM <reth.ik...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark k-means for clustering records that consist of news
>>> documents; the vectors are created by applying tf-idf. The dataset I
>>> am using for testing right now is the gold-truth classified
>>> http://qwone.com/~jason/20Newsgroups/
>>>
>>> The issue is that all the documents are getting assigned to the same
>>> cluster, and the other clusters each contain just the one vector (doc)
>>> that was picked as the cluster center (skewed clustering). What could
>>> be the possible reasons for this issue, any suggestions? Should I be
>>> retuning the epsilon?
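P.S. On the oscillating cost in your table above: if you are on the RDD-based MLlib API (you mention epsilon), the centroid initialization seed defaults to a random value, so each run with a different iteration count is an independent run that can land in a different local minimum; the cost is not expected to be monotone across those runs. Pinning the seed makes the runs comparable. A minimal sketch against the DataFrame-based spark.ml API, where the equivalent of epsilon is setTol; `tfidfDocs` is a placeholder for your DataFrame of tf-idf vectors:

import org.apache.spark.ml.clustering.KMeans

for (iters <- 3 to 20) {
  val kmeans = new KMeans()
    .setK(3)
    .setMaxIter(iters)
    .setSeed(42L)  // same initialization for every run
    .setTol(1e-6)  // convergence threshold ("epsilon" in the RDD-based API)
  val model = kmeans.fit(tfidfDocs)

  // Within-set sum of squared distances to the nearest centroid; with a
  // fixed seed this should be non-increasing as maxIter grows.
  println(s"maxIter=$iters cost=${model.computeCost(tfidfDocs)}")
}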