As I said in my previous reply, I don't think k-means is the right tool to
start with. Try LDA with k (the number of latent topics) set to 3, and go up
to, say, 20. The problem likely lies in the feature vectors, about which you
have provided almost no information. Text is not drawn from a continuous
space, so any bag-of-words approach to clustering will likely fail unless
you first convert the features to a smaller, denser space.
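
For example, with the Spark ML DataFrame API the sweep could look roughly
like this (a rough, untested sketch; "docs" and the column names are
placeholders). Note that Spark's LDA expects term-count vectors (e.g. from
CountVectorizer) rather than tf-idf weights:

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer

// docs: DataFrame with a "words" column of tokenized documents (placeholder)
val counts = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(10000)
  .fit(docs)
  .transform(docs)

// sweep k from 3 up to 20 and compare perplexity (ideally on held-out docs)
for (k <- 3 to 20) {
  val model = new LDA().setK(k).setMaxIter(50).fit(counts)
  println(s"k=$k  perplexity=${model.logPerplexity(counts)}")
}

Lower perplexity (or higher log-likelihood) on held-out documents is the
usual way to pick among values of k.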

Asher Krim
Senior Software Engineer

On Wed, Mar 29, 2017 at 5:49 PM, Reth RM <reth.ik...@gmail.com> wrote:

> Hi Krim,
>
>   The dataset I am experimenting with is ground-truth labeled and has 3
> types of docs: one with terms relevant to topic 1 (sports), another with
> topic 2 (technology), and a third with topic 3 (biology). So k is set to 3,
> and the features are distinct within each topic (close to 1,230 features in
> total). I think the issue is with centroid convergence. I have been testing
> with different iteration counts, assuming that with a higher iteration
> count the centroids would converge at some point and stop shifting, so that
> 'computeCost' would stay roughly the same. However, when I test with
> increasing iteration counts and record the cost at each count (or in
> windows of 5 iterations), the cost keeps shifting. The table below lists
> iteration count vs. cost; a rough sketch of the test loop is included after
> the table. I also passed different epsilon values, thinking that might lead
> to consistent convergence, but no luck. Screenshot
> <https://s04.justpaste.it/files/justpaste/d417/a15312908/screen_shot_2017-03-29_at_2_46_42_pm.png>[1]
> shows the different iteration counts and epsilon values vs. cost.
>
>
> Any thoughts on what I am doing wrong here?
>
>
> *3* *1.841406859*
> *4* *1.750348983*
> *5* *1.514564993*
> 6 1.514564993
> 7 1.514564993
> 8 1.514564993
> 9 1.514564993
> 10 1.514564993
> 11 1.514564993
> 12 1.514564993
> *13* *1.750348983*
> *14* *1.750348983*
> *15* *1.514564993*
> 16 1.514564993
> 17 1.514564993
> 18 1.514564993
> *19* *1.514564993*
> *20* *1.750348983*
>
> [1] https://s04.justpaste.it/files/justpaste/d417/a15312908/screen_shot_2017-03-29_at_2_46_42_pm.png
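>
> The test loop is roughly the following (a sketch with the RDD-based API;
> variable names are simplified, not the exact code):
>
> import org.apache.spark.mllib.clustering.KMeans
>
> // tfidfVectors: RDD[Vector] of the tf-idf document vectors (placeholder)
> for (iters <- 3 to 20) {
>   val model = new KMeans()
>     .setK(3)
>     .setMaxIterations(iters)
>     .setEpsilon(1e-4)          // this value was varied across runs
>     .run(tfidfVectors)
>   println(s"$iters ${model.computeCost(tfidfVectors)}")
> }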
>
>
>
>
> On Sun, Mar 26, 2017 at 4:46 AM, Asher Krim <ak...@hubspot.com> wrote:
>
>> Hi,
>>
>> Do you mean that you're running K-Means directly on tf-idf bag-of-words
>> vectors? I think your results are expected because of the general lack of
>> overlap between such sparse, high-dimensional vectors. The similarity
>> between most vectors is expected to be very close to zero. Those that do
>> end up in the same cluster likely have a lot of similar boilerplate text
>> (assuming the training data comes from crawled news articles, they likely
>> have similar menus and header/footer text).
>>
>> I would suggest you try some dimensionality reduction on the tf-idf
>> vectors first. You have many options to choose from (LSA, LDA, doc2vec,
>> etc.). Other than that, this isn't really a Spark question.
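>>
>> For example, an LSA-style reduction with the RDD-based API might look
>> roughly like this (a sketch; "tfidf" is a placeholder for your existing
>> RDD[Vector] of tf-idf vectors):
>>
>> import org.apache.spark.mllib.clustering.KMeans
>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>
>> val mat = new RowMatrix(tfidf)
>> // keep ~100 latent dimensions; rows of U give each document's coordinates
>> // in the reduced space (up to scaling by the singular values)
>> val svd = mat.computeSVD(100, computeU = true)
>> val reduced = svd.U.rows
>> val model = new KMeans().setK(3).setMaxIterations(50).run(reduced)
>> println(model.computeCost(reduced))
>>
>> Clustering the 100-dimensional rows should behave much better than
>> clustering the raw sparse vectors.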
>>
>> Asher Krim
>> Senior Software Engineer
>>
>> On Fri, Mar 24, 2017 at 9:37 PM, Reth RM <reth.ik...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>   I am using Spark k-means to cluster records consisting of news
>>> documents; the vectors are created by applying tf-idf. The dataset I am
>>> using for testing right now is the ground-truth-classified 20 Newsgroups
>>> collection: http://qwone.com/~jason/20Newsgroups/
>>>
>>> The issue is that all the documents get assigned to the same cluster,
>>> while the other clusters just contain the single vector (doc) that was
>>> picked as the cluster center (skewed clustering). What could be the
>>> possible reasons for this issue, and do you have any suggestions? Should
>>> I be re-tuning the epsilon?
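>>>
>>> For context, the pipeline is roughly the following (a sketch with the
>>> RDD-based API; variable names are placeholders, not the exact code):
>>>
>>> import org.apache.spark.mllib.clustering.KMeans
>>> import org.apache.spark.mllib.feature.{HashingTF, IDF}
>>>
>>> // tokenized: RDD[Seq[String]] of tokenized documents (placeholder)
>>> val tf = new HashingTF().transform(tokenized)
>>> tf.cache()
>>> val tfidf = new IDF().fit(tf).transform(tf)
>>> val model = new KMeans().setK(3).setMaxIterations(20).run(tfidf)
>>> println(model.computeCost(tfidf))
>>> model.predict(tfidf).countByValue().foreach(println)   // cluster sizes show the skew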
>>>
>>>
>>>
>>>
>>>
>>
>>
>
