I used Spark's word2vec algorithm to compute the document vector of a text.

I then used the findSynonyms function of the model object to get synonyms
of a few words.
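
For reference, here is a minimal sketch of the kind of pipeline I mean (the
toy data, column names, and parameter values are just for illustration, not
my real job):

    import org.apache.spark.ml.feature.Word2Vec

    // Each row holds one tokenized document in the "text" column.
    val docs = spark.createDataFrame(Seq(
      "hi i heard about spark".split(" "),
      "i wish java could use case classes".split(" "),
      "logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
    val model = word2Vec.fit(docs)

    // Ask for the 5 nearest words; the second column holds the
    // similarity scores I am asking about.
    model.findSynonyms("spark", 5).show()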

I see something like this:

[screenshot: findSynonyms output, with similarity scores greater than 1]

I do not understand why the cosine similarity is being reported as greater
than 1. Cosine similarity should lie between 0 and 1, or at most between -1
and +1 (allowing for negative angles).

Why is it more than 1 here? What is going wrong?

Please note that normalizing the vectors should not change the cosine
similarity values, since the formula stays the same: if you normalize,
cosine similarity is just the dot product; if you don't, it is
cos(a, b) = (a . b) / (||a|| * ||b||).
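
To make that concrete, here is a small self-contained check (plain Scala,
no Spark needed) that cosine similarity is invariant to rescaling either
input and never exceeds 1 in absolute value:

    object CosineCheck extends App {
      def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => x * y }.sum

      def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

      // cos(a, b) = (a . b) / (||a|| * ||b||)
      def cosine(a: Array[Double], b: Array[Double]): Double =
        dot(a, b) / (norm(a) * norm(b))

      val a = Array(1.0, 2.0, 3.0)
      val b = Array(4.0, 5.0, 6.0)

      println(cosine(a, b))             // ~0.9746
      println(cosine(a.map(_ * 10), b)) // same value: the scaling cancels out
    }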

I am facing a lot of issues with understanding and interpreting the output
of Spark's ML algorithms. The documentation is not very clear, and there is
hardly anything said about what is being returned and how.

For example, the word2vec algorithm converts words to vector form, so I
would expect the .transform method to give me a vector for each word in the
text.

However, .transform actually returns a doc2vec-style vector (it averages
all the word vectors of a text). This is confusing, since none of this is
mentioned in the docs, and I kept wondering why I had only one vector
instead of word vectors for all the words.

I do understand that returning a doc2vec-style average is helpful, since
one no longer has to average the word vectors over the whole text oneself.
But the docs do not say that explicitly.
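
For what it's worth, the per-word vectors do appear to be reachable through
the model's getVectors method, while transform gives the averaged document
vector (continuing the sketch above; I am assuming the same model and docs):

    // Per-word vectors: a DataFrame with "word" and "vector" columns.
    model.getVectors.show()

    // Document vectors: transform appends one averaged vector per document row.
    model.transform(docs).select("result").show()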

This ends up wasting a lot of time just figuring out what a Spark algorithm
returns.

Does anyone have a better solution for this?

I have read the Spark book, but it is not about MLlib. I am not sure
whether Sean's book covers the documentation aspects better than what we
currently have on the docs page.

Thanks


