Re: Cosine Similarity of Word2Vec algo more than 1?

Sean Owen Thu, 29 Dec 2016 14:28:49 -0800

Yes, the vectors are not otherwise normalized.
You are basically getting the cosine similarity, but times the norm of the
word vector you supplied, because it's not divided through. You could just
divide the results yourself.
I don't think it will be back-ported because the behavior was intended
in 1.x, just wrongly documented, and we don't want to change the behavior
in 1.x. The results are still correctly ordered anyway.
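
To make the "divide the results yourself" step concrete, here is a minimal sketch in plain NumPy rather than Spark. It assumes the 1.x scores are dot(query, w) / norm(w) for each vocabulary vector w, as described above. The function names and the toy vectors are hypothetical, chosen only for illustration.

```python
import numpy as np

def spark1_style_scores(query, word_vectors):
    """Assumed 1.x behavior: dot(query, w) / norm(w) for each row w.

    The query's own norm is never divided out, so scores can exceed 1.
    """
    norms = np.linalg.norm(word_vectors, axis=1)
    return word_vectors @ query / norms

def to_cosine(scores, query):
    """Divide by the query norm to recover the true cosine similarity."""
    return scores / np.linalg.norm(query)

# Toy data (hypothetical): a query vector of norm 5 and three candidates.
query = np.array([3.0, 4.0])
word_vectors = np.array([
    [6.0, 8.0],    # same direction as the query
    [4.0, -3.0],   # orthogonal to the query
    [1.0, 0.0],
])

raw = spark1_style_scores(query, word_vectors)   # cosine * norm(query)
cosine = to_cosine(raw, query)                   # values in [-1, 1]

print("raw scores:", raw)
print("cosine    :", cosine)
```

Because every raw score is scaled by the same positive constant, norm(query), the ranking is unchanged and only the magnitudes differ.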


On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com>
wrote:

> Sean,
>
> Thanks for the answer. I am using Spark 1.6, so are you saying the output I
> am getting is cos(A,B)=dot(A,B)/norm(A)?
>
> My point with respect to normalization was that whether or not you normalize
> both vectors A and B, the output would be the same. If I normalize A and B,
> then
>
> cos(A,B) = dot(A,B)/(norm(A)*norm(B)), and since each norm is 1 it is just
> dot(A,B). If we don't normalize, the norms stay in the denominator, so the
> output is the same.
>
> But I understand you are saying in Spark 1.x, one vector was not
> normalized. If that is the case then it makes sense.
>
> Any idea how to fix this (get the right cosine similarity) in Spark 1.x?
> If the input word in findSynonyms is not normalized while calculating
> cosine, then doing w2vmodel.transform(input_word) to get a vector
> representation and then dividing the current result by the norm of this
> vector should be correct?
>
> Also, I am very open to editing the docs on things I find not properly
> documented or wrong, but I need to know if that is allowed (is it like a
> Wiki)?
>
> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:
>
> It should be the cosine similarity, yes. I think this is what was fixed in
> https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
> The normalization most certainly matters here, and it's the opposite:
> dividing the dot by vec norms gives you the cosine.
>
> Although docs can always be better (and here was a case where it was
> wrong) all of this comes with javadoc and examples. Right now at least,
> .transform() describes the operation as you do, so it is documented. I'd
> propose you invest in improving the docs rather than saying 'this isn't
> what I expected'.
>
> (No, our book isn't a reference for MLlib, more like worked examples)
>
> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
> I used Spark's word2vec algorithm to compute the document vector of a text.
>
> I then used the findSynonyms function of the model object to get synonyms
> of a few words.
>
> I see something like this:
>
>
> 
>
> I do not understand why the cosine similarity is being calculated as more
> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
> (taking negative angles).
>
> Why is it more than 1 here? What's going wrong?
>
> Please note, normalization of the vectors should not change the cosine
> similarity values since the formula remains the same. If you normalize,
> it's just a dot product; if you don't, it's
> dot product/(normA*normB).
>
> I am facing a lot of issues with respect to understanding or interpreting
> the output of Spark's ML algorithms. The documentation is not very clear and
> there is hardly anything mentioned with respect to how and what is being
> returned.
>
> For example, the word2vec algorithm converts words to vector form. So I
> would expect the .transform method to give me a vector for each word in the
> text.
>
> However, .transform basically returns doc2vec (averages all word vectors of
> a text). This is confusing since none of this is mentioned in the docs
> and I keep wondering why I have only one vector instead of word vectors
> for all words.
>
> I do understand that returning doc2vec is helpful, since now one doesn't
> have to average the word vectors over the whole text. But the docs don't
> help or explicitly say that.
>
> This ends up wasting a lot of time just figuring out what is being
> returned from a Spark algorithm.
>
> Does someone have a better solution for this?
>
> I have read the Spark book. That is not about MLlib. I am not sure if
> Sean's book would cover the documentation aspects better than what we
> currently have on the docs page.
>
> Thanks
>
