OK, got it. I understand that the ordering won't change. I just wanted to make sure I am getting the right thing, or rather that I understand what I am getting, since the values didn't make sense going by the cosine calculation.
One last confirmation, and I appreciate all the time you are spending on replies: in the linked GitHub issue it's mentioned that fVector is not normalised. Does fVector mean the word vector we want to find synonyms for? In my example, that would be the vector for the word 'science' which I passed to the method? If yes, then I guess the solution should be simple: just divide the current cosine output by the norm of this vector. And we can get this vector by doing model.transform('science'), if I am right?

Lastly, I would be very happy to update the docs, if they are editable, for all the things I encounter that are not mentioned or not very clear.
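To make sure I've understood the fix, here is a minimal sketch of what I am planning to do. To be clear about my assumptions: this assumes the pyspark.mllib Word2VecModel API in 1.6, where model.transform(word) returns that single word's vector and findSynonyms(word, n) returns (word, score) pairs; 'model', 'tokenized_rdd', and the query word 'science' are just names from my own pipeline, not anything from the docs:

    import numpy as np

    # 'model' is assumed to be the fitted pyspark.mllib.feature.Word2VecModel
    # from earlier in my pipeline, e.g.:
    # from pyspark.mllib.feature import Word2Vec
    # model = Word2Vec().fit(tokenized_rdd)

    query = "science"

    # transform(word) returns that single word's vector in the mllib API.
    query_norm = np.linalg.norm(model.transform(query).toArray())

    # In Spark 1.x the reported score is cosine(query, candidate) times
    # norm(query), so dividing each score by the query norm should recover
    # the true cosine similarity.
    synonyms = [(word, score / query_norm)
                for word, score in model.findSynonyms(query, 10)]

If that's right, the ordering is untouched and only the scores shrink by a constant factor, which matches what you said about the ranking being unaffected.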
On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes, the vectors are not otherwise normalized. You are basically getting
> the cosine similarity, but times the norm of the word vector you
> supplied, because it's not divided through. You could just divide the
> results yourself.
>
> I don't think it will be back-ported, because the behavior was intended
> in 1.x, just wrongly documented, and we don't want to change the
> behavior in 1.x. The results are still correctly ordered anyway.
>
> On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
>> Sean,
>>
>> Thanks for the answer. I am using Spark 1.6, so are you saying the
>> output I am getting is cos(A, B) = dot(A, B) / norm(A)?
>>
>> My point with respect to normalization was that whether you normalize
>> both vectors A and B or neither, the output would be the same. If I
>> normalize A and B, then cos(A, B) = dot(A, B) / (norm(A) * norm(B)),
>> and since each norm is 1, it is just dot(A, B). If we don't normalize,
>> the norms simply appear in the denominator, so the output is the same.
>>
>> But I understand you are saying that in Spark 1.x one vector was not
>> normalized. If that is the case, then it makes sense.
>>
>> Any idea how to fix this (get the right cosine similarity) in Spark
>> 1.x? If the input word in findSynonyms is not normalized while
>> calculating the cosine, then doing w2vmodel.transform(input_word) to
>> get a vector representation and then dividing the current result by
>> the norm of this vector should be correct?
>>
>> Also, I am very open to editing the docs on things I find not properly
>> documented or wrong, but I need to know if that is allowed (is it like
>> a wiki)?
>>
>> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> It should be the cosine similarity, yes. I think this is what was
>> fixed in https://issues.apache.org/jira/browse/SPARK-7617 ; previously
>> it was really just outputting the 'unnormalized' similarity (dot /
>> norm(a) only), but the docs said cosine similarity. Now it's cosine
>> similarity in Spark 2. The normalization most certainly matters here,
>> and it's the opposite: dividing the dot by the vector norms gives you
>> the cosine.
>>
>> Although docs can always be better (and here was a case where they
>> were wrong), all of this comes with javadoc and examples. Right now,
>> at least, .transform() describes the operation as you do, so it is
>> documented. I'd propose you invest in improving the docs rather than
>> saying 'this isn't what I expected'.
>>
>> (No, our book isn't a reference for MLlib; it's more like worked
>> examples.)
>>
>> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com>
>> wrote:
>>
>> I used the word2vec algorithm of Spark to compute the document vector
>> of a text.
>>
>> I then used the findSynonyms function of the model object to get
>> synonyms of a few words.
>>
>> I see something like this:
>>
>> [inline screenshot of the findSynonyms output; image not included]
>>
>> I do not understand why the cosine similarity is being calculated as
>> more than 1. Cosine similarity should be between 0 and 1, or at most
>> between -1 and +1 (taking negative angles into account). Why is it
>> more than 1 here? What's going wrong?
>>
>> Please note, normalization of the vectors should not change the cosine
>> similarity values, since the formula remains the same. If you
>> normalize, it's just a dot product; if you don't, it's
>> dot(A, B) / (norm(A) * norm(B)).
>>
>> I am facing a lot of issues with respect to understanding or
>> interpreting the output of Spark's ML algos. The documentation is not
>> very clear, and there is hardly anything mentioned about how and what
>> is being returned.
>>
>> For example, the word2vec algorithm is meant to convert words to
>> vector form, so I would expect the .transform method to give me a
>> vector for each word in the text.
>>
>> However, .transform basically returns a doc2vec (it averages all the
>> word vectors of a text). This is confusing, since nothing of this is
>> mentioned in the docs, and I kept wondering why I had only one vector
>> instead of a vector for each word.
>>
>> I do understand that returning a doc2vec is helpful, since one then
>> doesn't have to average out each word vector for the whole text. But
>> the docs don't help or explicitly say that.
>>
>> This ends up wasting a lot of time in just figuring out what is being
>> returned by an algorithm in Spark.
>>
>> Does someone have a better solution for this?
>>
>> I have read the Spark book; that is not about MLlib. I am not sure if
>> Sean's book covers the documentation aspects better than what we
>> currently have on the docs page.
>>
>> Thanks