OK, got it. I understand that the ordering won't change. I just wanted to make sure I am getting the right thing, or rather that I understand what I am getting, since the values didn't make sense going by the cosine calculation.
One last confirmation, and I appreciate all the time you are spending on replies: in the linked GitHub issue it's mentioned that fVector is not normalised. Does fVector mean the word vector we want to find synonyms for? In my example, that would be the vector for the word 'science' which I passed to the method? If yes, then I guess the solution should be simple: just divide the current cosine output by the norm of this vector. And we can get this vector by doing model.transform('science'), if I am right?

Lastly, I would be very happy to update the docs, if they are editable, for all the things I encounter that are not mentioned or not very clear.
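To make sure I've understood the fix, here is a minimal sketch of what I am planning to do. To be clear about my assumptions: this assumes the pyspark.mllib Word2VecModel API in 1.6, where model.transform(word) returns that single word's vector and findSynonyms(word, n) returns (word, score) pairs; 'model', 'tokenized_rdd', and the query word 'science' are just names from my own pipeline, not anything from the docs:

    import numpy as np

    # 'model' is assumed to be the fitted pyspark.mllib.feature.Word2VecModel
    # from earlier in my pipeline, e.g.:
    # from pyspark.mllib.feature import Word2Vec
    # model = Word2Vec().fit(tokenized_rdd)

    query = "science"

    # transform(word) returns that single word's vector in the mllib API.
    query_norm = np.linalg.norm(model.transform(query).toArray())

    # In Spark 1.x the reported score is cosine(query, candidate) times
    # norm(query), so dividing each score by the query norm should recover
    # the true cosine similarity.
    synonyms = [(word, score / query_norm)
                for word, score in model.findSynonyms(query, 10)]

If that's right, the ordering is untouched and only the scores shrink by a constant factor, which matches what you said about the ranking being unaffected.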
On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes, the vectors are not otherwise normalized. You are basically getting
> the cosine similarity, but times the norm of the word vector you
> supplied, because it's not divided through. You could just divide the
> results yourself.
>
> I don't think it will be back-ported, because the behavior was intended
> in 1.x, just wrongly documented, and we don't want to change the
> behavior in 1.x. The results are still correctly ordered anyway.
>
> On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
>> Sean,
>>
>> Thanks for the answer. I am using Spark 1.6, so are you saying the
>> output I am getting is cos(A, B) = dot(A, B) / norm(A)?
>>
>> My point with respect to normalization was that whether you normalize
>> both vectors A and B or neither, the output would be the same. If I
>> normalize A and B, then cos(A, B) = dot(A, B) / (norm(A) * norm(B)),
>> and since each norm is 1, it is just dot(A, B). If we don't normalize,
>> the norms simply appear in the denominator, so the output is the same.
>>
>> But I understand you are saying that in Spark 1.x one vector was not
>> normalized. If that is the case, then it makes sense.
>>
>> Any idea how to fix this (get the right cosine similarity) in Spark
>> 1.x? If the input word in findSynonyms is not normalized while
>> calculating the cosine, then doing w2vmodel.transform(input_word) to
>> get a vector representation and then dividing the current result by
>> the norm of this vector should be correct?
>>
>> Also, I am very open to editing the docs on things I find not properly
>> documented or wrong, but I need to know if that is allowed (is it like
>> a wiki)?
>>
>> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> It should be the cosine similarity, yes. I think this is what was
>> fixed in https://issues.apache.org/jira/browse/SPARK-7617 ; previously
>> it was really just outputting the 'unnormalized' similarity (dot /
>> norm(a) only), but the docs said cosine similarity. Now it's cosine
>> similarity in Spark 2. The normalization most certainly matters here,
>> and it's the opposite: dividing the dot by the vector norms gives you
>> the cosine.
>>
>> Although docs can always be better (and here was a case where they
>> were wrong), all of this comes with javadoc and examples. Right now,
>> at least, .transform() describes the operation as you do, so it is
>> documented. I'd propose you invest in improving the docs rather than
>> saying 'this isn't what I expected'.
>>
>> (No, our book isn't a reference for MLlib; it's more like worked
>> examples.)
>>
>> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com>
>> wrote:
>>
>> I used the word2vec algorithm of Spark to compute the document vector
>> of a text.
>>
>> I then used the findSynonyms function of the model object to get
>> synonyms of a few words.
>>
>> I see something like this:
>>
>> [inline screenshot of the findSynonyms output; image not included]
>>
>> I do not understand why the cosine similarity is being calculated as
>> more than 1. Cosine similarity should be between 0 and 1, or at most
>> between -1 and +1 (taking negative angles into account). Why is it
>> more than 1 here? What's going wrong?
>>
>> Please note, normalization of the vectors should not change the cosine
>> similarity values, since the formula remains the same. If you
>> normalize, it's just a dot product; if you don't, it's
>> dot(A, B) / (norm(A) * norm(B)).
>>
>> I am facing a lot of issues with respect to understanding or
>> interpreting the output of Spark's ML algos. The documentation is not
>> very clear, and there is hardly anything mentioned about how and what
>> is being returned.
>>
>> For example, the word2vec algorithm is meant to convert words to
>> vector form, so I would expect the .transform method to give me a
>> vector for each word in the text.
>>
>> However, .transform basically returns a doc2vec (it averages all the
>> word vectors of a text). This is confusing, since nothing of this is
>> mentioned in the docs, and I kept wondering why I had only one vector
>> instead of a vector for each word.
>>
>> I do understand that returning a doc2vec is helpful, since one then
>> doesn't have to average out each word vector for the whole text. But
>> the docs don't help or explicitly say that.
>>
>> This ends up wasting a lot of time in just figuring out what is being
>> returned by an algorithm in Spark.
>>
>> Does someone have a better solution for this?
>>
>> I have read the Spark book; that is not about MLlib. I am not sure if
>> Sean's book covers the documentation aspects better than what we
>> currently have on the docs page.
>>
>> Thanks