Yes, you're just dividing by the norm of the vector you passed in. You can
look at the change on that JIRA and probably see how this was added to the
method itself.
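The correction being discussed (divide the raw Spark 1.x score by the norm
of the query vector) can be sketched in a few lines of plain Python. This is
outside Spark; `l2_norm`, `corrected_similarity`, and the numbers are
hypothetical names and values, for illustration only:

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def corrected_similarity(raw_score, query_vec):
    """Recover the cosine similarity from a Spark 1.x findSynonyms
    score by dividing out the query vector's norm, as suggested in
    the thread."""
    return raw_score / l2_norm(query_vec)

# Hypothetical numbers for illustration:
query = [3.0, 4.0]   # e.g. the result of model.transform('science'); norm = 5.0
raw_score = 2.5      # a score as returned by Spark 1.x findSynonyms
print(corrected_similarity(raw_score, query))  # -> 0.5
```

With real data, `query` would be the vector returned by
`model.transform(input_word)` for the word passed to `findSynonyms`.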
On Thu, Dec 29, 2016 at 10:34 PM Manish Tripathi <tr.man...@gmail.com> wrote:

> Ok, got that. I understand that the ordering won't change. I just wanted
> to make sure I am getting the right thing, and that I understand what I
> am getting, since it didn't make sense going by the cosine calculation.
>
> One last confirmation, and I appreciate all the time you are spending to
> reply:
>
> In the GitHub issue link it's mentioned that fVector is not normalized.
> By fVector, is the word vector we want to find synonyms for meant? So in
> my example, it would be the vector for the word 'science' which I passed
> to the method?
>
> If yes, then I guess the solution should be simple: just divide the
> current cosine output by the norm of this vector. And we can get this
> vector by doing model.transform('science'), if I am right?
>
> Lastly, I would be very happy to update the docs, if they are editable
> by everyone, for all the things I encounter that are not mentioned or
> not very clear.
>
> On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen <so...@cloudera.com> wrote:
>
> Yes, the vectors are not otherwise normalized. You are basically getting
> the cosine similarity, but times the norm of the word vector you
> supplied, because it's not divided through. You could just divide the
> results yourself.
>
> I don't think it will be back-ported, because the behavior was intended
> in 1.x, just wrongly documented, and we don't want to change the
> behavior in 1.x. The results are still correctly ordered anyway.
>
> On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
> Sean,
>
> Thanks for the answer. I am using Spark 1.6, so are you saying the
> output I am getting is cos(A,B) = dot(A,B)/norm(A)?
>
> My point with respect to normalization was that whether or not you
> normalize both vectors A and B, the output would be the same. If I
> normalize A and B, then
>
> cos(A,B) = dot(A,B) / (norm(A) * norm(B)); since the norms are 1, it is
> just dot(A,B).
> If we don't normalize, there would be a norm in the denominator, so the
> output is the same.
>
> But I understand you are saying that in Spark 1.x, one vector was not
> normalized. If that is the case, then it makes sense.
>
> Any idea how to fix this (get the right cosine similarity) in Spark 1.x?
> If the input word in findSynonyms is not normalized while calculating
> the cosine, then doing w2vmodel.transform(input_word) to get a vector
> representation, and then dividing the current result by the norm of this
> vector, should be correct?
>
> Also, I am very open to editing the docs on things I find not properly
> documented or wrong, but I need to know if that is allowed (is it like a
> wiki)?
>
> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:
>
> It should be the cosine similarity, yes. I think this is what was fixed
> in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
> really just outputting the 'unnormalized' similarity (dot / norm(a)
> only), but the docs said cosine similarity. Now it's cosine similarity
> in Spark 2. The normalization most certainly matters here, and it's the
> opposite: dividing the dot product by the vector norms gives you the
> cosine.
>
> Although docs can always be better (and here was a case where they were
> wrong), all of this comes with javadoc and examples. Right now at least,
> .transform() describes the operation as you do, so it is documented. I'd
> propose you invest in improving the docs rather than saying 'this isn't
> what I expected'.
>
> (No, our book isn't a reference for MLlib; it's more like worked
> examples.)
>
> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
> I used the word2vec algorithm of Spark to compute document vectors of a
> text.
>
> I then used the findSynonyms function of the model object to get
> synonyms of a few words.
>
> I see something like this:
>
> (image omitted)
>
> I do not understand why the cosine similarity is being calculated as
> more than 1.
> Cosine similarity should be between 0 and 1, or at most between -1 and
> +1 (allowing negative angles). Why is it more than 1 here? What's going
> wrong?
>
> Please note, normalization of the vectors should not change the cosine
> similarity values, since the formula remains the same. If you normalize,
> it's just a dot product; if you don't, it's dot product /
> (norm(A) * norm(B)).
>
> I am facing a lot of issues with respect to understanding and
> interpreting the output of Spark's ML algorithms. The documentation is
> not very clear, and there is hardly anything mentioned with respect to
> how and what is being returned.
>
> For example, the word2vec algorithm converts a word to vector form. So I
> would expect the .transform method to give me a vector for each word in
> the text.
>
> However, .transform basically returns a doc2vec (it averages all the
> word vectors of a text). This is confusing, since none of this is
> mentioned in the docs, and I kept wondering why I had only one word
> vector instead of word vectors for all the words.
>
> I do understand that returning a doc2vec is helpful, since now one
> doesn't have to average out each word vector for the whole text. But the
> docs don't help or explicitly say that.
>
> This ends up wasting a lot of time in just figuring out what is being
> returned from an algorithm in Spark.
>
> Does someone have a better solution for this?
>
> I have read the Spark book. That is not about MLlib. I am not sure if
> Sean's book would cover all the documentation aspects better than what
> we currently have on the docs page.
>
> Thanks
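Both points made in the thread, that pre-normalizing both vectors leaves the
cosine unchanged, and that dividing by only one norm (the Spark 1.x
behavior) can produce values above 1, can be checked numerically. A minimal
plain-Python sketch; the vectors `a` and `b` are arbitrary examples:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Full cosine similarity: dot(A,B) / (norm(A) * norm(B)).
cos_full = dot(a, b) / (norm(a) * norm(b))

# Pre-normalizing both vectors gives the same value, as argued above.
cos_pre = dot(normalize(a), normalize(b))
assert abs(cos_full - cos_pre) < 1e-12

# Dividing the dot product by only one norm is not the same thing,
# and the result can exceed 1 -- which explains scores above 1.
partial = dot(a, b) / norm(a)
print(cos_full)  # a value in [-1, 1]
print(partial)   # greater than 1 for these vectors
```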