On Sat, Dec 26, 2009 at 12:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> These are fine as distance measures. It is also common to use
> sqrt(1 - cos^2), which is more like an angle, but 1 - cos is good enough
> for almost anything.
>
> With normal text, btw, all of the coordinates are positive, so the
> largest possible angle is pi/2 (cos = 0, sin = 1).

I guess what I was saying is that if you take a less "normal" representation of text (a random projection, say, or a projection onto the SVD, etc.), you can get negative similarities which make sense. In that case you have similarity == 1 for perfect alignment, 0 for uncorrelated, and -1 for anti-parallel, and you definitely *want* -1, not +1.

Going with sqrt(1 - cos^2) = sin(theta) ~=~ theta is only good for small angles. For large angles this isn't so great anymore, and once the angle goes over pi/2 it's no longer even monotonic, which is most certainly the wrong thing. That's why I usually stick with 1 - cos for distance when I'm not measuring similarity.

I guess my question to you, Robin, is: why would you take the abs? If the data is text, then in a normal representation your coefficients are always positive, so all cosines are greater than zero and there's no need to take the abs, right?

The only case where I'd imagine wanting to treat anti-parallel as basically the same as parallel is collaborative filtering, where, as we've discussed on this list in the past, a negative rating is sometimes as much a measure of similarity as a positive one. If you've mean-centered your ratings, then you do want dot products, which effectively take the abs as well. I'd say that is the exception, not the norm, however.
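To illustrate (plain Python with dense vectors, just as a sketch rather than Mahout's Java): 1 - cos distinguishes anti-parallel (distance 2) from parallel (distance 0), while sin(theta) = sqrt(1 - cos^2) collapses both to 0.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Parallel, perpendicular, and anti-parallel pairs.
pairs = [((1, 1), (3, 3)), ((1, 0), (0, 1)), ((-1, -1), (3, 3))]
for a, b in pairs:
    c = cosine(a, b)
    # 1 - cos grows monotonically with the angle: 0, 1, 2.
    # sin folds back past pi/2: 0, 1, 0 -- anti-parallel looks like parallel.
    print(a, b, "1-cos =", 1 - c, "sin =", math.sqrt(max(0.0, 1 - c * c)))
```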
-jake

> On Sat, Dec 26, 2009 at 10:53 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > I ran the Cosine and Tanimoto distance measures (d = 1 - similarity
> > measure) on the following vector pairs:
> >
> >   (-1, -1) and (3, 3):  Cosine: 2.0                 Tanimoto: 1.2307692307692308
> >   (1, 1) and (3, 3):    Cosine: 0.0                 Tanimoto: 0.5714285714285714
> >   (1, 8) and (8, 1):    Cosine: 0.7538461538461538  Tanimoto: 0.8596491228070176
> >
> > How should anti-parallel vectors be treated in the Mahout clustering
> > packages? Is it acceptable to return 2.0 for anti-parallel vectors and
> > 1.0 for perpendicular vectors? In the case of text data the vectors are
> > positive, but if scientific data is being clustered, what should the
> > default behaviour be, given that clustering is always trying to find a
> > configuration where the distances are at a minimum? Since I have dealt
> > mostly with text data, I would always take the abs value of the cosine
> > similarity before subtracting from 1.0. Has anyone of you encountered
> > such a situation wrt some particular dataset?
> >
> > Robin
>
> --
> Ted Dunning, CTO
> DeepDyve
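For reference, the three rows in Robin's message can be reproduced with a few lines of plain Python (a sketch of the d = 1 - similarity convention, not the actual Mahout DistanceMeasure implementations):

```python
import math

def cosine_distance(a, b):
    """d = 1 - cosine similarity; ranges over [0, 2] for signed vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def tanimoto_distance(a, b):
    """d = 1 - extended-Jaccard (Tanimoto) similarity: dot / (|a|^2 + |b|^2 - dot)."""
    dot = sum(x * y for x, y in zip(a, b))
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    return 1.0 - dot / (aa + bb - dot)

for a, b in [((-1, -1), (3, 3)), ((1, 1), (3, 3)), ((1, 8), (8, 1))]:
    print(a, b, cosine_distance(a, b), tanimoto_distance(a, b))
```

Note that the anti-parallel pair gives a Tanimoto distance above 1.0, for the same reason the cosine distance reaches 2.0: the dot product goes negative once coordinates can be negative.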