Re: Cosine and Tanimoto Similarity

2009-12-27 Thread Ted Dunning
Floating point precision is not an issue with any of these metrics since the counts you are dealing with are never large enough for the numerical error (relative precision of roughly 10^-7 for float, 10^-16 for double) to outweigh the statistical uncertainty (roughly sqrt(number of observations)). A much larger proble
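
A minimal sketch of the comparison made above, assuming IEEE 754 single and double precision and taking sqrt(n) as the absolute uncertainty of a count n. The helper name `relative_count_uncertainty` is made up for illustration and is not part of any library.

```python
import math
import sys

# Machine epsilon of IEEE 754 single precision (24-bit significand).
FLOAT32_EPS = 2.0 ** -23
# Machine epsilon of the platform double, as reported by Python.
FLOAT64_EPS = sys.float_info.epsilon


def relative_count_uncertainty(n):
    """Relative statistical uncertainty of a count n, assuming an
    absolute uncertainty of sqrt(n) (hypothetical helper)."""
    return math.sqrt(n) / n


for n in (100, 10_000, 1_000_000, 1_000_000_000):
    rel = relative_count_uncertainty(n)
    print(
        f"n={n:>13,d}  statistical={rel:.1e}  "
        f"float32 eps={FLOAT32_EPS:.1e}  float64 eps={FLOAT64_EPS:.1e}"
    )
```

Under these assumptions the relative statistical uncertainty, 1/sqrt(n), stays above the single-precision epsilon until n reaches roughly 7 * 10^13, so rounding is not the dominant error for realistic counts.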

Re: Cosine and Tanimoto Similarity

2009-12-27 Thread Robin Anil
One thing I found very irritating when using cosine or numbers in the range [0,1] is that sometimes two distinct items have very small values of distance when you inspect them. I am always worried that the precision of float is not enough to capture the small detail that makes the difference of accept o

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Ted Dunning
As distance goes, I prefer either angle in the 0 to pi range or Euclidean distance in the range 0 to 2. You are correct that it is weird that most things are at distance pi/2 or 1, but that is the price of living on an n-sphere. For similarity, the only thing that really matters is that 0 is real
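
A small sketch of the two distances mentioned above, assuming both vectors are non-zero and are scaled to unit length first. The function names `unit`, `angle_distance`, and `chord_distance` are illustrative only.

```python
import math


def unit(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angle_distance(a, b):
    """Angle between a and b in radians, in the range 0 to pi."""
    ua, ub = unit(a), unit(b)
    cos = sum(x * y for x, y in zip(ua, ub))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos)))


def chord_distance(a, b):
    """Euclidean distance between the unit-length versions of a and b,
    in the range 0 to 2."""
    ua, ub = unit(a), unit(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ua, ub)))


for a, b in [((1, 1), (3, 3)), ((1, 0), (0, 1)), ((-1, -1), (3, 3))]:
    print(a, b, angle_distance(a, b), chord_distance(a, b))
```

The pairs cover the parallel, orthogonal, and antiparallel cases; antiparallel vectors sit at the far end of both ranges (angle pi, Euclidean distance 2).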

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Jake Mannix
On Sat, Dec 26, 2009 at 2:47 PM, Ted Dunning wrote:
> One minor additional point is that you might want to use (1-cos)/2 in order
> to get a result in [0,1].

For distance, yeah, this can be fine, but for vectors which can have negative components, I don't like doing that with similarity (where

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Ted Dunning
One minor additional point is that you might want to use (1-cos)/2 in order to get a result in [0,1].

On Sat, Dec 26, 2009 at 1:32 PM, Jake Mannix wrote:
> On Sat, Dec 26, 2009 at 12:18 PM, Ted Dunning wrote:
> > These are fine as distance measures. It is also common to use
> > sqrt(1-cos^
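
As a quick illustration of the rescaling suggested above, here is a sketch that maps a cosine value in [-1, 1] to a distance in [0, 1]. The function name `scaled_cosine_distance` is hypothetical.

```python
def scaled_cosine_distance(cos_value):
    """Map a cosine in [-1, 1] to a distance in [0, 1] via (1 - cos) / 2."""
    return (1.0 - cos_value) / 2.0


# Parallel, orthogonal, and antiparallel vectors respectively.
for cos_value in (1.0, 0.0, -1.0):
    print(cos_value, scaled_cosine_distance(cos_value))
```

With this scaling, orthogonal vectors land at 0.5 and only antiparallel vectors reach 1.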

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Jake Mannix
On Sat, Dec 26, 2009 at 12:18 PM, Ted Dunning wrote:
> These are fine as distance measures. It is also common to use
> sqrt(1-cos^2)
> which is more like an angle, but 1-cos is good enough for almost anything.
>
> With normal text, btw, all of the coordinates are positive so the largest
> possib

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Ted Dunning
These are fine as distance measures. It is also common to use sqrt(1-cos^2) which is more like an angle, but 1-cos is good enough for almost anything.

With normal text, btw, all of the coordinates are positive so the largest possible angle is pi/2 (cos = 0, sin = 1).

On Sat, Dec 26, 2009 at 10:5
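
A sketch comparing the two measures named above, 1-cos and sqrt(1-cos^2), assuming non-zero input vectors. The function names are illustrative, not taken from any library.

```python
import math


def cosine(a, b):
    """Cosine of the angle between a and b (assumes neither is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def one_minus_cos(a, b):
    """Distance defined as 1 - cos."""
    return 1.0 - cosine(a, b)


def sine_distance(a, b):
    """Distance defined as sqrt(1 - cos^2), i.e. the sine of the angle."""
    c = cosine(a, b)
    # Guard against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, 1.0 - c * c))


# With non-negative coordinates the cosine cannot be negative,
# so orthogonal vectors are as far apart as it gets.
for a, b in [((1, 0), (0, 1)), ((1, 8), (8, 1)), ((1, 1), (3, 3))]:
    print(a, b, one_minus_cos(a, b), sine_distance(a, b))
```

One design point to keep in mind: sqrt(1-cos^2) gives 0 for antiparallel vectors as well as parallel ones, so the two measures only agree on ordering when the coordinates are non-negative.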

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Robin Anil
The antiparallel concept doesn't arise in text data, where all the weights are positive. Think about it: you really can't have a document where the word apple occurs -3 times. But if you consider data which actually have negative weights (I also haven't encountered any such), then the measure is subject to interp

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Jake Mannix
Sorry, misfire! I've usually tried to maximize similarity, without ever using abs, even on text. Antiparallel is dissimilar, no?

On Dec 26, 2009 11:12 AM, "Jake Mannix" wrote:
> I've never treated text any differently, and
>
> On Dec 26, 2009 10:54 AM, "Robin Anil" wrote:
> > I ran Cosine and

Re: Cosine and Tanimoto Similarity

2009-12-26 Thread Jake Mannix
I've never treated text any differently, and

On Dec 26, 2009 10:54 AM, "Robin Anil" wrote:
I ran Cosine and tanimoto distance measure ( d = 1 - similarity measure) on the following vector pairs

(-1, -1) and (3,3)
Cosine : 2.0
Tanimoto: 1.2307692307692308

(1, 1) and (3,3)
Cosine : 0.0
Tanimoto

Cosine and Tanimoto Similarity

2009-12-26 Thread Robin Anil
I ran Cosine and Tanimoto distance measures (d = 1 - similarity measure) on the following vector pairs:

(-1, -1) and (3, 3)
Cosine: 2.0
Tanimoto: 1.2307692307692308

(1, 1) and (3, 3)
Cosine: 0.0
Tanimoto: 0.5714285714285714

(1, 8) and (8, 1)
Cosine: 0.7538461538461538
Tanimoto: 0.8596491228070
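
The figures above can be reproduced with a short sketch. The message does not spell out the Tanimoto formula, so this assumes the common definition a.b / (|a|^2 + |b|^2 - a.b), which matches the quoted numbers up to floating point rounding. The function names are illustrative and are not the Mahout API.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """d = 1 - cosine similarity (assumes non-zero vectors)."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def tanimoto_distance(a, b):
    """d = 1 - Tanimoto similarity, using the assumed definition
    a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)


pairs = [((-1, -1), (3, 3)), ((1, 1), (3, 3)), ((1, 8), (8, 1))]
for a, b in pairs:
    print(a, "and", b)
    print("Cosine:  ", cosine_distance(a, b))
    print("Tanimoto:", tanimoto_distance(a, b))
```

Note that under this definition the Tanimoto distance of the parallel pair (1, 1) and (3, 3) is not zero, because the measure is sensitive to vector length as well as direction, while the cosine distance ignores length.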