On Sat, Dec 26, 2009 at 12:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> These are fine as distance measures.  It is also common to use
> sqrt(1-cos^2)
> which is more like an angle, but 1-cos is good enough for almost anything.
>
> With normal text, btw, all of the coordinates are positive so the largest
> possible angle is pi/2 (cos = 0, sin = 1).
>

I guess what I was saying is that if you take a less "normal" representation
of text (a random projection, say, or a projection onto the SVD, etc.), you
can get negative similarities which make sense. In this case you have
similarity == 1 for perfect alignment, 0 for uncorrelated, and -1 for
anti-parallel, and you definitely *want* -1, not +1.
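For instance, a minimal sketch in plain Python (made-up 2-d vectors, not
Mahout code), showing the full [-1, 1] range once coordinates can be
negative:

```python
import math

def cosine(a, b):
    """Cosine similarity; lands in [-1, 1] when coordinates may be negative."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))

print(cosine((1, 1), (2, 2)))    # parallel      -> 1.0
print(cosine((1, -1), (1, 1)))   # uncorrelated  -> 0.0
print(cosine((-1, -1), (3, 3)))  # anti-parallel -> -1.0
```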

Going with sqrt(1-cos^2) = sin(theta) ~=~ theta is only good for small
angles. For large angles this isn't so great anymore, and once the angle
goes over pi/2 it's actually no longer monotonic and is most certainly
doing the wrong thing, which is why I usually stick with 1-cos for distance
if I'm not measuring similarity.
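The fold-over past pi/2 is easy to see numerically (a quick sketch, not
from the thread):

```python
import math

# 1 - cos(theta) keeps increasing all the way to pi, while
# sqrt(1 - cos^2(theta)) = sin(theta) turns around at pi/2.
for theta in (0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4, math.pi):
    c = math.cos(theta)
    sin_based = math.sqrt(max(0.0, 1.0 - c * c))  # guard against tiny negatives
    print(f"theta={theta:.3f}  1-cos={1.0 - c:.3f}  sqrt(1-cos^2)={sin_based:.3f}")
```

At theta = 3*pi/4 the sin-based value has already dropped back to the same
number it gave at pi/4, so it can't tell those two angles apart.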

I guess my question to you, Robin, is why would you take the abs?  If the
data is text, then yes, in a normal representation your coefficients are
always nonnegative, so all cosines are at least zero and there's no need
to take the abs, right?

The only case where I'd imagine wanting to consider anti-parallel to be
basically the same as parallel is in the collaborative filtering case,
where, as we've discussed on this list in the past, sometimes a negative
rating is as much a measure of similarity as a positive one. So if you've
mean-centered your ratings, then you do want dot products which
effectively take the abs as well.
I'd say that is the exception, not the norm, however.
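That collaborative-filtering case can be sketched like this (hypothetical
1-5 star ratings, plain Python, not Mahout's Taste code):

```python
import math

def centered(ratings):
    """Subtract the user's mean rating from each of their ratings."""
    mu = sum(ratings) / len(ratings)
    return [r - mu for r in ratings]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))

# Two users who rate the same four items in exactly opposite ways.
alice = [5, 4, 1, 2]
bob = [1, 2, 5, 4]

sim = cosine(centered(alice), centered(bob))
print(sim)       # -1.0: opposite tastes, but still a strong signal
print(abs(sim))  # 1.0 once the sign is dropped
```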

  -jake


>
> On Sat, Dec 26, 2009 at 10:53 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > I ran the Cosine and Tanimoto distance measures (d = 1 - similarity
> > measure) on the following vector pairs:
> >
> > (-1, -1) and (3, 3): Cosine: 2.0, Tanimoto: 1.2307692307692308
> > (1, 1)   and (3, 3): Cosine: 0.0, Tanimoto: 0.5714285714285714
> > (1, 8)   and (8, 1): Cosine: 0.7538461538461538, Tanimoto: 0.8596491228070176
> >
> > How should anti-parallel vectors be treated in the Mahout clustering
> > packages? Is it acceptable to return 2.0 for anti-parallel vectors and
> > 1.0 for perpendicular vectors? In the case of text data the vectors are
> > positive, but if scientific data is being clustered, what should the
> > default behaviour be, given that clustering always tries to find a
> > configuration where the distances are at a minimum? Since I have dealt
> > mostly with text data, I would always take the abs value of the cosine
> > similarity before subtracting from 1.0. Has anyone of you encountered
> > such a situation wrt some particular dataset?
> > Robin
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
