Yes, Richard is right. I used the arc cosine of the value and it solved the mismatch: Math.acos(value), which ranges from 0 to π / 2 for our vectors. "...π / 2 meaning independent, 0 meaning exactly the same, with in-between values indicating intermediate similarities or dissimilarities...." --wiki<http://en.wikipedia.org/w/index.php?title=Jaccard_index&section=2#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29>
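
A minimal sketch of that conversion, in case it helps anyone following along. The class and method names (AngularDistance, cosine, angle) are made up for illustration and are not Mahout code, and returning 0 for an all-zero vector is just a convention I picked here. It assumes dense double[] vectors of equal length.

```java
/** Hypothetical helper for illustration only; not part of Mahout. */
final class AngularDistance {

    /** Cosine of the angle between a and b. Returns 0 if either vector is all zeros (an assumed convention). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Angle in radians between a and b: 0 means same direction, π / 2 means no shared terms. */
    static double angle(double[] a, double[] b) {
        double c = cosine(a, b);
        // Rounding can push the cosine slightly outside [-1, 1], where Math.acos returns NaN.
        c = Math.max(-1.0, Math.min(1.0, c));
        return Math.acos(c);
    }

    public static void main(String[] args) {
        double[] doc1 = {0.5, 0.0, 1.2};
        double[] doc2 = {0.0, 0.8, 0.0};
        System.out.println("same vector:     " + angle(doc1, doc1));
        System.out.println("no shared terms: " + angle(doc1, doc2));
    }
}
```

Since TF-IDF weights are never negative, the cosine stays in [0, 1], so the angle stays in [0, π / 2], which matches the range in the quote above.
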
I think Tanimoto distance is better suited to binary values, but with TF-IDF we have values other than 0s and 1s. Pearson correlation, as Sean suggested, is equivalent to cosine distance only if the data are 'centered' (have a mean of 0). But as Richard said, TF-IDF vectors aren't going to contain any negative values, so we can't have a mean of 0.

Regards,
Dipesh
>
> 2008/12/6 Sean Owen <[EMAIL PROTECTED]>
>
> > To answer a few recent points:
> >
> > Not sure if this is helpful, but, the collaborative filtering part of
> > Mahout contains an implementation of cosine distance measure -- sort
> > of. Really it has an implementation of the Pearson correlation, which
> > is equivalent, if the data are 'centered' (have a mean of 0). This is,
> > in my opinion, a good idea. So if you agree, you could copy and adapt
> > this implementation of Pearson to your purpose. It is pretty easy to
> > re-create the actual cosine distance measure correlation too from this
> > code -- I used to have it separately in the code.
> >
> > The Tanimoto distance is a ratio of intersection to union of two sets,
> > so is between 0 and 1. Cosine distance is, essentially, the cosine of
> > an angle in feature-space, so is between -1 and 1.
> >
>

--
----------------------------------------
"Help Ever Hurt Never"- Baba
