Ah, I didn't realise that there was an implementation of the Pearson correlation; I just wrote a cosine distance measure myself. The cosine does range from -1 to 1, but TF-IDF vectors have no negative components, so in practice it ranges from 0 to 1. You have to be careful, though: the k-means implementation assumes that a larger distance value means "further away" (for clustering purposes), whereas with the raw cosine a larger value means "closer together", so it has to be inverted (for example, 1 - cosine) before it is used as a distance.
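
To make the k-means point concrete, here is a minimal sketch in plain Java. It is not written against Mahout's own distance-measure interface, and the class and method names are made up for illustration. It computes the cosine of the angle between two vectors and returns 1 minus that value, so that a larger result means "further away". Treating an all-zero vector as unrelated to everything is my own assumption.

```java
final class CosineDistanceSketch {

    // Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal.
    // With non-negative TF-IDF weights the result stays in [0, 1].
    public static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: an all-zero vector is treated as unrelated (similarity 0).
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Inverted so that a larger value means "further away", as k-means expects.
    public static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }
}
```

Two vectors pointing in the same direction come out at a distance of roughly 0, and two vectors with no terms in common come out at 1.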
2008/12/6 Sean Owen <[EMAIL PROTECTED]>

> To answer a few recent points:
>
> Not sure if this is helpful, but, the collaborative filtering part of
> Mahout contains an implementation of cosine distance measure -- sort
> of. Really it has an implementation of the Pearson correlation, which
> is equivalent, if the data are 'centered' (have a mean of 0). This is,
> in my opinion, a good idea. So if you agree, you could copy and adapt
> this implementation of Pearson to your purpose. It is pretty easy to
> re-create the actual cosine distance measure correlation too from this
> code -- I used to have it separately in the code.
>
> The Tanimoto distance is a ratio of intersection to union of two sets,
> so is between 0 and 1. Cosine distance is, essentially, the cosine of
> an angle in feature-space, so is between -1 and 1.
>
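
The equivalence mentioned in the quoted message (Pearson correlation equals the cosine once the data are centred) can be checked numerically. The following is only a sketch in plain Java with hypothetical names, not the Mahout collaborative filtering code. It computes Pearson from the usual sums formula and compares it with the cosine of the mean-centred vectors.

```java
final class PearsonVsCosineSketch {

    // Subtract the mean from every component so the vector has mean 0.
    public static double[] center(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        double mean = sum / v.length;
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] - mean;
        }
        return out;
    }

    // Plain cosine of the angle between a and b (assumes neither is all zeros).
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Pearson correlation from raw sums (assumes neither vector is constant).
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
        }
        double numerator = n * sxy - sx * sy;
        double denominator = Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
        return numerator / denominator;
    }
}
```

For x = {1, 2, 3, 4} and y = {2, 1, 4, 3}, `pearson(x, y)` and `cosine(center(x), center(y))` agree, while the cosine of the uncentred vectors gives a different, larger value, which is why centring matters.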
