To answer a few recent points: not sure if this is helpful, but the collaborative filtering part of Mahout contains an implementation of the cosine distance measure -- sort of. Really it has an implementation of the Pearson correlation, which is equivalent to the cosine measure when the data are 'centered' (have a mean of 0). Centering is, in my opinion, a good idea. So if you agree, you could copy and adapt this implementation of Pearson to your purpose. It is also pretty easy to re-create the actual (uncentered) cosine measure from this code -- I used to have it separately in the code.
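
To make the equivalence concrete, here is a minimal sketch in plain Python (not Mahout's code; the function names `cosine`, `center` and `pearson` are just made up for illustration). It computes Pearson from running sums and the cosine from dot products, and the two agree once the vectors are centered:

```python
import math

def cosine(x, y):
    # Cosine of the angle between x and y: dot(x, y) / (|x| * |y|).
    # Undefined for an all-zero vector (this sketch would raise ZeroDivisionError).
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(v):
    # Subtract the mean so that the vector has a mean of 0.
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def pearson(x, y):
    # Pearson correlation written in terms of sums over the raw data.
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

print(pearson(x, y))                 # Pearson on the raw data
print(cosine(center(x), center(y)))  # same value, up to rounding
print(cosine(x, y))                  # uncentered cosine, generally different
```

For these two vectors the first two values come out at about 0.6, while the uncentered cosine is about 0.93, so whether you center does change the answer.
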
The Tanimoto coefficient is the ratio of the size of the intersection to the size of the union of two sets, so it is between 0 and 1. The cosine measure is, essentially, the cosine of the angle between two vectors in feature space, so it is between -1 and 1.

On Sat, Dec 6, 2008 at 12:54 PM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I used the Tanimoto distance. As I understand it, it's almost like the
> cosine distance, with a range between 0 and infinity as opposed to 0 and
> 3.14. Seems to work well.
>
> On Fri, Dec 5, 2008 at 11:54 PM, dipesh <[EMAIL PROTECTED]> wrote:
>
>> Hi Philippe,
>>
>> I'm also doing some work on text clustering with feature extraction. For
>> text clustering the Cosine Distance is considered a better Similarity
>> metrics than the Eucledian Distance Measure. I couldn't find
>> CosineDistanceMeasure in Mahout, did u use Cosine Distance Measure in your
>> clustering project?
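
To illustrate the ranges of the two measures, here is a small sketch in plain Python. Again this is not Mahout's API; `tanimoto` and `cosine` are hypothetical helper names, and returning 0.0 for two empty sets is just a convention assumed here.

```python
import math

def tanimoto(a, b):
    # Ratio of |intersection| to |union| of two sets: always in [0, 1].
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: treat two empty sets as having no overlap
    return len(a & b) / len(union)

def cosine(x, y):
    # Cosine of the angle between two vectors: always in [-1, 1].
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)

print(tanimoto({"cat", "dog", "fish"}, {"dog", "fish", "bird"}))  # 2 shared of 4 total
print(tanimoto({"cat"}, {"cat"}))            # identical sets
print(tanimoto({"cat"}, {"dog"}))            # disjoint sets
print(cosine([1.0, 0.0], [-1.0, 0.0]))       # opposite directions
print(cosine([1.0, 2.0], [2.0, 4.0]))        # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))        # orthogonal
```

Note that for vectors with no negative entries, such as raw term counts in text clustering, the cosine cannot go below 0, so in that setting both measures end up in the range 0 to 1.
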
