Re: RowSimilarity

Suneel Marthi Sat, 12 May 2012 19:07:26 -0700

The consider() method in the similarity measure (Tanimoto in your scenario) is 
the one that does the cut-off.
Almost all of the similarity measures have some implementation of consider() 
so as to cut off the returned results.
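
To illustrate the idea (this is only a rough sketch in Python, not Mahout's actual code), here is how a consider()-style pre-check combined with a threshold can return fewer rows than were asked for. It assumes each row is a plain set of term ids, and the names tanimoto, worth_considering and similar_rows are made up for the example. The pre-check relies on the fact that the Tanimoto coefficient of two sets can never exceed the ratio of the smaller size to the larger one.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient of two sets of term ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def worth_considering(size_a: int, size_b: int, threshold: float) -> bool:
    """Cheap pre-check (hypothetical helper): since the coefficient is at most
    min(size) / max(size), a pair whose sizes differ too much cannot reach
    the threshold and is skipped without computing the similarity."""
    return size_a >= size_b * threshold and size_b >= size_a * threshold


def similar_rows(rows: dict, threshold: float, top_n: int) -> dict:
    """For each row, keep at most top_n other rows with similarity >= threshold."""
    result = {}
    for i, a in rows.items():
        scored = []
        for j, b in rows.items():
            if i == j or not worth_considering(len(a), len(b), threshold):
                continue
            s = tanimoto(a, b)
            if s >= threshold:
                scored.append((s, j))
        # Highest similarity first; ties broken by row id.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = scored[:top_n]
    return result


rows = {"d1": {1, 2, 3, 4}, "d2": {1, 2, 3}, "d3": {9}}
print(similar_rows(rows, threshold=0.5, top_n=10))
```

With a threshold of 0.5, "d3" ends up with no similar rows at all even though 10 were requested, because every pair involving it is dropped before or after scoring. With a threshold of 0.0 in this sketch, nothing is pruned and every other row is returned, up to top_n.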

Have a look at Sebastian's explanation in 
https://issues.apache.org/jira/browse/MAHOUT-803.

________________________________
 From: Pat Ferrel <p...@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Saturday, May 12, 2012 7:29 PM
Subject: RowSimilarity

I tried an experiment running RowSimilarity with 16 docs of short quotations on 
a similar subject. It looks to me like, using Tanimoto, the largest pair-wise 
distance allowed for the similar docs was 0.4. Though I asked for 10 similar 
docs, I got anywhere from 0 to 10 per doc. I see this same effect with larger 
data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about cut-off distance myself. 
In other words I was expecting to always get 20 similar docs when I asked for 
20. It is useful to see what docs are at larger distances.

How is RowSimilarity deciding when to cut off the returned docs?
