One thing I find irritating about cosine, or any similarity confined to the
range [0, 1], is that two distinct items sometimes end up with a vanishingly
small distance between them when you inspect the scores. I am always worried
that float precision is not enough to capture the small detail that makes the
difference between accept and reject. Log-likelihood similarity, on the other
hand, tends to produce values of 100+, sometimes even 1000+, for strong
associations, while very unlikely events get small values below 1.0.
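
For what it's worth, here is a quick sketch (my own, in plain numpy/Python,
not from any library) that illustrates both points: the gap between two
near-duplicate candidates can fall below float32 resolution, while Dunning's
log-likelihood ratio on a 2x2 co-occurrence table easily reaches the hundreds:

import numpy as np
from math import log

def cosine(a, b):
    # dtype of a and b determines the precision of the result
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def xlogx(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # unnormalized entropy term used in Dunning's LLR
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Dunning's log-likelihood ratio over a 2x2 contingency table
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp round-off error

rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = a + 1e-4 * rng.standard_normal(1000)  # near-duplicate of a
c = a + 2e-4 * rng.standard_normal(1000)  # slightly less similar

f32 = np.float32
print(cosine(a.astype(f32), b.astype(f32)))  # ~1.0
print(cosine(a.astype(f32), c.astype(f32)))  # often the same float32 value
print(cosine(a, b) - cosine(a, c))           # float64 gap, around 1e-8

# 100 co-occurrences out of 10,000 events where ~12 were expected
print(llr(100, 200, 300, 9400))              # roughly 300

In float32 the two candidates can become literally indistinguishable, which
is exactly the accept/reject worry I mean.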

In practice this concern seems to hold: as the number of documents increases,
I usually have to rescale cosine to a larger range, or switch to some hybrid
similarity metric, to get good clustering (see the sketch below for what I
mean). What about you two? You have both worked on huge data sets; what
insights can you share about what works and what doesn't?
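
To make the hybrid idea concrete, this is roughly what I have in mind; it is
only a sketch, and both knobs (the blend weight alpha and the squashing
constant scale) are hypothetical, not from any library:

def hybrid_similarity(cos_sim, llr_score, alpha=0.5, scale=100.0):
    # Squash LLR from [0, inf) into [0, 1) so the two terms are
    # comparable, then take a convex blend with cosine.
    squashed = llr_score / (llr_score + scale)
    return alpha * cos_sim + (1.0 - alpha) * squashed

print(hybrid_similarity(0.93, 300.0))  # 0.5*0.93 + 0.5*0.75 = 0.84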
