Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Ted Dunning
The only virtue of using the natural base is that you get a nice asymptotic distribution for random data. On Fri, Apr 12, 2013 at 1:10 AM, Sean Owen wrote: > Yes that's true, it is more usually bits. Here it's natural log / nats. > Since it's unnormalized anyway another constant factor doesn't hurt and it means not having to change the base.
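Ted's point about the natural base can be checked numerically: with natural logs, the LLR of genuinely independent (random) data is asymptotically chi-squared with 1 degree of freedom, whose mean is 1. A minimal simulation sketch, with illustrative parameters of my own choosing (event probabilities 0.3 and 0.4, 2,000 samples per trial):

```python
import math
import random

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # counts-based "entropy" in nats: x*ln(x) of the total minus each cell
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # 2 * delta-log-likelihood in nats; asymptotically chi-squared, 1 df
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))

random.seed(42)
trials, n = 300, 2000
total = 0.0
for _ in range(trials):
    k11 = k12 = k21 = k22 = 0
    for _ in range(n):
        a = random.random() < 0.3   # event A, independent of B
        b = random.random() < 0.4   # event B
        if a and b:
            k11 += 1
        elif a:
            k12 += 1
        elif b:
            k21 += 1
        else:
            k22 += 1
    total += llr(k11, k12, k21, k22)

mean_llr = total / trials
print(mean_llr)  # hovers near 1.0, the mean of a chi-squared(1) variable
```

Had the logs been base 2, the same statistic would hover near 1/ln 2 ≈ 1.44 instead, which is why the natural base is the convenient one here.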

Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Sean Owen
Yes that's true, it is more usually bits. Here it's natural log / nats. Since it's unnormalized anyway, another constant factor doesn't hurt, and it means not having to change the base. On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai wrote: > I got 168, because I use log base 2 instead of e. …

Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Phoenix Bai
I got 168 because I use log base 2 instead of e. If memory serves, I read in the definition of entropy that people normally use base 2, so I just assumed it was 2 in the code (my bad). Now I have a better understanding, so thank you both for the explanation.
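The two numbers in the thread are consistent: 168 (base 2) and ~117 (natural log) differ by exactly the change-of-base factor ln 2. A sketch using the G-test form of the LLR on the thread's table:

```python
import math

# G-test form of the LLR: 2 * sum over cells of k * ln(k / expected)
def llr_nats(k11, k12, k21, k22):
    total = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22   # row sums
    c1, c2 = k11 + k21, k12 + k22   # column sums
    g = 0.0
    for k, r, c in ((k11, r1, c1), (k12, r1, c2),
                    (k21, r2, c1), (k22, r2, c2)):
        if k > 0:
            g += k * math.log(k * total / (r * c))
    return 2.0 * g

nats = llr_nats(7, 8, 13, 300000)   # the thread's contingency table
bits = nats / math.log(2)           # change of base: log2(x) = ln(x) / ln(2)
print(round(nats, 1), round(bits, 1))  # roughly 116.7 and 168.4
```

So both computations were right; they just used different units (nats vs. bits), which, as Sean says, is harmless for an unnormalized score.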

Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Sean Owen
Yes, I also get (er, Mahout gets) 117 (116.69), FWIW. I think the second question concerned counts vs. relative frequencies -- normalized or not, i.e., whether you divide all the counts by their sum. For a fixed set of observations that does change the LLR, because it is unnormalized, not beca…

Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Ted Dunning
Counts are critical here. Suppose that two rare events occur together the first time you ever see them. How exciting is this? Not very, in my mind, but not necessarily trivial. Now suppose that they occur together 20 times and never occur alone after you have collected 20 times more data. This i…
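Ted's thought experiment can be made concrete. The background sizes below (10,000 observations, then 20x as much data) are my own illustrative choices, not numbers from the thread:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))

# hypothetical scenario: two rare events that only ever occur together
once = llr(1, 0, 0, 10_000)      # seen together once, first time either appears
many = llr(20, 0, 0, 200_000)    # together 20 times, never apart, 20x the data
print(once, many)  # the second score is roughly 20x the first
```

The probabilities are identical in both cases; only the counts grew, and the LLR grew with them. That is exactly why counts, not relative frequencies, are the right input: more evidence for the same association should score higher.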

Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Ted Dunning
These numbers don't match what I get. I get LLR = 117. This is wildly anomalous, so this pair should definitely be connected. Both items are quite rare (15/300,000 or 20/300,000 rates) but they occur together most of the time that they appear. On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai wrote: …

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Sean Owen
Yes, using counts is more efficient. Certainly it makes the LLR value different, since the results are not normalized; all the input values are N times larger (N = sum k), and so the LLR is N times larger. 2x more events in the same ratio will make the LLR 2x larger too. That's just fine if you're c…
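Sean's scaling claim is easy to verify: multiply every cell of the table by a constant c and the LLR scales by exactly c, because the count/expected ratios inside the logs are unchanged. A quick check on the thread's table:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))

base = llr(7, 8, 13, 300000)
scaled = llr(14, 16, 26, 600000)   # every count doubled: same ratios, 2x events
print(scaled / base)  # 2.0, up to floating-point error
```

Conversely, dividing all counts by their sum (normalizing to relative frequencies) would shrink the LLR by the same factor N, which is Sean's point about why the value changes when you normalize.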

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Phoenix Bai
Good point. BTW, why use counts instead of probabilities -- for ease and efficiency of implementation? Also, do you think the similarity score using counts might differ quite a bit from the one using probabilities? Thank you very much for your prompt reply. On Wed, Apr 10, 2013 at 5:50 PM, Sean Owen wrote: …

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Sean Owen
These events do sound 'similar'. They occur together about half the time either one of them occurs. You might have many pairs that end up being similar for the same reason, and this is not surprising. They're all "really" similar. The mapping here from LLR's range of [0, inf) to [0, 1] is pretty arbitrary…
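For context on that mapping: if I recall correctly, Mahout's LogLikelihoodSimilarity squashes the LLR with similarity = 1 - 1/(1 + LLR), but treat that exact formula as an assumption and check the source; the point of the sketch is only that any monotone map from [0, inf) onto [0, 1) works, and the choice is, as Sean says, pretty arbitrary:

```python
def llr_to_similarity(llr_value):
    # assumed mapping: 1 - 1/(1 + LLR); monotone from [0, inf) onto [0, 1)
    return 1.0 - 1.0 / (1.0 + llr_value)

for v in (0.0, 1.0, 116.69):
    print(v, llr_to_similarity(v))
```

Under this mapping an LLR of 0 gives similarity 0, an LLR of 1 gives 0.5, and large anomalous values like 116.69 land very close to 1, which is why many strongly associated pairs can all look "maximally" similar.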

log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Phoenix Bai
Hi,

The counts for two events are:

                      Event A     Everything but A
    Event B           k11 = 7     k12 = 8
    Everything but B  k21 = 13    k22 = 300,000

According to the code, I will get:

    rowEntropy = entropy(7, 8) + entropy(13, 300,000) = 222
    colEntropy = entropy(7, 13) + entropy(8, 300,000) = 152
    matrixEntropy = entropy(7, …
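A sketch of the full computation for this table, using the standard G² grouping (entropy of row sums plus entropy of column sums minus entropy of all four cells), which reproduces the ~117 that Ted and Sean report later in the thread. Note this groups the entropies differently from the rowEntropy/colEntropy split quoted above, which is where the 222/152 intermediates come from; I believe this matches Mahout's LogLikelihood.logLikelihoodRatio, but verify against the actual code:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # counts-based "entropy" in nats: x*ln(x) of the total minus each cell
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    row_entropy = entropy(k11 + k12, k21 + k22)     # entropy of row sums
    col_entropy = entropy(k11 + k21, k12 + k22)     # entropy of column sums
    matrix_entropy = entropy(k11, k12, k21, k22)    # entropy of all four cells
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)

print(llr(7, 8, 13, 300000))  # about 116.7, the value discussed in the thread
```

Using natural logs here gives the result in nats; dividing by ln 2 converts it to bits (which yields the 168 mentioned earlier in the thread).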