Re: log-likelihood ratio value in item similarity calculation

Phoenix Bai Fri, 12 Apr 2013 00:01:53 -0700

I got 168 because I used log base 2 instead of base e.
If memory serves, I read in the definition of entropy that people
normally use base 2, so I just assumed the code used base 2 as well (my bad).


And now I have a better understanding, so thank you both for the
explanation.
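
For anyone who finds this thread later, here is a minimal Python sketch of
the calculation, so the two numbers can be compared side by side. It is not
the Mahout source: the function names (x_log_x, entropy, llr) and the `log`
parameter are my own, and it uses the standard G-test arrangement
2 * (rowEntropy + colEntropy - matrixEntropy) with unnormalized entropies.
The only thing that differs between the two results is the base of the
logarithm.

```python
import math


def x_log_x(x, log=math.log):
    # x * log(x), with the usual convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts, log=math.log):
    # Unnormalized entropy of raw counts: xlogx(sum) - sum(xlogx(k)).
    return x_log_x(sum(counts), log) - sum(x_log_x(k, log) for k in counts)


def llr(k11, k12, k21, k22, log=math.log):
    # Log-likelihood ratio for a 2x2 table of raw counts.
    row = entropy(k11 + k12, k21 + k22, log=log)
    col = entropy(k11 + k21, k12 + k22, log=log)
    mat = entropy(k11, k12, k21, k22, log=log)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def similarity(score):
    # The mapping quoted in the original question: 1 - 1 / (1 + LLR).
    return 1.0 - 1.0 / (1.0 + score)


if __name__ == "__main__":
    natural = llr(7, 8, 13, 300000)               # natural log
    base2 = llr(7, 8, 13, 300000, log=math.log2)  # log base 2
    print(round(natural, 2), round(similarity(natural), 4))
    print(round(base2, 2), round(similarity(base2), 4))
```

With natural logs this gives roughly 116.7, and with base 2 roughly 168.4,
so the two values differ only by a factor of ln(2). Either way the mapped
similarity is above 0.99, so ranking by the raw LLR separates items better
than comparing the mapped scores.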


On Fri, Apr 12, 2013 at 6:01 AM, Sean Owen <sro...@gmail.com> wrote:

> Yes I also get (er, Mahout gets) 117 (116.69), FWIW.
>
> I think the second question concerned counts vs relative frequencies
> -- normalized, or not. Like whether you divide all the counts by their
> sum or not. For a fixed set of observations that does change the LLR
> because it is unnormalized, not because the situation has changed.
>
> Obviously you're right that the changing situations you describe do
> entail a change in LLR!
>
> On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> > These numbers don't match what I get.
> >
> > I get LLR = 117.
> >
> > This is wildly anomalous so this pair should definitely be connected. Both
> > items are quite rare (15/300,000 or 20/300,000 rates) but they occur
> > together most of the time that they appear.
> >
> >
> >
> > On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai <baizh...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> the counts for two events are:
> >>                     Event A    Everything but A
> >> Event B             k11=7      k12=8
> >> Everything but B    k21=13     k22=300,000
> >>
> >> according to the code, I will get:
> >>
> >> rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
> >> colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
> >> matrixEntropy = entropy(7,8,13, 300,000) = 458
> >>
> >> thus,
> >>
> >> LLR=2.0*(458-222-152) = 168
> >> similarityScore = 1 - 1/(1+168) = 0.994
> >>
> >> So, my problem is,
> >> the similarity scores I get for all the items are this high, which
> >> makes it hard to identify the truly similar ones.
> >>
> >> As you can see, the counts of events A and B are quite small while the
> >> count for k22 is quite high. And this phenomenon is quite common in
> >> my dataset.
> >>
> >> So, my question is,
> >> what kind of adjustment could I do to lower the similarity score to a more
> >> reasonable range?
> >>
> >> Please shed some light, thanks in advance!
> >>
>
