I got 168 because I used log base 2 instead of base e. If memory serves, the entropy definition I read normally uses base 2, so I assumed the code used base 2 as well (my bad).
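For anyone who wants to check the two numbers, here is a minimal sketch of the calculation. It follows the row/column/matrix entropy formula quoted further down in this thread, not Mahout's actual source; the names `entropy`, `llr`, and `similarity` and the `log` parameter are my own, chosen for illustration.

```python
import math

def entropy(*counts, log=math.log):
    """Unnormalized entropy of raw counts: N*log(N) - sum(k*log(k))."""
    def x_log_x(x):
        # Treat 0*log(0) as 0 so empty cells do not break the sum.
        return x * log(x) if x > 0 else 0.0
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22, log=math.log):
    """Log-likelihood ratio for a 2x2 table of counts (not frequencies)."""
    row = entropy(k11, k12, log=log) + entropy(k21, k22, log=log)
    col = entropy(k11, k21, log=log) + entropy(k12, k22, log=log)
    mat = entropy(k11, k12, k21, k22, log=log)
    return 2.0 * (mat - row - col)

def similarity(score):
    """Map an LLR score into [0, 1) as in the formula quoted in this thread."""
    return 1.0 - 1.0 / (1.0 + score)

llr_e = llr(7, 8, 13, 300000)                 # natural log
llr_2 = llr(7, 8, 13, 300000, log=math.log2)  # base 2, my mistake
print(llr_e, llr_2, similarity(llr_e))
```

With the natural log this comes out at roughly 117, and with base 2 at roughly 168. The two differ only by the constant factor 1/ln(2), so the ranking of item pairs is the same either way.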
And now I have a better understanding, so thank you both for the explanation.

On Fri, Apr 12, 2013 at 6:01 AM, Sean Owen <sro...@gmail.com> wrote:

> Yes I also get (er, Mahout gets) 117 (116.69), FWIW.
>
> I think the second question concerned counts vs relative frequencies
> -- normalized, or not. Like whether you divide all the counts by their
> sum or not. For a fixed set of observations that does change the LLR
> because it is unnormalized, not because the situation has changed.
>
> Obviously you're right that the changing situations you describe do
> entail a change in LLR!
>
> On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > These numbers don't match what I get.
> >
> > I get LLR = 117.
> >
> > This is wildly anomalous so this pair should definitely be connected. Both
> > items are quite rare (15/300,000 or 20/300,000 rates) but they occur
> > together most of the time that they appear.
> >
> > On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai <baizh...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> the counts for two events are:
> >>
> >>                     Event A    Everything but A
> >> Event B             k11=7      k12=8
> >> Everything but B    k21=13     k22=300,000
> >>
> >> according to the code, I will get:
> >>
> >> rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
> >> colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
> >> matrixEntropy = entropy(7,8,13, 300,000) = 458
> >>
> >> thus,
> >>
> >> LLR=2.0*(458-222-152) = 168
> >> similarityScore = 1 - 1/(1+168) = 0.994
> >>
> >> So, my problem is,
> >> the similarity scores I get for all the items are all this high and it
> >> makes it so hard to identify the real similar ones.
> >>
> >> As you can see, the counts of event A, and B are quite small while the
> >> total count for k22 is quite high. And this phenomenon is quite common in
> >> my dataset.
> >>
> >> So, my question is,
> >> what kind of adjustment could I do to lower the similarity score to a more
> >> reasonable range?
> >>
> >> Please shed some lights, thanks in advance!