Thanks Sean!
On Mon, Jun 3, 2013 at 12:14 PM, Sean Owen <sro...@gmail.com> wrote:
> I glanced at this and I am confused too.
>
> Ted, I double-checked your blog post and it seems fine -- you popped
> the minus sign out of the entropy expression and reversed the args in
> the mutual info term, which will be relevant in a second. This is
> computing the value of the G test, right, and you are computing
> regular entropy and multiplying by the sum later. For the matrix
> [1 0; 0 1] I get an unnormalized LLR of 2.772, yes.
>
> In the Java code, the expression for unnormalized entropy looks
> correct. This is how it gets the "N" term in there explicitly. It
> hasn't omitted the minus sign in entropy. But then the final
> expression should have a minus sign in front of H(k) (matrix
> entropy), right? And it looks like it does the opposite.
>
> The proposed change in this thread doesn't quite work, as it results
> in 0. But I suspect it is prevented from working directly by the
> previous point. Indeed, if you negate all the entropy calculations
> (or, equivalently, flip around the mutual information expression),
> the tests pass. (Except when it comes to how root LLR is handled for
> negative LLR, but that's a detail.)
>
> I suppose it would be best to make the code reflect Ted's nice clear
> post. It is actually a little faster too.
>
> I am still not clear on why the current expression works, though it
> evidently does. I don't know its history or whether it's just an
> alternate formulation.
>
> Since I'm already here, let me see if I can sort out a patch that
> also addresses negative LLR correctly.
>
> On Mon, Jun 3, 2013 at 2:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > So looking at the tests, this makes things look less horrifying.
> >
> > org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood
> >
> >     assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.000001);
> >     assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.00001);
> >     assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.00001);
> >     assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
> >     assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
> >     assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);
> >
> > The next step is to determine whether these values are correct. I
> > recognize the first two.
> >
> > I put these values into my R script and got a successful load. I
> > think that this means that the code is somehow correct, regardless
> > of your reading of it. I don't have time right now to read the code
> > in detail, but I think that things are working.
> >
> > You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R
> >
> > On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >> This is a horrifying possibility. I thought we had several test
> >> cases in place to verify this code.
> >>
> >> Let me look. I wonder if the code you have found is not referenced
> >> somehow.
> >>
> >> On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <qzche...@gmail.com> wrote:
> >>
> >>> The definition of
> >>> org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
> >>> long k12, long k21, long k22) is:
> >>>
> >>>   public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
> >>>     Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
> >>>     // note that we have counts here, not probabilities, and that
> >>>     // the entropy is not normalized.
> >>>     double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
> >>>     double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
> >>>     double matrixEntropy = entropy(k11, k12, k21, k22);
> >>>     if (rowEntropy + columnEntropy > matrixEntropy) {
> >>>       // round off error
> >>>       return 0.0;
> >>>     }
> >>>     return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
> >>>   }
> >>>
> >>> The rowEntropy and columnEntropy computed here might be wrong; I
> >>> think they should be:
> >>>
> >>>   double rowEntropy = entropy(k11 + k12, k21 + k22);
> >>>   double columnEntropy = entropy(k11 + k21, k12 + k22);
> >>>
> >>> which matches LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> >>> from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
> >>>
> >>> LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21
> >>> + k22 in this example) and I is the mutual information:
> >>>
> >>>   I = sum over x,y of p(x,y) * log( p(x,y) / (p(x) * p(y)) )
> >>>
> >>> where x is the value of eventA (1 or 2) and y is the value of
> >>> eventB (1 or 2); p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g.
> >>> p(1,1) = k11/N.
> >>>
> >>> Expanding the logarithm and collecting terms gives
> >>> mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)).
> >>>
> >>> The Mahout version of unnormalized entropy(k11, k12, k21, k22) is
> >>> N * H(k), so we get:
> >>>
> >>>   entropy(k11, k12, k21, k22) - entropy(k11 + k12, k21 + k22)
> >>>       - entropy(k11 + k21, k12 + k22)
> >>>     = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
> >>>
> >>> which, multiplied by 2.0, is just the LLR.
> >>>
> >>> Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio
> >>> wrong, or have I misunderstood something?
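
To make the sign issue Sean describes concrete, here is a quick numeric check for the matrix [1 0; 0 1] (N = 2). It assumes entropy(...) is the xLogX-based helper in LogLikelihood, i.e. entropy(a, b) = xLogX(a + b) - xLogX(a) - xLogX(b), with xLogX(x) = x * ln(x) and xLogX(0) = 0. Using the proposed row/column entropies:

    rowEntropy    = entropy(1 + 0, 0 + 1) = xLogX(2) - 2 * xLogX(1) = 2 ln 2 ~= 1.386294
    columnEntropy = entropy(1 + 0, 0 + 1) = xLogX(2) - 2 * xLogX(1) = 2 ln 2 ~= 1.386294
    matrixEntropy = entropy(1, 0, 0, 1)   = xLogX(2) - 2 * xLogX(1) - 2 * xLogX(0)
                                          = 2 ln 2 ~= 1.386294

The existing final expression, 2.0 * (matrixEntropy - rowEntropy - columnEntropy), evaluates to -2.772589, so the round-off guard (rowEntropy + columnEntropy > matrixEntropy) fires first and returns 0.0 -- exactly the behavior Sean reports for the proposed change. Flipping the expression to 2.0 * (rowEntropy + columnEntropy - matrixEntropy) yields +2.772589, the value the first test case expects.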
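
And for reference, a minimal self-contained sketch in the spirit of "make the code reflect Ted's post" -- not the actual Mahout patch; the class name LlrSketch and the varargs entropy helper are illustrative, modeled on the private xLogX-based helpers in LogLikelihood:

    // Hypothetical sketch, not the Mahout patch: LLR computed directly from
    // the unnormalized entropies of the row sums, column sums, and matrix.
    public final class LlrSketch {

      // x * ln(x), with the 0 * ln(0) = 0 convention.
      private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      // Unnormalized entropy of a set of counts: equals N * H(p) for the
      // normalized distribution p, where N is the total count.
      private static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long k : counts) {
          total += k;
          sumXLogX += xLogX(k);
        }
        return xLogX(total) - sumXLogX;
      }

      // LLR = 2 * (H(rowSums) + H(colSums) - H(matrix)), all unnormalized;
      // this is the blog's 2 * N * (H(k) - H(rowSums(k)) - H(colSums(k)))
      // with its flipped sign convention unwound.
      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        double llr = 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
        // Mutual information is non-negative, so a negative result is round-off.
        return Math.max(0.0, llr);
      }

      public static void main(String[] args) {
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));   // ~2.772589
        System.out.println(logLikelihoodRatio(10, 0, 0, 10)); // ~27.72589
      }
    }

Running main prints approximately 2.772589 and 27.72589, matching the first two test values quoted above.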