So looking at the tests makes things look less horrifying. From org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood:
    assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.000001);
    assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.00001);
    assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.00001);
    assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
    assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
    assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);

The next step is to determine whether these values are correct. I recognize
the first two. I put these values into my R script and got a successful
load. I think this means that the code is somehow correct, regardless of
your reading of it. I don't have time right now to read the code in detail,
but I think that things are working.

You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R

On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> This is a horrifying possibility. I thought we had several test cases in
> place to verify this code.
>
> Let me look. I wonder if the code you have found is not referenced
> somehow.
>
>
> On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <qzche...@gmail.com> wrote:
>
>> The definition of
>> org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
>> long k12, long k21, long k22):
>>
>>   public static double logLikelihoodRatio(long k11, long k12, long k21,
>>                                           long k22) {
>>     Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
>>     // note that we have counts here, not probabilities, and that the
>>     // entropy is not normalized.
>>     double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>>     double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>>     double matrixEntropy = entropy(k11, k12, k21, k22);
>>     if (rowEntropy + columnEntropy > matrixEntropy) {
>>       // round off error
>>       return 0.0;
>>     }
>>     return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
>>   }
>>
>> The rowEntropy and columnEntropy computed here might be wrong; I think
>> they should be:
>>
>>   double rowEntropy = entropy(k11 + k12, k21 + k22);
>>   double columnEntropy = entropy(k11 + k21, k12 + k22);
>>
>> which is the same as LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
>> from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
>>
>> LLR = G2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22
>> in this example) and I is the mutual information.
>>
>> [inline image 1: the mutual information formula], where x is the value
>> of event A (1 or 2) and y is the value of event B (1 or 2);
>> p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g. p(1,1) = k11/N.
>>
>> [inline image 2: rewriting I in terms of entropies], from which we get
>> mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)).
>>
>> Since the Mahout version of the unnormalized entropy gives
>> entropy(k11, k12, k21, k22) = N * H(k), we get:
>>
>>   entropy(k11, k12, k21, k22) - entropy(k11 + k12, k21 + k22)
>>       - entropy(k11 + k21, k12 + k22)
>>           = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
>>
>> which, multiplied by 2.0, is just the LLR.
>>
>> Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong,
>> or have I misunderstood something?
>>
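For what it's worth, the two readings appear to be algebraically equivalent, which would explain why the tests pass either way. With the unnormalized entropy H_u(k1..kn) = -sum(k_i * ln(k_i / N)), one has H_u(k11, k12) + H_u(k21, k22) = H_u(k) - H_u(rowSums), so Mahout's 2 * (matrixEntropy - rowEntropy - columnEntropy) collapses to 2 * (H_u(rowSums) + H_u(colSums) - H_u(k)) = 2 * N * I. Here is a minimal, self-contained Java sketch checking both forms numerically against the test values above (the class name LlrSketch and this entropy helper are mine, not Mahout source):

```java
// Sketch: compare the two readings of the LLR formula discussed in
// this thread. Not Mahout code; just the same math, written out.
public class LlrSketch {

    // Unnormalized entropy: -sum(k_i * ln(k_i / N)), with 0 ln 0 = 0.
    static double entropy(long... counts) {
        long n = 0;
        for (long k : counts) {
            n += k;
        }
        double sum = 0.0;
        for (long k : counts) {
            if (k > 0) {
                sum += k * Math.log((double) k / n);
            }
        }
        return -sum;
    }

    // Mahout's formulation: row/column entropies taken cell by cell.
    static double llrMahout(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
        double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
    }

    // The 2 * N * I formulation: entropies of the marginal sums.
    static double llrMutualInfo(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        long[][] cases = {
            {1, 0, 0, 1}, {10, 0, 0, 10},
            {5, 1995, 0, 100000}, {1000, 1995, 1000, 100000}
        };
        for (long[] c : cases) {
            // Both forms should print the same number for every case.
            System.out.printf("%.4f %.4f%n",
                    llrMahout(c[0], c[1], c[2], c[3]),
                    llrMutualInfo(c[0], c[1], c[2], c[3]));
        }
    }
}
```

If the identity holds, each line prints the same value twice, and those values match the assertEquals constants from the test (2.7726, 27.7259, 39.3305, 4730.737). Note the sketch omits Mahout's round-off guard, which clamps tiny negative results to 0.0.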