I glanced at this and I am confused too. Ted, I double-checked your blog post and it seems fine -- you popped the minus sign out of the entropy expression and reversed the args in the mutual information term, which will be relevant in a second. This is computing the value of the G test, right? You compute the regular entropy and multiply by the sum later. For the matrix [1 0 ; 0 1] I get an unnormalized LLR of 2.772, yes.
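For reference, here is the quick standalone check I used (my own throwaway sketch, not the Mahout code), written in the blog post's terms: the sign-flipped "entropy" sum(p * log p), with the multiplication by the total count N done at the end. For [1 0 ; 0 1] it prints ~2.772589.

    // Standalone sketch of the blog-post formulation (not the Mahout code):
    // LLR = 2 * N * (H(k) - H(rowSums(k)) - H(colSums(k))), where H here is the
    // sign-flipped "entropy" sum(p * log p) and N is the total count.
    public class BlogLlrCheck {
      // Sign-flipped entropy of a distribution given as raw counts.
      static double h(double... counts) {
        double n = 0;
        for (double c : counts) {
          n += c;
        }
        double h = 0;
        for (double c : counts) {
          if (c > 0) {
            h += (c / n) * Math.log(c / n);
          }
        }
        return h;
      }

      public static void main(String[] args) {
        double k11 = 1, k12 = 0, k21 = 0, k22 = 1;
        double n = k11 + k12 + k21 + k22;
        double llr = 2.0 * n * (h(k11, k12, k21, k22)   // H(k)
            - h(k11 + k12, k21 + k22)                   // H(rowSums(k))
            - h(k11 + k21, k12 + k22));                 // H(colSums(k))
        System.out.println(llr);                        // prints ~2.7725887
      }
    }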
In the Java code, the expression for unnormalized entropy looks correct. That is how it gets the "N" term in there explicitly, and it has not omitted the minus sign in entropy. But then the final expression should have a minus sign in front of H(k) (the matrix entropy), right? And it looks like it does the opposite.

The proposed change in this thread doesn't quite work as written -- it results in 0 -- but I suspect it is prevented from working directly by the previous point. Indeed, if you negate all of the entropy calculation (or, equivalently, flip around the mutual information expression) the tests pass. (Except for how the root LLR is handled for negative LLR, but that's a detail.)

I suppose it would be best to make the code reflect Ted's nice clear post; it is actually a little faster too. I am still not clear on why the current expression works, though it evidently does -- I don't know its history, or whether it is just an alternate formulation. Since I'm already here, let me see if I can sort out a patch that also addresses negative LLR correctly.
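Roughly the shape I have in mind (an untested sketch against the existing LogLikelihood.entropy(), not a final patch): use the row and column sums as in the post, flip the subtraction to match the conventional entropy sign, and keep a guard so tiny negative values from round-off come back as 0.

    // Untested sketch of what the patch might look like; entropy() is the existing
    // unnormalized entropy in LogLikelihood (xLogX(sum) - sum of xLogX(element)).
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
      // note that we have counts here, not probabilities, and that the entropy is not normalized.
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      if (rowEntropy + columnEntropy < matrixEntropy) {
        // round off error
        return 0.0;
      }
      return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

Since H(rowSums) + H(colSums) >= H(matrix) up to round-off, the flipped guard should only catch round-off here rather than tripping on every input.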
On Mon, Jun 3, 2013 at 2:58 AM, Ted Dunning <[email protected]> wrote:

> So looking at the tests, this makes things look less horrifying.
>
> org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood
>
>     assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.000001);
>     assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.00001);
>     assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.00001);
>     assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
>     assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
>     assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);
>
> Next step is to determine whether these values are correct. I recognize the first two.
>
> I put these values into my R script and got a successful load. I think that this means that the code is somehow correct, regardless of your reading of it. I don't have time right now to read the code in detail, but I think that things are working.
>
> You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R
>
>
> On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning <[email protected]> wrote:
>
>> This is a horrifying possibility. I thought we had several test cases in place to verify this code.
>>
>> Let me look. I wonder if the code you have found is not referenced somehow.
>>
>>
>> On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <[email protected]> wrote:
>>
>>> The definition of org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11, long k12, long k21, long k22):
>>>
>>>     public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
>>>       Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
>>>       // note that we have counts here, not probabilities, and that the entropy is not normalized.
>>>       double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>>>       double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>>>       double matrixEntropy = entropy(k11, k12, k21, k22);
>>>       if (rowEntropy + columnEntropy > matrixEntropy) {
>>>         // round off error
>>>         return 0.0;
>>>       }
>>>       return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
>>>     }
>>>
>>> The rowEntropy and columnEntropy computed here might be wrong; I think they should be:
>>>
>>>     double rowEntropy = entropy(k11 + k12, k21 + k22);
>>>     double columnEntropy = entropy(k11 + k21, k12 + k22);
>>>
>>> which is the same as LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k))), as given in http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
>>>
>>> LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in this example) and I is the mutual information:
>>>
>>>     I = sum over x, y of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
>>>
>>> where x is the eventA value (1 or 2), y is the eventB value (1 or 2), p(x, y) = kxy / N, and p(x) = p(x, 1) + p(x, 2); e.g. p(1, 1) = k11 / N.
>>>
>>> Expanding that sum, here we get mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)).
>>>
>>> The Mahout version of unnormalized entropy(k11, k12, k21, k22) = N * H(k), so we get:
>>>
>>>     entropy(k11, k12, k21, k22) - entropy(k11 + k12, k21 + k22) - entropy(k11 + k21, k12 + k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
>>>
>>> which, multiplied by 2.0, is just the LLR.
>>>
>>> Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong or have I misunderstood something?
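PS -- to be concrete about the two claims above, here is a throwaway standalone check (my own sketch, not the Mahout code) that reimplements the unnormalized entropy and runs the current formulation next to the quoted proposal. The proposal exactly as posted always comes back as 0, because with the conventional sign rowEntropy + columnEntropy >= matrixEntropy and the round-off guard fires; with the subtraction flipped it agrees with the current formulation on the test values.

    // Throwaway check: current formulation vs. the proposed rowSums/colSums change,
    // both built on the same unnormalized entropy as LogLikelihood.entropy().
    public class LlrComparison {
      static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      // Unnormalized, conventional-sign entropy: xLogX(sum) - sum of xLogX(element).
      static double entropy(long... elements) {
        long sum = 0;
        double result = 0.0;
        for (long x : elements) {
          result += xLogX(x);
          sum += x;
        }
        return xLogX(sum) - result;
      }

      // Current code: per-row and per-column entropies, matrix minus the others.
      static double llrCurrent(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
        double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy > matrixEntropy) {
          return 0.0; // round off error
        }
        return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
      }

      // The change exactly as proposed: rowSums/colSums entropies, nothing else touched.
      static double llrAsProposed(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy > matrixEntropy) {
          return 0.0; // with the conventional sign this fires for essentially every input
        }
        return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
      }

      // The proposal with the subtraction flipped to match the conventional sign.
      static double llrProposedFlipped(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
      }

      public static void main(String[] args) {
        long[][] cases = {{1, 0, 0, 1}, {10, 0, 0, 10}, {5, 1995, 0, 100000}, {1000, 1995, 1000, 100000}};
        for (long[] k : cases) {
          System.out.printf("current=%.6f proposed=%.6f flipped=%.6f%n",
              llrCurrent(k[0], k[1], k[2], k[3]),
              llrAsProposed(k[0], k[1], k[2], k[3]),
              llrProposedFlipped(k[0], k[1], k[2], k[3]));
        }
      }
    }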
