Thanks Sean!
On Mon, Jun 3, 2013 at 12:14 PM, Sean Owen <sro...@gmail.com> wrote:
> I glanced at this and I am confused too.
>
> Ted, I double-checked your blog post and it seems fine -- you popped
> the minus sign out of the entropy expression and reversed the args in
> the mutual info term, which will be relevant in a second. This is
> computing the value of the G test, right, and you are computing
> regular entropy and multiplying by the sum later. For the matrix
> [1 0; 0 1] I get an unnormalized LLR of 2.772, yes.
>
> In the Java code, the expression for unnormalized entropy looks
> correct. This is how it gets the "N" term in there explicitly. It
> hasn't omitted the minus sign in entropy. But then the final
> expression should have a minus sign in front of H(k) (matrix
> entropy), right? And it looks like it does the opposite.
>
> The proposed change in this thread doesn't quite work, as it results
> in 0. But I suspect it is prevented from working directly by the
> previous point. Indeed, if you negate all the entropy calculations
> (or, equivalently, flip around the mutual information expression),
> the tests pass. (Except when it comes to how root LLR is handled for
> negative LLR, but that's a detail.)
>
> I suppose it would be best to make the code reflect Ted's nice clear
> post. It is actually a little faster too.
>
> I am still not clear on why the current expression works, though it
> evidently does. I don't know its history or whether it's just an
> alternate formulation.
>
> Since I'm already here, let me see if I can sort out a patch that
> also addresses negative LLR correctly.
>
> On Mon, Jun 3, 2013 at 2:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > So looking at the tests, this makes things look less horrifying.
> >
> > org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood
> >
> >     assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.000001);
> >     assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.00001);
> >     assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.00001);
> >     assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
> >     assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
> >     assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);
> >
> > The next step is to determine whether these values are correct. I
> > recognize the first two.
> >
> > I put these values into my R script and got a successful load. I
> > think that this means that the code is somehow correct, regardless
> > of your reading of it. I don't have time right now to read the code
> > in detail, but I think that things are working.
> >
> > You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R
> >
> > On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >> This is a horrifying possibility. I thought we had several test
> >> cases in place to verify this code.
> >>
> >> Let me look. I wonder if the code you have found is not referenced
> >> somehow.
> >>
> >> On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <qzche...@gmail.com> wrote:
> >>
> >>> The definition of
> >>> org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
> >>> long k12, long k21, long k22) is:
> >>>
> >>>   public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
> >>>     Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
> >>>     // note that we have counts here, not probabilities, and that
> >>>     // the entropy is not normalized.
> >>>     double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
> >>>     double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
> >>>     double matrixEntropy = entropy(k11, k12, k21, k22);
> >>>     if (rowEntropy + columnEntropy > matrixEntropy) {
> >>>       // round off error
> >>>       return 0.0;
> >>>     }
> >>>     return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
> >>>   }
> >>>
> >>> The rowEntropy and columnEntropy computed here might be wrong; I
> >>> think they should be:
> >>>
> >>>   double rowEntropy = entropy(k11 + k12, k21 + k22);
> >>>   double columnEntropy = entropy(k11 + k21, k12 + k22);
> >>>
> >>> which matches LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> >>> from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
> >>>
> >>> LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21
> >>> + k22 in this example) and I is the mutual information:
> >>>
> >>>   I = sum over x,y of p(x,y) * log( p(x,y) / (p(x) * p(y)) )
> >>>
> >>> where x is the value of eventA (1 or 2) and y is the value of
> >>> eventB (1 or 2); p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g.
> >>> p(1,1) = k11/N.
> >>>
> >>> Expanding the logarithm and collecting terms gives
> >>> mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)).
> >>>
> >>> The Mahout version of unnormalized entropy(k11, k12, k21, k22) is
> >>> N * H(k), so we get:
> >>>
> >>>   entropy(k11, k12, k21, k22) - entropy(k11 + k12, k21 + k22)
> >>>       - entropy(k11 + k21, k12 + k22)
> >>>     = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
> >>>
> >>> which, multiplied by 2.0, is just the LLR.
> >>>
> >>> Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio
> >>> wrong, or have I misunderstood something?
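
To make the sign issue Sean describes concrete, here is a quick numeric check for the matrix [1 0; 0 1] (N = 2). It assumes entropy(...) is the xLogX-based helper in LogLikelihood, i.e. entropy(a, b) = xLogX(a + b) - xLogX(a) - xLogX(b), with xLogX(x) = x * ln(x) and xLogX(0) = 0. Using the proposed row/column entropies:

    rowEntropy    = entropy(1 + 0, 0 + 1) = xLogX(2) - 2 * xLogX(1) = 2 ln 2 ~= 1.386294
    columnEntropy = entropy(1 + 0, 0 + 1) = xLogX(2) - 2 * xLogX(1) = 2 ln 2 ~= 1.386294
    matrixEntropy = entropy(1, 0, 0, 1)   = xLogX(2) - 2 * xLogX(1) - 2 * xLogX(0)
                                          = 2 ln 2 ~= 1.386294

The existing final expression, 2.0 * (matrixEntropy - rowEntropy - columnEntropy), evaluates to -2.772589, so the round-off guard (rowEntropy + columnEntropy > matrixEntropy) fires first and returns 0.0 -- exactly the behavior Sean reports for the proposed change. Flipping the expression to 2.0 * (rowEntropy + columnEntropy - matrixEntropy) yields +2.772589, the value the first test case expects.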
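
And for reference, a minimal self-contained sketch in the spirit of "make the code reflect Ted's post" -- not the actual Mahout patch; the class name LlrSketch and the varargs entropy helper are illustrative, modeled on the private xLogX-based helpers in LogLikelihood:

    // Hypothetical sketch, not the Mahout patch: LLR computed directly from
    // the unnormalized entropies of the row sums, column sums, and matrix.
    public final class LlrSketch {

      // x * ln(x), with the 0 * ln(0) = 0 convention.
      private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      // Unnormalized entropy of a set of counts: equals N * H(p) for the
      // normalized distribution p, where N is the total count.
      private static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long k : counts) {
          total += k;
          sumXLogX += xLogX(k);
        }
        return xLogX(total) - sumXLogX;
      }

      // LLR = 2 * (H(rowSums) + H(colSums) - H(matrix)), all unnormalized;
      // this is the blog's 2 * N * (H(k) - H(rowSums(k)) - H(colSums(k)))
      // with its flipped sign convention unwound.
      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        double llr = 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
        // Mutual information is non-negative, so a negative result is round-off.
        return Math.max(0.0, llr);
      }

      public static void main(String[] args) {
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));   // ~2.772589
        System.out.println(logLikelihoodRatio(10, 0, 0, 10)); // ~27.72589
      }
    }

Running main prints approximately 2.772589 and 27.72589, matching the first two test values quoted above.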