[ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743359#action_12743359 ]
Ted Dunning commented on MAHOUT-163: ------------------------------------ Here are some code comments: this isn't a property ... shouldn't be called get*. how about scoreDocumentFrequencies instead? {noformat} private double getLLRFromDF(int inDF, int outDF, int clusterSize, int corpusSize) { + int k11 = inDF; + int k12 = clusterSize - inDF; + int k21 = outDF; + int k22 = corpusSize - clusterSize - outDF; + + double llr = getLLR(k11, k12, k21, k22); + + return llr; + } {noformat} This next has a spelling error (missing P). Also, entropy isn't a property so this isn't a getter. Since it is a pure function, entropy is a fine name (no verb needed). a,b,c,d should have names (I suggest k1 .. k4). {noformat} + private double getEntroy(int a, int b, int c, int d) { + int[] elements = {a, b, c, d}; + return getEntropy(elements); + } + + private double getEntropy(int a, int b) { + int[] elements = {a, b}; + return getEntropy(elements); + } {noformat} In terms of generality, it might be preferable to use doubles instead of ints here. Again, this isn't a property, so it shouldn't be a getter. {noformat} + private double getEntropy(int[] elements) { + int sum = 0; {noformat} The variable i has the connotation of an index. It can also be a bit hard to read. Pick something else. {noformat} + for (int i : elements) { + sum += i; + } {noformat} Don't use float constant with a double value. {noformat} + double result = 0.0f; + for (int i : elements) { + result += getElementEntropy(i, sum); + } {noformat} WHy the use of zero here? Unary minus is just fine. {noformat} + return 0.0d - result; + } + {noformat} This is a trivial function used only one place and not visible to tests. Why is it separate out? {noformat} + private double getElementEntropy(int a, int sum) { + int zeroFlag = (a == 0 ? 1: 0); {noformat} Use a cast here, not *1f. Cast to double, not float. {noformat} + double result = a * Math.log((a+zeroFlag)*1.0f/sum); + return result; + } {noformat} Actually, I would recommend rewriting the entropy loop this way: {noformat} private double getEntropy(int[] elements) { double sum = 0; for (int x : elements) { sum += x; } double result = 0.0; for (int x : elements) { if (x > 0) { result += x * Math.log(x / sum); } else if (x < 0) { throw new IllegalArgumentException("Should not have negative count for entropy computation: (" + x + ")"); } return -result; } {noformat} This is fine except for the get* name. LLR should have a reference or an explanation in the javadoc. {noformat} + private double getLLR(int k11, int k12, int k21, int k22) { + double rowEntropy = getEntropy(k11, k12) + getEntropy(k21, k22); + double columnEntropy = getEntropy(k11, k21) + getEntropy(k12, k22); + double matrixEntropy = getEntroy(k11, k12, k21, k22); + double result = 2 * (matrixEntropy - rowEntropy - columnEntropy); + return result; + } + {noformat} > Get (better) cluster labels using Log Likelihood Ratio > ------------------------------------------------------ > > Key: MAHOUT-163 > URL: https://issues.apache.org/jira/browse/MAHOUT-163 > Project: Mahout > Issue Type: Improvement > Reporter: Shashikant Kore > Fix For: 0.2 > > Attachments: mahout-cluster-labels-llr.patch > > > Log Likelihood Ratio (LLR) is a better technique to identify cluster labels > instead of the top features of the centroid vector. LLR finds terms/phrases > which are common in the cluster but rare outside. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.