[ 
https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743359#action_12743359
 ] 

Ted Dunning commented on MAHOUT-163:
------------------------------------

Here are some code comments:

this isn't a property ... shouldn't be called get*.  how about 
scoreDocumentFrequencies instead?
{noformat}  
private double getLLRFromDF(int inDF, int outDF, int clusterSize, int 
corpusSize) {  
+    int k11 = inDF;
+    int k12 = clusterSize - inDF;
+    int k21 = outDF; 
+    int k22 = corpusSize - clusterSize - outDF;
+    
+    double llr = getLLR(k11, k12, k21, k22);
+
+    return llr;
+  }
{noformat}

This next has a spelling error (missing P).  Also, entropy isn't a property so 
this isn't a getter.  Since it is a pure function, entropy is a fine name (no 
verb needed).  a,b,c,d should have names (I suggest k1 .. k4).
{noformat}
+  private double getEntroy(int a, int b, int c, int d) {
+    int[] elements = {a, b, c, d};
+    return getEntropy(elements);
+  }
+  
+  private double getEntropy(int a, int b) {
+    int[] elements = {a, b};
+    return getEntropy(elements);
+  }
{noformat}
In terms of generality, it might be preferable to use doubles instead of ints 
here.  Again, this isn't a property, so it shouldn't be a getter. 
{noformat}
+  private double getEntropy(int[] elements) {
+    int sum = 0;    
{noformat}
The variable i has the connotation of an index.  It can also be a bit hard to 
read.  Pick something else.
{noformat}
+    for (int i : elements) {
+      sum += i;
+    }
{noformat}
Don't use float constant with a double value.
{noformat}
+    double result = 0.0f;
+    for (int i : elements) {
+      result += getElementEntropy(i, sum);
+    }
{noformat}
WHy the use of zero here?  Unary minus is just fine.
{noformat}

+    return 0.0d - result;
+  }
+  
{noformat}
This is a trivial function used only one place and not visible to tests.  Why 
is it separate out?
{noformat}
+  private double getElementEntropy(int a, int sum) {
+    int zeroFlag = (a == 0 ? 1: 0);
{noformat}
Use a cast here, not *1f.  Cast to double, not float.
{noformat}
+    double result = a * Math.log((a+zeroFlag)*1.0f/sum);
+    return result;
+  }
{noformat}
Actually, I would recommend rewriting the entropy loop this way:
{noformat}
private double getEntropy(int[] elements) {
    double sum = 0;    
    for (int x : elements) {
      sum += x;
    }
    double result = 0.0;
    for (int x : elements) {
        if (x > 0) {
            result += x * Math.log(x / sum);
        } else if (x < 0) {
            throw new IllegalArgumentException("Should not have negative count 
for entropy computation: (" + x + ")");
    }
    return -result;
}
{noformat}
This is fine except for the get* name.  LLR should have a reference or an 
explanation in the javadoc.
{noformat}
+  private double getLLR(int k11, int k12, int k21, int k22) {    
+    double rowEntropy = getEntropy(k11, k12) + getEntropy(k21, k22);
+    double columnEntropy = getEntropy(k11, k21) + getEntropy(k12, k22);
+    double matrixEntropy = getEntroy(k11, k12, k21, k22);
+    double result = 2 * (matrixEntropy - rowEntropy - columnEntropy); 
+    return result;    
+  }
+
{noformat}  


> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels 
> instead of the top features of the centroid vector. LLR finds terms/phrases 
> which are common in the cluster but rare outside. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to