[ 
https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795570#action_12795570
 ] 

Shashikant Kore commented on MAHOUT-163:
----------------------------------------

Grant, 

Yes, it should have been a configurable number.

When the corpus is large (tens of thousands of documents or more, which is 
the size I was working with), such clusters are most likely formed by 
outliers. Ignoring them doesn't have any impact on the quality.


> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, 
> MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, 
> mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique for identifying cluster 
> labels than taking the top features of the centroid vector. LLR finds 
> terms/phrases that are common in the cluster but rare outside it.
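
For readers unfamiliar with the measure, the idea of "common in the cluster but rare outside" can be illustrated with a 2x2 contingency table of document counts. The following is only a rough sketch, not the code from the attached patches: the class name LlrSketch, the method names, and the k11..k22 parameter names are all hypothetical, and the entropy-based form of the statistic is assumed.

```java
// Illustrative sketch only; names are hypothetical and not taken from the patch.
final class LlrSketch {

    // x * ln(x), using the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts:
    // N * ln(N) - sum(k * ln(k)), where N is the sum of the counts.
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sumXLogX += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - sumXLogX;
    }

    // Log likelihood ratio for a 2x2 table of document counts:
    //   k11 = documents in the cluster that contain the term
    //   k12 = documents in the cluster that do not contain the term
    //   k21 = documents outside the cluster that contain the term
    //   k22 = documents outside the cluster that do not contain the term
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb tiny negative values from round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // The statistic is symmetric: a term that is rare in the cluster but
    // common outside also scores high. This check keeps only terms whose
    // in-cluster rate exceeds the out-of-cluster rate.
    static boolean isOverRepresented(long k11, long k12, long k21, long k22) {
        return (double) k11 * (k21 + k22) > (double) k21 * (k11 + k12);
    }
}
```

Under these assumptions, a candidate label would be a term with a high logLikelihoodRatio score for which isOverRepresented also returns true. A term spread evenly inside and outside the cluster scores close to zero.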

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
