[ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419865#comment-13419865
 ] 

Jeff Eastman edited comment on MAHOUT-1045 at 7/21/12 3:57 PM:
---------------------------------------------------------------

This patch adds similar CDbw methods to return per-cluster densities and adds 
some significant performance improvements by caching values that were computed 
twice before. It also adds a unit test that evaluates against Pat's database. I 
changed the inter-cluster density calculation to return the average - like the 
document says but does not show in eqn 1 - and to ignore the NaN values that 
are present in this dataset. Probably needs some more testing too.
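
To make the averaging change concrete, here is a minimal sketch of a mean that skips NaN entries. It is not the code from the patch: the class name `NanAwareAverage`, the method `meanIgnoringNaN`, and the choice to return 0.0 when every entry is NaN are all assumptions made for illustration.

```java
/** Illustrative helper only; names and conventions are hypothetical, not Mahout's. */
final class NanAwareAverage {
  private NanAwareAverage() {}

  /**
   * Returns the mean of the non-NaN entries of {@code values}.
   * Assumed convention: returns 0.0 when no usable entry exists.
   */
  static double meanIgnoringNaN(double[] values) {
    double sum = 0.0;
    int count = 0;
    for (double v : values) {
      if (Double.isNaN(v)) {
        continue; // skip undefined per-cluster densities
      }
      sum += v;
      count++;
    }
    return count == 0 ? 0.0 : sum / count;
  }
}
```

Dividing by the number of usable entries rather than the total number of clusters means a NaN cluster neither poisons the average nor counts as a zero. Whether that is the right treatment is exactly the open question raised in the next paragraph.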

I'm still not sure how to deal with the NaN values, as they appear to be valid 
but infinitely dense clusters that all contain vectors identical to the cluster 
center. 
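
As a rough illustration of how such a cluster can produce NaN, the sketch below computes the mean distance from each point to the center and then a ratio built on it. The `relativeSpread` ratio (maximum distance divided by mean distance) is a hypothetical stand-in, not the CDbw formula or anything taken from the patch. The only point is that when every vector equals the center, any ratio of two zero distances becomes 0.0 / 0.0, which is NaN in Java floating-point arithmetic.

```java
/** Illustrative only; class and method names are hypothetical, not Mahout's. */
final class DegenerateCluster {
  private DegenerateCluster() {}

  /** Euclidean distance between two vectors of equal length. */
  static double euclidean(double[] a, double[] b) {
    double sumSquares = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sumSquares += d * d;
    }
    return Math.sqrt(sumSquares);
  }

  /** Mean distance from the points to the center. */
  static double meanDistanceToCenter(double[][] points, double[] center) {
    double sum = 0.0;
    for (double[] p : points) {
      sum += euclidean(p, center);
    }
    return sum / points.length;
  }

  /** Hypothetical ratio: largest distance to the center over the mean distance. */
  static double relativeSpread(double[][] points, double[] center) {
    double max = 0.0;
    for (double[] p : points) {
      max = Math.max(max, euclidean(p, center));
    }
    // For a cluster whose points all equal the center this is 0.0 / 0.0.
    return max / meanDistanceToCenter(points, center);
  }
}
```

For a cluster such as `{{1, 2}, {1, 2}, {1, 2}}` with center `{1, 2}`, the mean distance is 0.0 and `relativeSpread` is NaN, while a count-over-spread ratio such as `points.length / meanDistance` would be positive infinity. That matches the reading of these clusters as valid but infinitely dense.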
                
      was (Author: jeastman):
    This patch adds similar CDbw methods to return per-cluster densities and 
adds some significant performance options by caching values that were computed 
twice before. It also adds a unit test that evaluates against Pat's database. I 
changed the inter-cluster density calculation to return the average - like the 
document says but does not show in eqn 1 - and to ignore the NaN values that 
are present in this dataset. Probably needs some more testing too.

I'm still not sure how to deal with the NaN values, as they appear to be valid 
but infinitely dense clusters that all contain vectors identical to the cluster 
center. 
                  
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1045.patch, MAHOUT-1045.patch, MAHOUT-1045.patch, 
> MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is 
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
> also seen several cases where CDbw fails to return any results but have not 
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff 
> Eastman.

