[ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415371#comment-13415371
 ] 

Jeff Eastman commented on MAHOUT-1045:
--------------------------------------

Yes, the pruning process was supposed to catch all of the irregular situations 
that might come up, but it was clearly inadequate here. I've removed it in favor 
of calculating a sparse vector of <clusterId, double> for all the intra-cluster 
densities and then omitting any NaN values from the average. It seems to work.
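
For illustration, the NaN-omitting average could look roughly like the sketch 
below. This is not the patch itself: the class name IntraDensityAverage, the 
method averageOmittingNaN, and the use of a plain Map<Integer, Double> in place 
of a Mahout sparse vector are all assumptions made for the example.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
class IntraDensityAverage {

  // Averages the per-cluster densities (clusterId -> density), skipping
  // any NaN entries. Returns NaN when no valid entry remains, so the
  // caller can still tell "no data" apart from a density of zero.
  static double averageOmittingNaN(Map<Integer, Double> densities) {
    double sum = 0.0;
    int count = 0;
    for (double density : densities.values()) {
      if (!Double.isNaN(density)) {
        sum += density;
        count++;
      }
    }
    return count == 0 ? Double.NaN : sum / count;
  }
}
```

Calling averageOmittingNaN on a map where one cluster's density is NaN averages 
only the remaining clusters, so a single degenerate cluster no longer turns the 
whole result into NaN.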

Removing the pruning from the inter-cluster calculation allows clusters with 
fewer elements to be included in the computation but does not introduce NaN 
values, because only the cluster centers are used for this average. I'm going to 
add a method that returns a sparse matrix <clusterId, clusterId, double> of all 
the inter-cluster distances and then compute the average using these values.
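
A rough sketch of that idea follows. Everything named here is hypothetical: the 
class InterClusterDistances, its methods, the nested Map standing in for a 
sparse matrix, and the choice of Euclidean distance between centers (the real 
code would presumably use whatever distance measure the evaluator is 
configured with).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Mahout.
class InterClusterDistances {

  // Euclidean distance between two centers of equal dimension
  // (an assumed stand-in for the configured distance measure).
  static double euclidean(double[] a, double[] b) {
    double sumOfSquares = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sumOfSquares += diff * diff;
    }
    return Math.sqrt(sumOfSquares);
  }

  // Builds clusterId -> (clusterId -> distance), storing each unordered
  // pair once (row id < column id), so the structure stays sparse.
  static Map<Integer, Map<Integer, Double>> distances(Map<Integer, double[]> centers) {
    Map<Integer, Map<Integer, Double>> result = new TreeMap<>();
    for (Map.Entry<Integer, double[]> row : centers.entrySet()) {
      for (Map.Entry<Integer, double[]> col : centers.entrySet()) {
        if (row.getKey() < col.getKey()) {
          result.computeIfAbsent(row.getKey(), k -> new TreeMap<>())
              .put(col.getKey(), euclidean(row.getValue(), col.getValue()));
        }
      }
    }
    return result;
  }

  // Averages every stored distance; returns NaN if there are no pairs.
  static double average(Map<Integer, Map<Integer, Double>> matrix) {
    double sum = 0.0;
    int count = 0;
    for (Map<Integer, Double> row : matrix.values()) {
      for (double distance : row.values()) {
        sum += distance;
        count++;
      }
    }
    return count == 0 ? Double.NaN : sum / count;
  }
}
```

Keeping the pairwise distances in their own structure, separate from the 
averaging step, is what would let the same values be written out to a file for 
further analysis later.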

These two changes should pave the way to a CLI invocation that actually writes 
these structures to files so you can do whatever additional analysis you want.

I will post the revised patch in a bit.

                
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is 
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
> also seen several cases where CDbw fails to return any results but have not 
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff 
> Eastman.


        
