[ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415238#comment-13415238
 ] 

Jeff Eastman commented on MAHOUT-1045:
--------------------------------------

Cluster 33465 looks suspect. Its representative points consist of the same 
document name repeated 5 times plus the cluster center vector, which is 
identical to the repeated document vector. This would occur if the 
clusteredPoints for the cluster contained only that single document, a case 
related to clusters having none. The representative points calculation emits 
duplicates like this when there are fewer points in the cluster than the 
number of representative points requested.
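
To illustrate how a single-document cluster can yield repeated representative 
points, here is a simplified sketch of a farthest-point selection. It is not 
the Mahout implementation: the class name, the use of plain double[] vectors, 
and the selection rule (pick the point whose nearest already-chosen 
representative is farthest away) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Mahout code: shows why a cluster with fewer points
// than requested representatives ends up with duplicate representatives.
final class RepresentativePointsSketch {
    private RepresentativePointsSketch() {}

    // Returns the center followed by up to n selected points. On each pass the
    // point with the largest distance to its nearest existing representative
    // is chosen; nothing prevents the same point being chosen again.
    static List<double[]> select(double[] center, List<double[]> points, int n) {
        List<double[]> reps = new ArrayList<>();
        reps.add(center);
        for (int i = 0; i < n; i++) {
            double[] best = null;
            double bestDist = -1.0;
            for (double[] p : points) {
                double minDist = Double.MAX_VALUE;
                for (double[] r : reps) {
                    minDist = Math.min(minDist, distance(p, r));
                }
                if (minDist > bestDist) {
                    bestDist = minDist;
                    best = p;
                }
            }
            if (best != null) {
                reps.add(best);
            }
        }
        return reps;
    }

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

With one point in the cluster and five representatives requested, the same 
point is selected on every pass, which matches the pattern seen in cluster 
33465.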

This cluster should be pruned by invalidCluster(), but it is not, because the 
document vector values differ from the cluster center vector by an 
insignificant amount (e.g. 0.09790894164219519 vs 0.09790894164219517). This 
filter should probably use an epsilon comparison rather than Vector.equals(), 
which treats these small differences as significant.
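
As a rough sketch of what an epsilon comparison could look like, the helper 
below compares two vectors element by element with a tolerance. It works on 
plain double[] arrays rather than Mahout's Vector type, and the class name, 
method name, and choice of epsilon are assumptions for illustration only.

```java
// Hypothetical helper, not part of Mahout: tolerance-based vector equality.
final class EpsilonCompare {
    private EpsilonCompare() {}

    // True if both arrays have the same length and every pair of elements
    // differs by at most epsilon. A NaN in either array makes the result
    // false, because the <= test fails for NaN.
    static boolean nearlyEqual(double[] a, double[] b, double epsilon) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (!(Math.abs(a[i] - b[i]) <= epsilon)) {
                return false;
            }
        }
        return true;
    }
}
```

With the values quoted above, an exact comparison reports a difference, while 
nearlyEqual with an epsilon such as 1e-9 treats the two vectors as the same. 
A suitable epsilon would depend on the scale of the data.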

All of this makes me wonder if the pruning process should be scrapped, and NaN 
values simply not included in the averages. I'm going to follow this line of 
thought and will report back later.
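
The idea of leaving NaN values out of the averages might look something like 
the sketch below. It is only an illustration of the approach under 
consideration, not the evaluator code; the class and method names are 
hypothetical, and returning NaN when no valid values remain is an assumed 
convention.

```java
// Hypothetical sketch, not Mahout code: average that skips NaN entries.
final class NanSkippingMean {
    private NanSkippingMean() {}

    // Mean of the non-NaN values. Returns NaN if there are no such values,
    // so callers can still detect the "nothing to average" case.
    static double meanIgnoringNaN(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```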
                
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is 
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
> also seen several cases where CDbw fails to return any results but have not 
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff 
> Eastman.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
