[
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414702#comment-13414702
]
Pat Ferrel commented on MAHOUT-1045:
------------------------------------
Thinking about how the evaluators are tested, it seems that several cases that
occur often in real-world data might never show up in synthetic data. The first
is ultra-dense clusters caused by identical docs/items. Another is empty
vectors. With dimensionality reduction in its various forms you can easily get
empty vectors and, depending on how the clusterer works, you may get a cluster
around the origin or a cluster of empty vectors. Anyway, back to the fire drill.
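
To make the two cases concrete, here is a minimal sketch of how identical
items or empty vectors could turn a density measure into NaN. It assumes a
density defined as the mean pairwise distance, min-max normalized; that formula
and the names intra_cluster_density and ieee_div are hypothetical illustrations,
not Mahout's actual code. Python raises on division by zero, so the helper
returns NaN explicitly to mirror what Java double arithmetic does with 0.0/0.0.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ieee_div(num, den):
    """Divide the way Java doubles do: 0/0 gives NaN, x/0 gives +/-inf."""
    if den == 0.0:
        return math.nan if num == 0.0 else math.copysign(math.inf, num)
    return num / den


def intra_cluster_density(points):
    """Hypothetical density: mean pairwise distance, min-max normalized.

    This is an assumed formula for illustration only.
    """
    dists = [euclidean(a, b) for a, b in combinations(points, 2)]
    if not dists:
        return math.nan  # fewer than two points: nothing to measure
    lo, hi = min(dists), max(dists)
    mean = sum(dists) / len(dists)
    # When every pairwise distance is equal, hi - lo is 0 and so is mean - lo.
    return ieee_div(mean - lo, hi - lo)


# Three test fixtures: a normal cluster and the two degenerate cases.
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
identical = [(1.0, 2.0)] * 3   # ultra-dense: duplicate docs/items
empty = [(0.0, 0.0)] * 3       # empty vectors, all sitting at the origin

for name, cluster in [("spread", spread), ("identical", identical), ("empty", empty)]:
    print(name, intra_cluster_density(cluster))
```

Under these assumptions both degenerate clusters hit the 0/0 case, which is why
evaluator tests would need fixtures containing duplicates and empty vectors,
not only well-separated synthetic points.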
> Cluster evaluators returning bad results
> ----------------------------------------
>
> Key: MAHOUT-1045
> URL: https://issues.apache.org/jira/browse/MAHOUT-1045
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6, 0.7, 0.8
> Environment: Several environments and data sets
> Reporter: Pat Ferrel
> Fix For: 0.8
>
>
> With real-world crawl data the intra-cluster density from ClusterEvaluator is
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have
> also seen several cases where CDbw fails to return any results, but I have not
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff
> Eastman.