[ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414697#comment-13414697
 ] 

Pat Ferrel commented on MAHOUT-1045:
------------------------------------

As to the very dense clusters: from visual inspection, I bet those are sets of 
identical pages. I have done nothing to filter them since they have different 
URLs, and for my particular use the URL makes them different. I only noticed 
this while looking at the results, so it is just a guess. For other people 
doing crawls I suspect it will happen fairly often. 

As to the change made to the initialization from absolute 0 to 
Double.MIN_VALUE: I claim ignorance of the Java Double implementation (years 
wasted in other programming languages). However, given the special nature of 
NaN, 0, and positive and negative infinity, and the need to correct for 
rounding errors in large-scale calculations, my intuition says it is the right 
thing to do. BTW, I will be much more careful myself in the future. When this 
is over I'd suggest a code review from a Java math expert; don't rely on me, 
obviously.
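
For anyone reviewing that change, here is a minimal sketch of the Java 
floating-point behavior involved. It is not taken from the Mahout patch; the 
class name DoubleEdgeCases and the helper ratio() are hypothetical and only 
stand in for a density-style "numerator / denominator" calculation. The one 
assumption-free fact it shows is that Double.MIN_VALUE is the smallest positive 
nonzero double, not the most negative one.

```java
public class DoubleEdgeCases {

    // Hypothetical helper standing in for any "sum / count"-style calculation.
    static double ratio(double numerator, double denominator) {
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Double.MIN_VALUE is the smallest POSITIVE nonzero double (about 4.9e-324).
        // The most negative finite double is -Double.MAX_VALUE.
        System.out.println(Double.MIN_VALUE > 0.0);

        // 0.0 / 0.0 is NaN, and NaN propagates through any later arithmetic.
        System.out.println(Double.isNaN(ratio(0.0, 0.0)));

        // With Double.MIN_VALUE as the denominator, a zero numerator yields 0.0
        // instead of NaN.
        System.out.println(ratio(0.0, Double.MIN_VALUE));

        // A numerator of 1.0 over Double.MIN_VALUE overflows to positive infinity.
        System.out.println(Double.isInfinite(ratio(1.0, Double.MIN_VALUE)));

        // NaN never compares equal to itself, so check with Double.isNaN, not ==.
        double nan = Double.NaN;
        System.out.println(nan == nan);
    }
}
```

So seeding with Double.MIN_VALUE can turn a 0/0 NaN into 0.0, but it can also 
turn an ordinary division into infinity. Whether that trade is right depends on 
how the value is used in the evaluator, which is why a review still seems wise.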

As to whether the value returned looks good: I am not sure what to expect, 
since I haven't seen a lot of values from that calculation, but I'll plug in 
your code to recreate the evaluator table (the one with several values of k) 
and we'll see what it looks like.

You reference "the book" in the question about normalization. I've been meaning 
to ask: where did the ClusterEvaluator work come from? 

I've got a fire drill on another project but will get back to this ASAP.
                
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is 
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
> also seen several cases where CDbw fails to return any results but have not 
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff 
> Eastman.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
