+ [email protected] so the conversation will be visible to others.

Hi Pat,

I'm at a loss as to how the inter-cluster density can be zero. The tests that were producing those values have been fixed. Does the ClusterEvaluator produce zero with your data too? If so, let's debug that one first as it shares the representative points computation and is a lot easier to debug.

How many representative points are you computing? Have you inspected them to see if they look ok? There are routines in the two evaluator unit tests that will print them out and we can make them public static if it will help. Since they are identical it might also make sense to move them into a public utility. I will do that if you think it will be useful.
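
For anyone following along, a minimal sketch of the kind of inspection meant here: print each cluster's representative points and count how many are actually distinct. This is not the routine from the unit tests. The class and method names are hypothetical, and plain double[] arrays stand in for Mahout vectors.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical diagnostic helper, NOT part of Mahout.
// Assumption: representative points are keyed by cluster id and held as double[].
final class RepresentativePointDump {

  private RepresentativePointDump() {}

  // Counts distinct points, comparing by printed coordinates.
  static int countDistinct(List<double[]> points) {
    Set<String> seen = new HashSet<>();
    for (double[] point : points) {
      seen.add(Arrays.toString(point));
    }
    return seen.size();
  }

  // Builds a report: one summary line per cluster, then one line per point.
  static String describe(Map<Integer, List<double[]>> pointsByCluster) {
    Map<Integer, List<double[]>> sorted = new TreeMap<>(pointsByCluster);
    StringBuilder report = new StringBuilder();
    for (Map.Entry<Integer, List<double[]>> entry : sorted.entrySet()) {
      List<double[]> points = entry.getValue();
      report.append("cluster ").append(entry.getKey())
          .append(": ").append(points.size()).append(" points, ")
          .append(countDistinct(points)).append(" distinct\n");
      for (double[] point : points) {
        report.append("  ").append(Arrays.toString(point)).append('\n');
      }
    }
    return report.toString();
  }
}
```

A cluster that reports only one distinct point is a likely candidate for pruning.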

-------- Original Message --------
Subject: Re: [jira] [Commented] (MAHOUT-1020) The Cluster Evaluator is returning bad results
Date:   Fri, 01 Jun 2012 09:43:45 -0700
From:   Pat Ferrel <[email protected]>
To:     Jeff Eastman <[email protected]>



It is always 0 on any data set I've tried, even when no pruning is
reported. The debug output I sent you had no reported pruning as I
recall. But again I'm on 0.6, upgrading as we write...

On 6/1/12 9:35 AM, Jeff Eastman wrote:
 I don't understand the inter-cluster density = 0. The tests that were
 producing those values were in error and they now produce reasonable
 looking densities. Have you taken a look at the representative points
 produced from your clusters? If they are all the same then pruning
 will occur and you might end up with nothing left to evaluate.


 On 6/1/12 12:28 PM, Pat Ferrel wrote:
 Sure, it is attached. It iterates through a small data set of 228
 docs and 3-7 clusters with kmeans. The results are the output of both
 evaluators on the resulting clusters. Still on 0.6 I'm afraid.

 How about the CDbw output of inter-cluster distance always = 0.0? I
 understand that it is an important measure.

 On 6/1/12 9:23 AM, Jeff Eastman wrote:
 This patch fixed a problem with the unit test that was causing
 kmeans and fuzzyk tests to fail. It did not change any of the CDbw
 evaluation code, which now seems to produce reasonable results for
 all tests. I also just fixed the same problem in the ClusterEvaluator.

 I can't seem to find the debug output you mention. Can you please
 repost it?


 On 6/1/12 11:12 AM, Pat Ferrel wrote:
 The representative point calc is used in the general case too, not
 just the test case. And didn't you say that bad representative
 points lead to having the cluster pruned? Also does this fix the
 inter-cluster distance always = 0?

 I do need to move to trunk I suppose, then I will test.

 BTW did the debug output I sent you look like reasonable results?

 On 6/1/12 8:05 AM, Hudson (JIRA) wrote:
 [ https://issues.apache.org/jira/browse/MAHOUT-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287466#comment-13287466 ]

 Hudson commented on MAHOUT-1020:
 --------------------------------

 Integrated in Mahout-Quality #1509 (See
 [https://builds.apache.org/job/Mahout-Quality/1509/])
      MAHOUT-1020: fixed path names for testKmeans and
 testFuzzyKmeans that were causing representative points
 calculation to fail. CDbw results now look more reasonable.
 (Revision 1345214)

       Result = FAILURE
 jeastman :
 http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345214
 Files :
 * /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java


 The Cluster Evaluator is returning bad results
 ----------------------------------------------

                  Key: MAHOUT-1020
                  URL: https://issues.apache.org/jira/browse/MAHOUT-1020
              Project: Mahout
           Issue Type: Bug
           Components: Clustering
     Affects Versions: 0.6
          Environment: Various environments and data sets. Mahout
 0.6, 0.7 trunk not tested.
             Reporter: Pat Ferrel
             Assignee: Jeff Eastman
              Fix For: 0.7


 Conversation between Pat Ferrel and Jeff Eastman on the user
 list:
 Hi Pat,
 I don't have a good answer here. Evidently, something in CDbw has
 become broken and you are the first to notice. When I run
 TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly
 incorrect. The values for Canopy, MeanShift and Dirichlet are not
 so obviously incorrect but I remain suspicious. Something must
 have become broken in the recent clustering refactoring.
 From the method CDbwEvaluator.invalidCluster comment (used to
 enable pruning):
     * Return if the cluster is valid. Valid clusters must have more than 2
     * representative points, and at least one of them must be different than
     * the cluster center. This is because the representative points extraction
     * will duplicate the cluster center if it is empty.
 Oddly enough, inspection of the test log indicates that only
 k-means and fuzzy-k are not pruning clusters. Clearly some more
 investigation is needed. I will take a look at it tomorrow. In
 the meantime if you develop any additional insight please do
 share it with us.
 Thanks,
 Jeff
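
 For readers of the thread, the validity rule quoted in the comment above can be sketched as follows. This is a literal reading of the comment text only, not the actual CDbwEvaluator code. The class and method names are hypothetical, and plain double[] arrays stand in for Mahout vectors.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT Mahout code.
// Assumption: the rule is exactly as worded in the quoted comment.
final class RepresentativePointCheck {

  private RepresentativePointCheck() {}

  // Valid means: more than 2 representative points, and at least one of
  // them differs from the cluster center.
  static boolean isValidCluster(double[] center, List<double[]> representativePoints) {
    if (representativePoints.size() <= 2) {
      return false;
    }
    for (double[] point : representativePoints) {
      if (!Arrays.equals(point, center)) {
        return true;
      }
    }
    // Every representative point is a copy of the center.
    return false;
  }

  public static void main(String[] args) {
    double[] center = {1.0, 1.0};
    List<double[]> duplicated =
        Arrays.asList(center.clone(), center.clone(), center.clone());
    List<double[]> spread =
        Arrays.asList(center.clone(), new double[] {2.0, 0.5}, new double[] {0.0, 1.5});
    System.out.println(isValidCluster(center, duplicated)); // prints false
    System.out.println(isValidCluster(center, spread));     // prints true
  }
}
```

 Under this reading, a cluster whose representative points are all copies of its center fails the check and would be pruned.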
 On 5/17/12 3:53 PM, Pat Ferrel wrote:
 I built a tool that iterates through a list of values for k on
 the same data and spits out the CDbw and ClusterEvaluator
 results each time.

 When the evaluator or CDbw prunes a cluster, how do I interpret
 that? They seem to throw out the same clusters on a given run.
 Also CDbw always returns an inter-cluster density of 0?
