No argument there, and that is exactly one of my points. Real data often clusters at multiple scales. Using k-means to find this involves calculating clusters at several scales and evaluating the results for each scale factor (k), but that evaluation is only an average. However, I think this will always create some bad/non-cohesive clusters (at any scale), and it would be nice to have a way to throw these out or at least flag them.
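
To make the multi-scale idea concrete, here is a rough sketch of the kind of sweep I mean (plain Java, not a Mahout API; the "clusterer" argument is a placeholder for whatever produces centroids for a given k, e.g. one k-means run per k):

import java.util.function.IntFunction;

public class KSweep {

    // Cosine distance: 1 - cos(a, b). Assumes non-zero vectors.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Average distance from each point to its nearest centroid; lower means tighter clusters overall.
    static double meanNearestCentroidDistance(double[][] points, double[][] centroids) {
        double total = 0;
        for (double[] p : points) {
            double best = Double.MAX_VALUE;
            for (double[] c : centroids) {
                best = Math.min(best, cosineDistance(p, c));
            }
            total += best;
        }
        return total / points.length;
    }

    // Score a range of k values. 'clusterer' stands in for whatever produces
    // centroids for a given k; it is a placeholder, not a real Mahout call.
    static void sweep(double[][] points, int minK, int maxK, IntFunction<double[][]> clusterer) {
        for (int k = minK; k <= maxK; k++) {
            double[][] centroids = clusterer.apply(k);
            System.out.printf("k=%d  mean distance to nearest centroid=%.4f%n",
                    k, meanNearestCentroidDistance(points, centroids));
        }
    }
}

The score here is only an average over all points, which is exactly why it can look fine at some k while individual clusters are still loose.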

Wouldn't some measure of the distribution of points in each cluster give us a way to measure each cluster's cohesiveness?
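
Roughly what I have in mind is to treat each cluster's spread around its own centroid as a per-cluster cohesiveness score and flag the loose ones. A sketch (the threshold is made up and would have to be tuned per data set, e.g. relative to the median spread):

import java.util.ArrayList;
import java.util.List;

public class ClusterCohesion {

    // Cosine distance: 1 - cos(a, b). Assumes non-zero vectors.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Centroid (component-wise mean) of a cluster's members; assumes a non-empty cluster.
    static double[] centroid(List<double[]> members) {
        double[] c = new double[members.get(0).length];
        for (double[] m : members) {
            for (int i = 0; i < c.length; i++) c[i] += m[i];
        }
        for (int i = 0; i < c.length; i++) c[i] /= members.size();
        return c;
    }

    // Per-cluster cohesiveness: mean member-to-centroid distance. Smaller is tighter.
    static double spread(List<double[]> members) {
        double[] c = centroid(members);
        double total = 0;
        for (double[] m : members) total += cosineDistance(m, c);
        return total / members.size();
    }

    // Flag clusters whose spread exceeds a threshold; the threshold is arbitrary
    // and data-dependent, so this is only meant as an illustration.
    static List<Integer> flagLooseClusters(List<List<double[]>> clusters, double maxSpread) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < clusters.size(); i++) {
            if (spread(clusters.get(i)) > maxSpread) flagged.add(i);
        }
        return flagged;
    }
}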

BTW, I imagine there are more elegant ways to cluster at multiple scales, perhaps even all at once, but I haven't found one and would welcome enlightenment. Blindly running hierarchical clustering is not a fair answer, since it has the same problems mentioned above.
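
For reference, my reading of the held-out measure Ted describes in his 7/8 message below would be something like this sketch (it assumes the same cosineDistance helper as the first snippet and is not an existing Mahout call):

// Fix the centroids found on the training data, then average each held-out
// point's distance to its nearest centroid; lower means the clustering generalizes better.
static double heldOutScore(double[][] heldOut, double[][] centroids) {
    double total = 0;
    for (double[] p : heldOut) {
        double best = Double.MAX_VALUE;
        for (double[] c : centroids) {
            best = Math.min(best, cosineDistance(p, c));
        }
        total += best;
    }
    return total / heldOut.length;
}

This still measures the overall fit at one scale, though, so it does not by itself say which individual clusters are the incoherent ones.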

On 7/11/12 10:36 AM, Ted Dunning wrote:
With k-means algorithms, you don't find out much about clumpiness because
large clusters in the data will get multiple k-means clusters.

On Wed, Jul 11, 2012 at 10:21 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

The average distance to the nearest cluster measures the overall clumpiness
found at a particular scale but does not address the cohesiveness of any
particular clump. In any real-world data set some clusters will be cohesive
and some will not. This happens for at least two reasons: some data does not
clump, and there are multiple scales of clumpiness. This is an important
distinction, I believe, and it implies the need for a per-cluster
cohesiveness evaluation.

It was my understanding that the ClusterEvaluator included an attempt to
provide this measure via a per-cluster intra-cluster density, though it looks
like that output has been removed?

On 7/8/12 6:07 PM, Ted Dunning wrote:

I can't comment on the existing evaluators, but for me the only real
measure that I care about is average distance to nearest cluster for new or
held-out data.  I will be building something of this sort for the
clustering part of the knn code I have been working on.

On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

     To use something like k-means on any large and changing data set,
     it seems a requirement that there be some means of evaluating the
     quality of clusters at different scales. The usual eyeballing
     breaks down quickly.

     Trying to use the cluster evaluators in Mahout with k-means as the
     clustering method and cosine as the distance measure has proven
     problematic. The method is to iterate through the data using
     different values of k and perform the evaluation at each point.
     What I find is that certain values are almost always in error. The
     intra-cluster density from ClusterEvaluator is almost always NaN.
     The CDbw inter-cluster density is almost always 0. I have also
     seen several cases where CDbw fails to return any results, but I
     have not tracked down why yet.

     Given that the data from either evaluator is usually incomplete,
     these methods are not very useful. Is Mahout dropping the
     evaluators? Is the general wisdom that they are not particularly
     useful? Should a newer method be pursued? This seems a fairly
     important question to me; am I missing something?

     Raw data for a sample crawl is given below:







