With k-means algorithms, you don't find out much about clumpiness because
a large cluster in the data will be split across multiple k-means clusters.
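
A minimal sketch of this effect, assuming a toy 1-D data set and a plain
Lloyd's iteration with a fixed initialization. None of this is Mahout code;
`lloyd_1d`, the data, and the starting centroids are made up for
illustration. With k=3, the wide clump ends up holding two of the centroids
and the tight clump only one, so the centroid count by itself says little
about how many clumps the data has.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Plain Lloyd's k-means on 1-D data; returns the final centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (ties go to the lower index).
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group; empty groups stay put.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


wide_clump = [float(x) for x in range(10)]  # one large, spread-out clump
tight_clump = [100.0, 100.1, 100.2]         # one small, dense clump
points = wide_clump + tight_clump

# Hypothetical fixed initialization: the first three points of the data.
centroids = lloyd_1d(points, points[:3])
in_wide = [c for c in centroids if c < 50]
print(len(in_wide), "of", len(centroids), "centroids fall inside the wide clump")
```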

On Wed, Jul 11, 2012 at 10:21 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> The average distance to the nearest cluster measures overall clumpiness
> found at a particular scale but does not address the cohesiveness of any
> particular clump. In any real-world data set some clusters will be cohesive
> and some not. This happens for at least two reasons: some data does not
> clump, and there are multiple scales for clumpiness. I believe this is an
> important distinction, and it implies the need for a per-cluster
> cohesiveness evaluation.
>
> It was my understanding that the ClusterEvaluator included an attempt to
> provide this measure with intra-cluster density per cluster though it looks
> like that output has been removed?
>
> On 7/8/12 6:07 PM, Ted Dunning wrote:
>
>> I can't comment on the existing evaluators, but for me the only real
>> measure that I care about is average distance to nearest cluster for new or
>> held-out data.  I will be building something of this sort for the
>> clustering part of the knn code I have been working on.
>>
>> On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>>     To use something like kmeans on any large and changing data set it
>>     seems a requirement that there be some means of evaluating the
>>     quality of clusters at different scales. The usual eyeballing
>>     breaks down quickly.
>>
>>     Trying to use the cluster evaluators in Mahout with kmeans as the
>>     clustering method and cosine as the distance measure has proven
>>     problematic. The method is to iterate through the data using
>>     different values of k, performing the evaluation for each one.
>>     What I find is that certain values are almost always in error. The
>>     intra-cluster density from ClusterEvaluator is almost always NaN.
>>     The CDbw inter-cluster density is almost always 0. I have also
>>     seen several cases where CDbw fails to return any results but have
>>     not yet tracked down why.
>>
>>     Given that the data from either evaluator is usually incomplete,
>>     these methods are not very useful. Is Mahout dropping the
>>     evaluators? Is the general wisdom that they are not particularly
>>     useful? Should a newer method be pursued? This seems a fairly
>>     important question to me. Am I missing something?
>>
>>     Raw data for a sample crawl is given below:
>>
>>
>>
>>
>>
>
>
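
The two measures discussed in this thread, the average distance from
held-out points to their nearest centroid and the same average broken out
per cluster, can be sketched as follows. This is an illustration under
stated assumptions and not the Mahout ClusterEvaluator or the knn code
mentioned above: `cosine_distance`, `nearest`, `evaluate`, and the sample
vectors are all hypothetical names and data, and the centroids are assumed
to come from an earlier clustering run.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [cosine_distance(point, c) for c in centroids]
    best = min(range(len(centroids)), key=dists.__getitem__)
    return best, dists[best]


def evaluate(held_out, centroids):
    """Overall and per-cluster mean distance to the nearest centroid."""
    per_cluster = {i: [] for i in range(len(centroids))}
    for p in held_out:
        i, d = nearest(p, centroids)
        per_cluster[i].append(d)
    all_d = [d for ds in per_cluster.values() for d in ds]
    overall = sum(all_d) / len(all_d)
    # None marks a cluster that attracted no held-out points,
    # which avoids dividing by zero for empty clusters.
    cohesion = {i: (sum(ds) / len(ds) if ds else None)
                for i, ds in per_cluster.items()}
    return overall, cohesion


# Made-up centroids and held-out vectors, for illustration only.
centroids = [[1.0, 0.0], [0.0, 1.0]]
held_out = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
overall, cohesion = evaluate(held_out, centroids)
print("overall:", overall)
print("per cluster:", cohesion)
```

Running `evaluate` once per candidate k on the same held-out set gives one
overall number per scale, while the per-cluster values show which
individual clumps are loose at that scale.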
