With k-means algorithms, you don't learn much about clumpiness, because a large cluster in the data will be split across multiple k-means clusters.
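For concreteness, here is a rough sketch of the measure discussed in this thread: the average distance from held-out points to their nearest centroid, with a per-cluster breakdown alongside the overall figure. This is plain Python, not Mahout's API. The function names (`cosine_distance`, `nearest_centroid`, `evaluate_held_out`) are made up for illustration, and reporting `None` for a cluster that attracts no held-out points is my own choice, not anything the existing evaluators do.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, 1 - cos(a, b), between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector has no direction, so report distance 1.0
        # instead of dividing by zero.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_centroid(point, centroids, dist=cosine_distance):
    """Return (index, distance) of the centroid closest to point."""
    best_index = 0
    best_dist = dist(point, centroids[0])
    for i in range(1, len(centroids)):
        d = dist(point, centroids[i])
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


def evaluate_held_out(held_out, centroids, dist=cosine_distance):
    """Average distance to the nearest centroid, overall and per cluster.

    Returns (overall_mean, per_cluster_means). A cluster that no held-out
    point is assigned to gets None rather than a NaN from 0/0.
    """
    totals = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for point in held_out:
        i, d = nearest_centroid(point, centroids, dist)
        totals[i] += d
        counts[i] += 1
    overall = sum(totals) / len(held_out)
    per_cluster = [t / c if c else None for t, c in zip(totals, counts)]
    return overall, per_cluster


if __name__ == "__main__":
    # Toy centroids and held-out points, invented for this example.
    centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    held_out = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
    overall, per_cluster = evaluate_held_out(held_out, centroids)
    print("overall:", overall)
    print("per cluster:", per_cluster)
```

Running this for each candidate k on the same held-out set gives one overall number per k, and the per-cluster list shows which clusters are tight and which are loose at that scale.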
On Wed, Jul 11, 2012 at 10:21 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> The average distance to the nearest cluster measures overall clumpiness
> found at a particular scale but does not address the cohesiveness of any
> particular clump. In any real world data set some clusters will be cohesive
> and some not. This happens for at least two reasons: some data does not
> clump, and there are multiple scales for clumpiness. This is an important
> distinction, I believe, and implies the need for a per-cluster cohesiveness
> evaluation.
>
> It was my understanding that the ClusterEvaluator included an attempt to
> provide this measure with intra-cluster density per cluster, though it looks
> like that output has been removed?
>
> On 7/8/12 6:07 PM, Ted Dunning wrote:
>
>> I can't comment on the existing evaluators, but for me the only real
>> measure that I care about is average distance to nearest cluster for new or
>> held-out data. I will be building something of this sort for the
>> clustering part of the knn code I have been working on.
>>
>> On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>>     To use something like kmeans on any large and changing data set it
>>     seems a requirement that there be some means of evaluating the
>>     quality of clusters at different scales. The usual eyeballing
>>     breaks down quickly.
>>
>>     Trying to use the cluster evaluators in Mahout with kmeans as the
>>     clustering method and cosine as the distance measure has proven
>>     problematic. The method is to iterate through the data using
>>     different ks, performing the evaluation at each point. What I
>>     find is that certain values are almost always in error. The
>>     intra-cluster density from ClusterEvaluator is almost always NaN.
>>     The CDbw inter-cluster density is almost always 0. I have also
>>     seen several cases where CDbw fails to return any results but have
>>     not tracked down why yet.
>>
>>     Given that the data for either evaluator is usually incomplete,
>>     these methods are not very useful. Is Mahout dropping the
>>     evaluators? Is the general wisdom that they are not particularly
>>     useful? Should a newer method be pursued? This seems a fairly
>>     important question to me; am I missing something?
>>
>>     Raw data for a sample crawl is given below: