I think that he means cluster sizes rather than term weights.

For text, term frequencies follow an approximate power law.
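
A rough way to check whether cluster sizes show this kind of scaling is to fit a straight line to log(size) against log(rank). The sketch below is only illustrative: the function name `rank_size_slope` is made up for this example and is not part of Mahout. A roughly linear log-log plot is a quick diagnostic, not a rigorous test for a power law.

```python
import math


def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank), largest cluster first.

    Hypothetical helper for illustration only. A slope that stays roughly
    constant across ranks is consistent with power-law size scaling.
    """
    # Ignore empty clusters, since log(0) is undefined.
    ordered = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two non-empty clusters")

    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For example, sizes proportional to 1/rank give a slope of about -1. Running the same fit on cluster sizes obtained at several values of k and comparing the slopes would be one crude notion of self-similarity.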

On Mon, Jul 9, 2012 at 10:06 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Sorry, I'm not following this shorthand. Are you asking if the term
> weights of each centroid follow a power law, like they are supposed to?
>
> On 7/9/12 12:34 AM, Lance Norskog wrote:
>
>> Power law size scaling.
>>
>> On Sun, Jul 8, 2012 at 11:39 PM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>
>>> What do you mean by self similarity?  Power law size scaling?  Or that
>>> two successive clusterings get nearly the same answer?
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 8, 2012, at 8:40 PM, Lance Norskog <goks...@gmail.com> wrote:
>>>
>>>> Are there any measures of self-similarity?
>>>>
>>>> On Sun, Jul 8, 2012 at 6:07 PM, Ted Dunning <ted.dunn...@gmail.com>
>>>> wrote:
>>>>
>>>>> I can't comment on the existing evaluators, but for me the only real
>>>>> measure that I care about is average distance to nearest cluster for
>>>>> new or held-out data.  I will be building something of this sort for
>>>>> the clustering part of the knn code I have been working on.
>>>>>
>>>>>
>>>>> On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <p...@occamsmachete.com>
>>>>> wrote:
>>>>>
>>>>>> To use something like kmeans on any large and changing data set it
>>>>>> seems a requirement that there be some means of evaluating the
>>>>>> quality of clusters at different scales. The usual eyeballing breaks
>>>>>> down quickly.
>>>>>>
>>>>>> Trying to use the cluster evaluators in Mahout with kmeans as the
>>>>>> clustering method and cosine as the distance measure has proven
>>>>>> problematic. The method is to iterate through the data using
>>>>>> different values of k, performing the evaluation at each one. What I
>>>>>> find is that certain values are almost always in error. The
>>>>>> intra-cluster density from ClusterEvaluator is almost always NaN.
>>>>>> The CDbw inter-cluster density is almost always 0. I have also seen
>>>>>> several cases where CDbw fails to return any results, but I have not
>>>>>> tracked down why yet.
>>>>>>
>>>>>> Given that the data for either evaluator is usually incomplete, these
>>>>>> methods are not very useful. Is Mahout dropping the evaluators? Is the
>>>>>> general wisdom that they are not particularly useful? Should a newer
>>>>>> method be pursued? This seems a fairly important question to me. Am I
>>>>>> missing something?
>>>>>>
>>>>>> Raw data for a sample crawl is given below:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
>>>>
>>>
>>
>>
>
>

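The held-out measure mentioned in the thread (average distance to the nearest cluster for new or held-out data) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Mahout evaluators or the knn code referred to above: the function names are hypothetical, vectors are plain Python lists, and a zero-length vector is arbitrarily assigned a cosine distance of 1.0 to avoid dividing by zero.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: treat a zero vector as maximally
        # dissimilar instead of producing NaN from a division by zero.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance_to_nearest(points, centroids, distance=cosine_distance):
    """Average, over held-out points, of the distance to the closest centroid."""
    if not points or not centroids:
        raise ValueError("need at least one point and one centroid")
    total = sum(min(distance(p, c) for c in centroids) for p in points)
    return total / len(points)


if __name__ == "__main__":
    # Toy data: two centroids on the axes and three held-out points.
    centroids = [[1.0, 0.0], [0.0, 1.0]]
    held_out = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
    print(mean_distance_to_nearest(held_out, centroids))
```

Computing this value for each candidate k on data that was not used for training gives a single curve to compare across scales. Lower is better, but the value generally falls as k grows, so the shape of the curve matters more than any single number.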