To beat a very tired horse, I think that all squared-error correlation
measures (Pearson's chi-squared, Pearson's correlation, squared deviation
and so on) are completely suspect for small-count data.  Furthermore, any
reasonable sample of truly long-tail phenomena includes great numbers of
small counts.  Furtherfurthermore, long-tail phenomena are the rule rather
than the exception.

Thus, I almost never like these measures and would have a hard time arguing
that there is anything good about this kind of measure.  The only exception
would be in a pub where I would take any side of any debate for the
amusement of the crowd.

Try mutual information or multinomial likelihood ratios instead.
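
To make that concrete, here is a rough sketch of the log-likelihood ratio
(G^2) for a 2x2 table of co-occurrence counts, next to Pearson's chi-squared
for the same table.  The function names and the example counts are made up
for illustration, not taken from any particular library.  G^2 is 2N times
the mutual information of the table (in nats), so the two suggestions above
lead to the same computation here.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy: N*log(N) - sum(k*log(k)), in nats.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr_2x2(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for independence in a 2x2 count table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def pearson_chi2_2x2(k11, k12, k21, k22):
    # Pearson's chi-squared for the same table (assumes no empty margin).
    n = k11 + k12 + k21 + k22
    num = n * (k11 * k22 - k12 * k21) ** 2
    den = (k11 + k12) * (k21 + k22) * (k11 + k21) * (k12 + k22)
    return num / den


if __name__ == "__main__":
    # Hypothetical counts: two items that each occur once, together,
    # among 10000 other observations.
    table = (1, 0, 0, 10000)
    print("chi-squared:", pearson_chi2_2x2(*table))
    print("LLR (G^2):  ", llr_2x2(*table))
```

On that single co-occurrence, chi-squared comes out in the thousands while
the likelihood ratio stays modest, which is the small-count problem in
miniature.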

On Tue, Jun 23, 2009 at 3:48 PM, Sean Owen <[email protected]> wrote:

> One could argue that this behavior is actually a good thing -- basing
> an estimate of similarity based on one data point could be very
> unreliable.
>
