To beat a very tired horse, I think that all squared error correlation measures (Pearson's chi-squared, Pearson's correlation, squared deviation and so on) are completely suspect for small count data. Furthermore, any reasonable sample of truly long-tail phenomena includes great numbers of small counts. Furtherfurthermore, long-tail phenomena are the rule rather than the exception.
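As a rough illustration of the small-count problem, here is a minimal sketch that compares Pearson's chi-squared with a log-likelihood ratio statistic (often written G^2) on a 2x2 table of counts. The function names `pearson_chi2` and `llr_g2` and the example counts are made up for this illustration, and the code assumes every row and column total is nonzero.

```python
import math


def _expected(k11, k12, k21, k22):
    """Return (observed, expected) cell counts under independence."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = (k11, k12, k21, k22)
    # Assumes all marginal totals are nonzero, so no expected count is 0.
    expected = tuple(rows[i] * cols[j] / n for i in range(2) for j in range(2))
    return observed, expected


def pearson_chi2(k11, k12, k21, k22):
    """Pearson's chi-squared: sum of squared deviations over expected counts."""
    observed, expected = _expected(k11, k12, k21, k22)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def llr_g2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic: 2 * sum of o * ln(o / e), skipping o == 0."""
    observed, expected = _expected(k11, k12, k21, k22)
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical counts: two items that each occur once, together,
# among 10,000 other observations.
single_cooccurrence = (1, 0, 0, 10000)
chi2 = pearson_chi2(*single_cooccurrence)
g2 = llr_g2(*single_cooccurrence)
print(f"chi-squared = {chi2:.2f}, G^2 = {g2:.2f}")
```

For this table the squared-error statistic equals the total count (about 10001), while the likelihood ratio comes out near 20.4. Both are computed from a single co-occurrence, but the chi-squared score grows linearly with the amount of unrelated data, whereas the likelihood ratio grows far more slowly.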
Thus, I almost never like these measures and would have a hard time arguing that there is anything good about this kind of measure. The only exception would be in a pub where I would take any side of any debate for the amusement of the crowd.

Try mutual information or multinomial likelihood ratios instead.

On Tue, Jun 23, 2009 at 3:48 PM, Sean Owen <[email protected]> wrote:
> One could argue that this behavior is actually a good thing -- basing
> an estimate of similarity based on one data point could be very
> unreliable.
>
