The problem was not in your corpus, but in your test (to misquote Marc Antony).
See my surprise and coincidence paper that shows why chi-squared tests are evil
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186> for this
sort of application. The short answer is that they over-estimate how excited
you should be by as much as 300 orders of magnitude.

On Wed, Aug 5, 2009 at 3:26 PM, Tanton Gibbs <[email protected]> wrote:

> The problem was that the collection
> was so large that ANY repeated connection looked statistically
> significant (I was using chi-squares). I eventually had to apply a
> cutoff, but I wonder if there was a more elegant way to do it. I
> realize this is not the same thing as the OP's question - hope you
> don't mind :)
>

-- 
Ted Dunning, CTO
DeepDyve
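
To make the point about chi-squared tests concrete, here is a minimal sketch comparing Pearson's chi-squared statistic with the log-likelihood ratio statistic (G²) on a 2x2 contingency table. G² is, as far as I know, the alternative the linked paper argues for, but that is not stated in this message, so treat it as an assumption. The function names and the counts are hypothetical and chosen only for illustration: two items that each occur exactly once in a collection of a million records, and happen to occur together.

```python
import math


def _xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def _entropy_term(*counts):
    # Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 table of counts."""
    row = _entropy_term(k11 + k12, k21 + k22)
    col = _entropy_term(k11 + k21, k12 + k22)
    mat = _entropy_term(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off
    return max(0.0, 2.0 * (row + col - mat))


def pearson_chi2_2x2(k11, k12, k21, k22):
    """Pearson chi-squared statistic for the same 2x2 table."""
    n = k11 + k12 + k21 + k22
    denom = (k11 + k12) * (k21 + k22) * (k11 + k21) * (k12 + k22)
    if denom == 0:
        return 0.0
    return n * (k11 * k22 - k12 * k21) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical counts: A and B co-occur once, and never occur apart,
    # in a collection of 1,000,000 records.
    table = (1, 0, 0, 999_999)
    print("Pearson chi^2:", pearson_chi2_2x2(*table))
    print("G^2 (LLR):    ", llr_2x2(*table))
```

On these counts Pearson's statistic comes out at about a million from a single co-occurrence, while G² stays below 30. That gap is the kind of over-excitement described above, and it suggests ranking by G² may be a more principled option than applying an ad hoc cutoff to chi-squared scores.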
