Yes. This is a good threshold to apply. Typically I have done this by simply putting a threshold on the LLR score itself. It is convenient to restate the score as its signed square root, since the sign tells you whether the cooccurrence is more or less common than expected and the result is on essentially the same scale as standard deviations of a normal distribution. On that scale, a cutoff in the range of 5 to 15 is commonly used. Five standard deviations corresponds to a one-sided p-value of about 3 x 10^-7, which indicates how stringent this criterion would be if it were a frequentist hypothesis test.
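To make the signed square root concrete, here is a minimal sketch in Python (standard library only). The function names are made up for illustration and this is not the Mahout code; it assumes the usual entropy formulation of LLR for a 2x2 contingency table with counts k11, k12, k21, k22.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from round-off.
    return max(0.0, 2.0 * (row + col - mat))


def signed_root_llr(k11, k12, k21, k22):
    """Square root of LLR, negated when the cooccurrence is rarer than expected."""
    root = math.sqrt(llr(k11, k12, k21, k22))
    # Compare k11/(k11+k12) with k21/(k21+k22) by cross-multiplying,
    # which avoids dividing by zero on empty rows.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        return -root
    return root


def chi2_1df_p_value(stat):
    """Upper-tail p-value of a chi^2 statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def normal_upper_tail(z):
    """One-sided upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def passes_cutoff(k11, k12, k21, k22, cutoff=5.0):
    """Hypothetical filter: keep a pair only if its signed root LLR exceeds cutoff."""
    return signed_root_llr(k11, k12, k21, k22) > cutoff
```

With these definitions, normal_upper_tail(5.0) comes out to about 2.9e-07, and chi2_1df_p_value(25.0) is twice that because the chi^2 test is two-sided in the root.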
In practice, I haven't found the cutoff to be that useful. Part of the reason is that I would just as soon have some wild-eyed behavior in low-data situations, so that some kind of recommendations get made and we can gather more data.

I don't see this threshold in our current RowSimilarityJob. There is a threshold, but it is applied to counts on only some similarity classes.

On Wed, Aug 6, 2014 at 5:07 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On Wed, Aug 6, 2014 at 5:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > > LLR is a classic test.
> > >
> > > What i meant here it doesn't produce a p-value. or does it?
> >
> > It produces an asymptotically chi^2 distributed statistic with 1-degree of
> > freedom (for our case of 2x2 contingency tables) which can be reduced
> > trivially to a p-value in the standard way.
>
> Great. so that means that we can do h_0 rejection based on a %-expressed
> level?