Yes.  This is a good threshold to apply.

Typically I have done this by simply putting a threshold on the LLR score
itself.  It is convenient to restate the score as its signed square root,
since that adds information about whether the cooccurrence is more or less
common than expected and puts the score essentially on the same scale as
standard deviations from a normal distribution.  On that scale, a cutoff in
the range of 5 to 15 is commonly used.  The fact that 5 standard deviations
corresponds to a p-value of about 3 x 10^-7 indicates how stringent this
criterion would be if it were treated as a frequentist hypothesis test.
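To make the signed-square-root idea concrete, here is a minimal Python sketch
(not Mahout's actual implementation; the function names are my own).  It
computes the LLR (G^2) statistic for a 2x2 contingency table and then returns
its signed square root, with the sign taken from whether the observed
cooccurrence count exceeds its expected value under independence:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """LLR (G^2) statistic for a 2x2 contingency table.
    Asymptotically chi^2 distributed with 1 degree of freedom."""
    def h(*counts):
        # Unnormalized entropy term: sum k * ln(k / total), skipping zeros.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    rows = h(k11 + k12, k21 + k22)
    cols = h(k11 + k21, k12 + k22)
    full = h(k11, k12, k21, k22)
    return 2.0 * (full - rows - cols)

def signed_root_llr(k11, k12, k21, k22):
    """Signed square root of the LLR score: positive when the cooccurrence
    (k11) is more common than expected under independence, negative when it
    is less common.  Roughly comparable to standard deviations."""
    total = k11 + k12 + k21 + k22
    expected_k11 = (k11 + k12) * (k11 + k21) / total
    sign = 1.0 if k11 > expected_k11 else -1.0
    return sign * math.sqrt(max(llr_2x2(k11, k12, k21, k22), 0.0))
```

With this, a threshold like "keep pairs with signed root above 5" is a
one-line filter on the output of signed_root_llr.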

In practice, I haven't found the cutoff all that useful.  Part of the
reason is that I would just as soon have some wild-eyed behavior in
low-data situations so that some kind of recommendations still happen
and we can gather more data.


I don't see this threshold in our current RowSimilarityJob.  There is a
threshold, but it is applied to counts, and only for some similarity
classes.



On Wed, Aug 6, 2014 at 5:07 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On Wed, Aug 6, 2014 at 5:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > > LLR is a classic test.
> > >
> > >
> > > What i meant here it doesn't produce a p-value. or does it?
> > >
> >
> > It produces an asymptotically chi^2 distributed statistic with 1-degree
> of
> > freedom (for our case of 2x2 contingency tables) which can be reduced
> > trivially to a p-value in the standard way.
> >
>
> Great. so that means that we can do h_0 rejection based on a %-expressed
> level?
>
