+1 for a better-than-absolute-threshold quality metric. Several users of the 
Hadoop item similarity have asked for this or have had difficulty using the 
existing absolute threshold. One method Ken Krugler asked for was to toss the 
lowest-LLR items up to some fraction of the total. He wanted to use this to 
lose the least "quality" while sparsifying the matrix, and it was on CF data. I 
understand that this conflates quality with sparsification, but there it is 
anyway.
 
On Aug 7, 2014, at 5:41 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Speaking from experience, I think that expressing the threshold as a
confidence has some attraction, but it can be a bit of a difficult interface.
For instance, the equivalent of a 5 standard deviation threshold is either
0.9999999 or 0.0000001 (or did I get those right?  Can you tell?).  In
either case, I think it is nicer to have something on a log scale somehow.

The options to make this easier all center on restating the threshold
on some naturally log-ish scale.  Log-odds, log p-value, and standard
deviations are all candidates for this.  The log odds for 0.9999999 are 16.1
in the natural base or 7 in base 10.  For 0.0000001, the log odds are -16.1
or -7.  On the SD scale, it is 5.7 or -5.7.

The problem is that these quantities are well outside the natural range for
percentiles.  In ordinary frequentist analysis, the quantities in play are
p-values like 0.05, 0.01, or even 0.001, which are practical in that
notation.  The much more extreme thresholds required for recommendations are
not so usable that way.
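The log-odds arithmetic above is easy to check with a few lines of Python; this is just an illustrative sketch (the `log_odds` helper is not Mahout code), using only the standard library:

```python
import math

def log_odds(p, base=math.e):
    """Log odds of a probability p: log(p / (1 - p)) in the given base."""
    return math.log(p / (1.0 - p), base)

print(round(log_odds(0.9999999), 1))      # about 16.1 in the natural base
print(round(log_odds(0.9999999, 10), 1))  # about 7 in base 10
print(round(log_odds(0.0000001), 1))      # about -16.1
```

This matches the figures quoted above, and shows why the log scale is friendlier: 16.1 and -16.1 are symmetric and easy to eyeball, while 0.9999999 and 0.0000001 are not.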




On Thu, Aug 7, 2014 at 2:00 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> If exploration and bootstrap are concerns, in my case saturation is
> achieved by a different methodology. I want this threshold to be (1)
> optional, of course, and (2) expressed as a confidence level in %, just
> to understand the ballpark in each case.
> 
> OK, I think I understand the code to convert a confidence level into an
> LLR threshold (and vice versa). Thanks.
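A hedged sketch of such a conversion: since the LLR statistic is asymptotically chi^2 with 1 degree of freedom, its tail probability is P(X > x) = erfc(sqrt(x/2)), which the Python stdlib can compute; the inverse is done here by bisection. The helper names are illustrative, not Mahout's API:

```python
import math

def llr_pvalue(llr):
    """Tail probability of a chi^2(1 df) statistic: P(X > llr)."""
    return math.erfc(math.sqrt(llr / 2.0))

def llr_threshold(confidence):
    """Invert llr_pvalue by bisection: the LLR score at which H0 is
    rejected at the given confidence level (e.g. 0.95 -> p-value 0.05)."""
    p, lo, hi = 1.0 - confidence, 0.0, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if llr_pvalue(mid) > p:   # p-value still too large: raise threshold
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(llr_threshold(0.95), 2))  # about 3.84, the chi^2(1) 95% quantile
```

In a real implementation one would use a chi^2 quantile function directly (e.g. `scipy.stats.chi2.ppf`), but the bisection keeps the sketch dependency-free.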
> 
> 
> On Thu, Aug 7, 2014 at 1:38 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
>> Yes.  This is a good thresholding to do.
>> 
>> Typically I have done this by simply providing a threshold on the LLR
>> score itself.  It is convenient to restate the score as the signed
>> square root of the LLR, since that lets you add information about
>> whether the cooccurrence is more or less common than expected, and it
>> puts the score essentially on the same scale as standard deviations
>> from a normal distribution.  On that scale, a cutoff in the range of 5
>> to 15 is commonly used.  The fact that 5 standard deviations represents
>> a p-value of about 3 x 10^-7 is indicative of how stringent this
>> criterion would be if it were a frequentist hypothesis test.
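A minimal sketch of the signed square root Ted describes, assuming the standard 2x2 G^2 (log-likelihood ratio) formula for cooccurrence counts; the function names are illustrative, not Mahout's API:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of cooccurrence counts."""
    n = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22  # row totals
    c1, c2 = k11 + k21, k12 + k22  # column totals
    def term(k, expected):
        return k * math.log(k / expected) if k > 0 else 0.0
    return 2.0 * (term(k11, r1 * c1 / n) + term(k12, r1 * c2 / n) +
                  term(k21, r2 * c1 / n) + term(k22, r2 * c2 / n))

def signed_root_llr(k11, k12, k21, k22):
    """Signed square root of the LLR: positive when the cooccurrence count
    k11 is higher than expected under independence, negative otherwise."""
    n = k11 + k12 + k21 + k22
    expected = (k11 + k12) * (k11 + k21) / n
    sign = 1.0 if k11 >= expected else -1.0
    return sign * math.sqrt(llr_2x2(k11, k12, k21, k22))

print(signed_root_llr(100, 1, 1, 100))  # strongly positive association
print(signed_root_llr(1, 100, 100, 1))  # strongly negative association
```

On this scale the 5-to-15 cutoff mentioned above applies directly, and the sign distinguishes over- from under-represented cooccurrences.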
>> 
>> In practice, I haven't found that the cutoff is that useful.  Part of
>> the reason is that I would just as soon have some wild-eyed behavior in
>> low-data situations, so that some kind of recommendations happen and we
>> can gather more data.
>> 
>> 
>> I don't see this threshold in our current RowSimilarityJob.  There is a
>> threshold, but it is applied to counts on only some similarity classes.
>> 
>> 
>> 
>> On Wed, Aug 6, 2014 at 5:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> wrote:
>> 
>>> On Wed, Aug 6, 2014 at 5:04 PM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>> 
>>>> On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>> wrote:
>>>> 
>>>>>> LLR is a classic test.
>>>>> 
>>>>> 
>>>>> What i meant here it doesn't produce a p-value. or does it?
>>>>> 
>>>> 
>>>> It produces an asymptotically chi^2-distributed statistic with 1
>>>> degree of freedom (for our case of 2x2 contingency tables), which can
>>>> be reduced trivially to a p-value in the standard way.
>>>> 
>>> 
>>> Great, so that means we can do H_0 rejection based on a %-expressed
>>> level?
>>> 
>> 
> 
