I understand the eyeball method, but I’m not sure users will, so I’m working on a 
t-digest calculation of an LLR threshold. The goal is to maintain a certain 
sparsity at maximum “quality”. But I have a few questions.

You mention root LLR, OK, but that will produce negative values. I assume:

1) We should use the absolute value of root LLR for ranking in the max # of 
indicators sense. There seems to be no value in computing sqrt( | rootLLR | ) since 
the ranking will not change, but we can’t just use the value returned by the Java 
root LLR function directly.
2) Likewise, we use the absolute value of root LLR to compare with the 
threshold. Put another way, without the absolute value a score passes the 
LLR threshold test only if value < mean - threshold or value > mean + threshold.
3) However, both the positive and negative root LLR values would be used in the 
t-digest quantile calculation, which ideally would have mean = 0.

This seems simple, but I’m just checking my understanding: are these correct?
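
For concreteness, this is roughly what I have in mind. A rough sketch only: it 
assumes the com.tdunning t-digest library and Mahout’s 
LogLikelihood.rootLogLikelihoodRatio; the object and helper names are just mine 
for illustration.

// Rough sketch only. Assumes the com.tdunning t-digest library
// (com.tdunning.math.stats.TDigest) and Mahout's
// org.apache.mahout.math.stats.LogLikelihood; helper names are mine.
import com.tdunning.math.stats.TDigest
import org.apache.mahout.math.stats.LogLikelihood

object LlrThresholdSketch {

  // Signed root LLR straight from Mahout; negative when the pair
  // co-occurs less than expected.
  def rootLlr(k11: Long, k12: Long, k21: Long, k22: Long): Double =
    LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22)

  // (3) Feed the signed values into the digest, so the distribution is
  // roughly centered on 0, then take a symmetric two-sided cut that keeps
  // about keepFraction of the mass in the tails.
  def thresholdFor(signedScores: Seq[Double], keepFraction: Double): Double = {
    val digest = TDigest.createMergingDigest(100.0) // compression = 100
    signedScores.foreach(s => digest.add(s))
    val lower = digest.quantile(keepFraction / 2.0)
    val upper = digest.quantile(1.0 - keepFraction / 2.0)
    math.max(math.abs(lower), math.abs(upper))
  }

  // (1) and (2) Rank and threshold on |rootLLR|.
  def passes(score: Double, threshold: Double): Boolean =
    math.abs(score) > threshold
}

The point being that the signed values only matter for building the distribution; 
ranking and the pass/fail test would both use the absolute value.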


On Jan 2, 2016, at 3:17 PM, Ted Dunning <tdunn...@maprtech.com> wrote:


I usually like to use a combination of a fixed threshold for llr plus a max 
number of indicators.

The fixed threshold I use is typically around 20-30 for raw LLR which 
corresponds to about 5 for root LLR. I often eyeball the lists of indicators 
for items that I understand to find a point where the list of indicators 
becomes about half noise, half useful indicators.
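
In code the raw/root relationship is trivial, something like this (a sketch of the 
arithmetic only, nothing Mahout-specific):

// A raw-LLR threshold of ~25 is the same cut as a root-LLR threshold of ~5,
// since root LLR is roughly sign(observed - expected) * sqrt(LLR).
val rawThreshold = 25.0
val rootThreshold = math.sqrt(rawThreshold) // = 5.0
def keep(rootLlrScore: Double): Boolean = math.abs(rootLlrScore) >= rootThreshold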

On Sat, Jan 2, 2016 at 2:15 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
One interesting thing we saw is that like-genre was better discarded and 
dislike-genre left in the mix.

This brings up a fundamental issue with how we use LLR to downsample in Mahout. 
In this case, by downsampling I mean llr(A’B), where we keep some max number of 
indicators based on the best LLR scores. For the primary action—something like 
“buy”—this works well since there are usually quite a lot of items, but for B 
there may be very few items; genres are an example. Using the same max # of 
indicators for A’A as well as all the rest (A’B, etc.) means that very little, if 
any, downsampling based on LLR score happens for A’B. So for A’B the result 
is really more like simple cross-cooccurrence.
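
For reference, the downsampling I mean is roughly this per-row top-N (a simplified 
sketch, not Mahout’s actual code; names and types are made up):

// Simplified sketch of the current downsampling of llr(A'B): for each item row,
// keep at most maxIndicatorsPerItem indicators, ranked by LLR score.
def downsample(
    llrRows: Map[String, Seq[(String, Double)]], // itemA -> (correlated item, LLR score)
    maxIndicatorsPerItem: Int): Map[String, Seq[(String, Double)]] =
  llrRows.map { case (itemA, scored) =>
    itemA -> scored.sortBy(-_._2).take(maxIndicatorsPerItem)
  }

When the B space is tiny, as with genres, the scored list is already well under the 
default max, so the take() removes nothing and no LLR-based filtering happens.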

This seems worth addressing, if only because in our analysis the effect made 
like-genre useless, when intuition would say that it should be useful. Our 
hypothesis is that since no downsampling happened and very many of the 
reviewers preferred nearly all of the genres, it had no differentiating value. If 
we had changed the per-item max indicators to some smaller number, this might 
have left only strongly correlated like-genre indicators.

Assuming I’ve got the issue correctly identified, the options I can think of are:
1) Use a fixed numeric LLR threshold for A’B or any other cross-cooccurrence 
indicator. This seems pretty impractical.
2) Add a max-indicators threshold param for each of the secondary indicators 
(see the sketch after this list). This would be fairly easy and could be based on 
the # of B items. Some method of choosing this might end up being ~100 for A’A 
(the default) and a function of the # of items in B, C, etc. The plus is that this 
would be easy and keep the calculation at O(n), but the function that returns 100 
for A, and some smaller number for B, C, and the rest, is not clear to me.
3) Create a threshold based on the distribution of llr(A’B). This could be 
based on a correlation confidence (actually confidence of non-correlation for 
LLR). The downside is that this means we need to calculate all of llr(A’B), 
which approaches O(n^2), and then downsample the complete llr(A’B). This 
removes the rather significant practical benefit of the current downsampling 
algorithm. Practically speaking, most indicators will either have dimensionality on 
the order of the # of A items or will be very much smaller, like the # of genres. 
So maybe calculating the distribution of llr(A’B) wouldn’t be too bad if it’s only 
done when B has a small number of items. In the small-B case it would be O(n*m), 
where m is the number of items in B and n is the number of items in A, and m << 
n, so this would be nearly O(n). Also, this could be mixed with #2 and only 
recalculated every so often, since it probably won’t change very much in any one 
application.
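
To make option 2 concrete, it would look something like the following. The scaling 
function is exactly the part I don’t know how to choose, so treat the 10% figure 
as a placeholder, not a proposal:

// Sketch of option 2: a per-correlator cap on indicators, derived from the size
// of the secondary event's item space. The scaling rule here is hypothetical.
def maxIndicatorsFor(numSecondaryItems: Int, default: Int = 100): Int =
  if (numSecondaryItems >= default) default                   // A'A keeps the default ~100
  else math.max(1, math.ceil(numSecondaryItems * 0.1).toInt)  // small B, C, ... get a smaller cap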

I guess I’d be inclined to test by trying a range of max # of indicators on our 
test data, since the number of genres is small. If there is any place that 
produces significantly better results we could proceed to try the confidence 
method and see if it allows us to calculate the optimal #. If so, then we could 
implement this for very occasional calculation on live datasets.

Any advice?

> On Dec 30, 2015, at 2:26 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> 
> 
> This is really nice work!
> 
> On Wed, Dec 30, 2015 at 11:50 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> As many of you know, Mahout-Samsara includes an interesting and important 
> extension to cooccurrence similarity, which supports cross-cooccurrence and 
> log-likelihood downsampling. This, when combined with a search engine, gives 
> us a multimodal recommender. Some of us integrated Mahout with a DB and 
> search engine to create what we call (humbly) the Universal Recommender. 
> 
> We just completed a tool that measures the effects of what we call secondary 
> events or indicators using the Universal Recommender. It calculates a 
> ranking-based precision metric called mean average precision—MAP@k. We took a 
> dataset from the Rotten Tomatoes web site of “fresh” and “rotten” reviews and 
> combined that with data about the genres, casts, directors, and writers of 
> the various video items. This gave us the indicators below:
> like, video-id <== primary indicator
> dislike, video-id
> like-genre, genre-id
> dislike-genre, genre-id
> like-director, director-id
> dislike-director, director-id
> like-writer, writer-id
> dislike-writer, writer-id
> like-cast, cast-member-id
> dislike-cast, cast-member-id
> These aren’t necessarily what we would have chosen if we were designing 
> something from scratch but are possible to gather from public data.
> 
> We have only ~5000 mostly professional reviewers with ~250k video items in 
> this dataset but have a larger one we are integrating. We are also writing a 
> white paper and blog post with some deeper analysis. There are several 
> tidbits of insight when you look deeper.
> 
> The bottom line is that using most of the above indicators we were able to 
> get a 26% increase in MAP@1 over using only “like”. This is important because 
> the vast majority of recommenders can only really ingest one type of 
> indicator.
> 
> http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
> https://github.com/actionml/template-scala-parallel-universal-recommendation

