Re: Blending initial recommendations for cross recommendation

Pat Ferrel Sat, 01 Jun 2013 13:23:01 -0700

I use MAP (mean average precision), see 
http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision . 
I evaluate either with randomly held-out data or, more usually, with a 
time-based split: the oldest 90% for training and the most recent 10% for 
testing.
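
For anyone who wants to try this, here is a minimal sketch of a time-based 
split and a MAP calculation. The function names and the (timestamp, user, item) 
event layout are assumptions made for illustration, not code from my setup, and 
average precision is normalized by the number of held-out items (other 
normalizations exist).

```python
from typing import Dict, List, Optional, Set, Tuple

Event = Tuple[float, str, str]  # (timestamp, user, item); hypothetical layout


def time_split(events: List[Event], train_fraction: float = 0.9):
    """Oldest `train_fraction` of events go to train, the rest to test."""
    ordered = sorted(events, key=lambda e: e[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def average_precision(ranked: List[str], relevant: Set[str],
                      k: Optional[int] = None) -> float:
    """Average of precision@rank over the ranks where a held-out item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(recs: Dict[str, List[str]],
                           held_out: Dict[str, Set[str]],
                           k: Optional[int] = None) -> float:
    """Mean of average precision over users that have held-out items."""
    users = [u for u in held_out if held_out[u]]
    if not users:
        return 0.0
    return sum(average_precision(recs.get(u, []), held_out[u], k)
               for u in users) / len(users)
```

The held-out sets would be built from the purchases in the test part of the 
split, and the recommender trained only on the train part.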


We looked at the power of purchases and views for predicting purchases and, as 
you'd expect, found purchases far more predictive. Therefore I do the 
following for a cross-recommender:

B = matrix purchases userId x itemId, 1=purchase, no value for no purchase.
A = matrix views userId x itemId
H_p = history of purchase vectors for each user, users as columns, so H_p = B'

[B'B]H_p = R_p = recommendations made from user history. In my case this is 
actually the result of using the item-based Mahout recommender trained on 
purchases with similarity = log-likelihood (we tried other similarity measures 
and this worked best). The [B'B] is actually row similarity using 
log-likelihood. Since B'B is symmetric, the row similarity works. For B'A, 
which is not symmetric, row similarity doesn't work because there are two 
matrices. To use the log-likelihood sparsifier there, I'm looking at modifying 
the matrix multiply code.
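
As a rough illustration of what a log-likelihood score for the cross case could 
look like, here is a sketch of the standard log-likelihood ratio (G^2) for a 
2x2 contingency table, applied to "purchased item i" vs "viewed item j". The 
function names and the set-based inputs are assumptions for the example; this 
is not the Mahout job itself.

```python
from math import log
from typing import Set


def xlogx(x: float) -> float:
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts: float) -> float:
    """Unnormalized entropy term used in the G^2 computation."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def llr(k11: float, k12: float, k21: float, k22: float) -> float:
    """Log-likelihood ratio for a 2x2 table of cooccurrence counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def cross_llr(buyers_of_i: Set[str], viewers_of_j: Set[str],
              n_users: int) -> float:
    """Score how strongly viewing item j is associated with purchasing item i."""
    k11 = len(buyers_of_i & viewers_of_j)   # purchased i and viewed j
    k12 = len(buyers_of_i) - k11            # purchased i, did not view j
    k21 = len(viewers_of_j) - k11           # viewed j, did not purchase i
    k22 = n_users - k11 - k12 - k21         # neither
    return llr(k11, k12, k21, k22)
```

A sparsifier would then keep only the highest-scoring (i, j) pairs per row; how 
many to keep is a tuning choice not covered here.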
  
[B'A]H_p = R_c = cross-recommendations based on cooccurrences of views with 
purchases. 
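
To make the two products concrete, here is a toy sketch that follows the 
formulas as written, using raw cooccurrence counts rather than log-likelihood 
weighting. The tiny matrices, the helper names, and the shared item ID space 
for views and purchases are all assumptions for the example.

```python
from typing import List

Matrix = List[List[int]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols]
            for row in x]


# Toy data: 3 users x 3 items, same item IDs for purchases and views.
B = [[1, 0, 0],   # purchases
     [1, 1, 0],
     [0, 1, 1]]
A = [[1, 1, 0],   # views
     [1, 1, 1],
     [0, 1, 1]]

H_p = transpose(B)                            # items x users
R_p = matmul(matmul(transpose(B), B), H_p)    # [B'B]H_p
R_c = matmul(matmul(transpose(B), A), H_p)    # [B'A]H_p


def top_n(R: Matrix, user: int, purchased_row: List[int],
          n: int = 10) -> List[int]:
    """Rank item indices for one user, skipping purchased and zero-score items."""
    scored = [(R[i][user], i) for i in range(len(R))
              if not purchased_row[i] and R[i][user] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]
```

Each column of R_p or R_c holds the item scores for one user, so 
top_n(R_p, 0, B[0]) ranks the items user 0 has not yet purchased.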

We also calculate the similarity matrix in both cases and do a linear 
combination of the 4 resulting sets of recommendations.
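
A linear combination of this kind can be sketched as a weighted sum of per-item 
scores followed by a re-sort. The dict-of-scores representation, the function 
name, and the example weights are assumptions for illustration; the real 
weights would have to be learned.

```python
from typing import Dict, List, Sequence


def blend(sources: Sequence[Dict[str, float]],
          weights: Sequence[float]) -> List[str]:
    """Weighted sum of item scores across sources, returned best-first."""
    combined: Dict[str, float] = {}
    for scores, weight in zip(sources, weights):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    # Sort by descending blended score; break ties by item ID for stability.
    return sorted(combined, key=lambda item: (-combined[item], item))


# Hypothetical scores for one user from the four sources.
r_p = {"i1": 0.9, "i2": 0.4}   # history-based, purchases
s_p = {"i2": 0.8, "i3": 0.5}   # item similarity, purchases
r_v = {"i3": 0.3}              # history-based, views with purchases
s_v = {"i1": 0.2}              # item similarity, views with purchases

ranked = blend([r_p, s_p, r_v, s_v], [1.0, 2.0, 0.5, 0.5])
```

An item missing from a source simply contributes nothing from that source.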

It sounds like we're using a different training set than you and only using 
views with cooccurrences. If you measure MAP on a recommender built with views, 
calculating how well it predicts known held-out purchases, and compare that 
with training on purchases, you will almost surely find that views are much 
less predictive; unless you have a huge amount more view data than purchases, 
they will lose the comparison.

I haven't implemented the learned weights yet, but the overhead is not an 
issue. Once you learn the weights you only have to check them once in a great 
while for a given data source. They are constants at recommendation time.
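
Since the weights are learned offline, even a crude search may be enough to get 
started. The sketch below uses a coarse grid search over the weights with 
precision @ k as the objective, which is a simpler stand-in for gradient 
ascent; all names and the data layout are assumptions, and I have not run this 
on real data.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple


def blend(sources: Sequence[Dict[str, float]],
          weights: Sequence[float]) -> List[str]:
    combined: Dict[str, float] = {}
    for scores, weight in zip(sources, weights):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    return sorted(combined, key=lambda item: (-combined[item], item))


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for item in ranked[:k] if item in relevant) / k


def search_weights(per_user_sources: Dict[str, List[Dict[str, float]]],
                   held_out: Dict[str, Set[str]],
                   grid: Sequence[float],
                   k: int) -> Tuple[Tuple[float, ...], float]:
    """Try every weight combination on the grid; the first source stays at 1."""
    users = [u for u in per_user_sources if u in held_out]
    if not users:
        raise ValueError("no users with held-out data")
    n_free = len(per_user_sources[users[0]]) - 1
    best_weights: Tuple[float, ...] = ()
    best_score = -1.0
    for combo in product(grid, repeat=n_free):
        weights = (1.0,) + tuple(combo)
        score = sum(precision_at_k(blend(per_user_sources[u], weights),
                                   held_out[u], k)
                    for u in users) / len(users)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

The grid grows exponentially with the number of weights, so with more than a 
handful of sources something like gradient ascent becomes the better option.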

BTW this can be used for other recommendations by chaining actions that have 
something in common (user IDs?). To make things easier to account for, we use a 
unified user ID and item ID space for the action data, though it is not 
strictly necessary.

Some related docs
Have a look at Sean's notes here: 
http://myrrix.com/example-merging-recommenders/
Ted's preso here: http://www.slideshare.net/tdunning/search-as-recommendation

On Jun 1, 2013, at 12:19 PM, Dominik Hübner <cont...@dhuebner.com> wrote:

Thanks for those detailed responses!
So I assume that the problem of scaling the initial recommendations can be 
implicitly solved by learning the weights for the linear combination. The log 
rank Ted mentioned seems to be useful as well!
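
A minimal sketch of how I understand the log rank idea: replace each raw score 
with a value derived from the item's position in the sorted list. Using the 
negative log of the rank so that larger still means better is my own 
assumption, as is the function name.

```python
from math import log
from typing import Dict


def log_rank_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Map raw scores to -log(rank), with rank 1 for the highest raw score."""
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return {item: -log(rank) for rank, item in enumerate(ranked, start=1)}
```

After this rescaling, lists from different recommenders share the same range 
of values no matter how different their raw scores were.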

What metric do you use to evaluate your recommendations? I have currently set 
up a cooccurrence matrix to predict based on purchases only, a Purchase X View 
matrix, and a view-only matrix.
I am trying to recommend products that lead people to purchases, so I somehow 
need a way to distinguish between types of recommendations and prefer 
"purchase" recommendations. Is there any common approach to evaluating 
recommendations in such settings? Maybe I just didn't get what you meant with: 
"In my case I weight the entire row vector, add the two, resort, and calculate 
precision @ some number of recs." 
Using a percentile rank as shown in the paper for implicit ALS might be useful 
here. The problem, nevertheless, is that most users only have a single purchase.

Furthermore, do you really perform gradient descent learning of the weights 
using hadoop/mahout? Isn't this too costly to perform due to the overheads of 
the JVM and hadoop?



On Jun 1, 2013, at 1:21 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I've got a cross-recommender too. It was originally conceived to do a 
> multi-action ensemble from Ted's notes. I'm now gathering a new data set and 
> building the meta-model learner.
> 
> Even with the same scale you need to learn the weighting factors. Consider a 
> simple ensemble case:
> 
> R_p is the matrix of all recommendations from an item-based recommender using 
> history for any single action (purchase?)
> S_p is the matrix of all similar items for a given item as measured by user 
> actions (purchases?)
> 
> It is highly likely that you can improve your ranking (also agree ratings are 
> seldom useful) by incorporating both sets of data. So R = R_p + aS_p. This is 
> the linear combo Ted mentions. But what is "a"? The scales of the strengths 
> here are very similar but their importance is not. In my case I found that 
> S_p was far more predictive of a future purchase than R_p but without R_p 
> every user sees the same recommendations for a given item. So some combo of 
> the two may prove far stronger in practice than either alone. The combo 
> certainly has produced a better precision score.
> 
> Once you learn "a" things are much simpler but to learn it you need an 
> objective function like improving precision. Then you must find what 
> weighting "a" gives the highest precision. This requires virtually all 
> of R_p and S_p. Since you don't know "a" you don't know how much of the 
> sorted R or S you need to combine. Using the Solr idea you would combine as 
> many results as you can get from the queries trying different weightings and 
> measuring precision each time. In my case I weight the entire row vector, add 
> the two, resort, and calculate precision @ some number of recs.
> 
> Implementing a simple form of gradient ascent seems like a good approach 
> because the full ensemble is going to have more than just "a" and therefore 
> need help converging across multiple dimensions.
> 
> The version of the cross-recommender I have considers item similarities as 
> well as user-history-based recommendations, so we have an ensemble something 
> like:
> 
> R_p + aS_p + bR_v + cS_v = R 
> here R_v and S_v are from the cooccurrences of views with purchases, not just 
> views. 
> 
> Ted's idea below about using the rank or log rank instead of whatever the 
> similarity metric returns as the "strength" seems like a really good one. It 
> evens out the scales and so may converge quicker if nothing else. Pretty easy 
> to implement too.
> 
> 
> Repetition alert! Does anyone have a data set to try this on?
> 
> 
> 
> On May 31, 2013, at 1:15 PM, Koobas <koo...@gmail.com> wrote:
> 
> Ted,
> Thank you very much. This is very insightful.
> The log scaling is definitely an intuitive way of building the meta model.
> Not much disagreement about the uselessness of predicting ratings.
> 
> 
> On Fri, May 31, 2013 at 4:00 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
>> In my case, I put all the indicators from all different sources in the same
>> Solr/Lucene index.  Recommendation consists of making a single query to
>> Solr/Lucene with as much data as I have or want to include.
>> 
>> At the point that this query is done, there are no weights on the
>> indicators ... merely presence or absence in a field or query.  The weights
>> that I typically use are computed on the fly by Lucene's default similarity
>> score and the results tend to be very good.  There is no issue of combining
>> scores on different scales since there is only one composite score.
>> 
>> If you *really* want to build multiple models using different technologies
>> and combine them, you need a so-called meta-model.  There are many ways to
>> build such a beast.  A very simple way is to reduce all scores to quantiles
>> then to a log-odds scale (taking care not to ever estimate a quantile as
>> either 0 or 1).  A linear combination of these rescaled scores can work
>> pretty well although you do have to learn the linear weights.
>> 
>> Sometimes scores vary strongly from query to query.  In such cases,
>> reducing a score to being some kind of rank statistic can be helpful. For
>> instance, you may want to have a score that is the log of the rank that an
>> item appears at in the results list.  You might also be able to normalize
>> scores based on properties of the query. Such rank-based or normalized
>> scores can then often be combined by any meta-model, including the one I
>> mentioned above.
>> 
>> You should also look at the netflix papers, especially the one describing
>> the winning entry for more ideas on model combination.  The major
>> difference there is that they were trying to predict a rating which is a
>> task that I find essentially useless since ranking is so much more
>> important in most real-world applications.  Others may dispute my
>> assessment on this, of course.
>> 
>> There are many ways of building the meta-model that you need, but one
>> over-riding thought that I have is that the deviations from ideal in all
>> real cases will be large enough that theory should not be taken too
>> literally here, but rather should be used as a weak, though still useful,
>> inspirational guide.
>> 
>> 
>> On Fri, May 31, 2013 at 3:18 PM, Koobas <koo...@gmail.com> wrote:
>> 
>>> I am also very interested in the answer to this question.
>>> Just to reiterate, if you use different recommenders, e.g.,
>>> kNN user-based, kNN item-based, ALS, each one produces
>>> recommendations on a different scale. So how do you combine them?
>>> 
>>> 
>>> On Fri, May 31, 2013 at 3:07 PM, Dominik Hübner <cont...@dhuebner.com
>>>> wrote:
>>> 
>>>> Hey,
>>>> I have implemented a cross recommender based on the approach Ted Dunning
>>>> proposed (cannot find the original post, but here is a follow up
>>>> http://www.mail-archive.com/user@mahout.apache.org/msg12983.html).
>>>> Currently I am struggling with the last step of blending the initial
>>>> recommendations.
>>>> 
>>>> My current approach:
>>>> 1. Compute a cooccurrence matrix for each useful combination of
>>>> user-product interaction (e.g. which product views and purchases
>>>> appear in common …)
>>>> 2. Perform initial recommendation based on each matrix and the required
>>>> type of user vector (e.g. a user's history of views OR purchases), like
>>>> the item-based recommender implemented in Mahout
>>>> 
>>>> In step 2, I adapted the AggregateAndRecommendReducer of Mahout, which
>>>> normalizes vectors while building the sum of weighted similarities or in
>>>> this case => cooccurrences.
>>>> 
>>>> Now I end up with multiple recommendations for each product, but all of
>>>> them are on a different scale.
>>>> How can I convert them to have the same scale, in order to be able to
>>>> weight them and build the linear combinations of initial recommendations
>>>> as Ted proposed?
>>>> Would it make sense to normalize user vectors (before multiplying) as
>>>> well?
>>>> 
>>>> Otherwise views would have a much higher influence than purchases due to
>>>> their plain characteristics (they just appear way more frequently). Or is
>>>> this the reason for weighting purchases higher and views lower? If so, I
>>>> think it's sort of inconvenient. Wouldn't it be much more favorable to get
>>>> each type of interaction within the same scale and use the weights just to
>>>> control each type's influence on the final recommendation?
>>>> 
>>>> Thanks in advance for any suggestions!
>>>> 
>>>> 
>>>> 
>>>> Regards
>>>> Dominik
>>>> 
>>> 
>> 
>
