If the general point is that user behavior is an incomplete, indirect, and sometimes erroneous expression of what users actually like, then yes, I agree. Sometimes users don't even know what they like (hence recommenders). That's a meta-point, I think, which is an issue for recommending, or for evaluating the recommender against other user signals. The framework has nothing to say about translating user signals into ratings, no, if that's what you mean.
But part of the question was what to do with whatever imperfect translation to ratings one has, so I take that as a given. I don't know of any special secrets here. I tend to think recommendations are a coarse sort of output; I wouldn't read too much into whether rec 1 or rec 6 was picked; that one was picked at all is what's significant.

RMSE and other such metrics are fine, though they still suffer from the fact that their input (user ratings) is noisy. A crude test you can run in the lab is to see whether users' future behavior agrees with the recommendations. For this you don't need future user data; you can just hold out the most recent n days of data and train on the rest. For situations where you have ratings, the code does support this, running RMSE tests and such.

When you don't have ratings, it will also help you run precision/recall-style tests. This is a little problematic, since you may have really good top-10 recommendations but simply never observe the user interacting with them. Precision and recall will always be really low -- maybe useful as a relative comparison of two implementations, but not much more. Another sort of test, which is not in the code, is to take the held-out n days of real user activity and see how strongly the recommender would have rated each of those items for the user; the higher the better. That too is a useful sort of relative comparison of implementations in the lab.

I think the best measure is the broadest and most direct one. You put in recommendations for a reason: to increase clicks/conversions/engagement over some baseline. Do recommendations improve that metric when shown versus when not shown? A/B testing on clicks/conversions or whatever is the way to go. This is harder, since you have to deploy in the field over some days and measure the difference, but it is perhaps the best way.

On Mon, Dec 27, 2010 at 11:37 PM, Lance Norskog <[email protected]> wrote:
> Different people watch different numbers of movies.
> They also rate some but not all. Their recommendations may be in one
> or a few clusters (other clustering can be genre, which day of the
> week is the rating, on and on) or may be scattered all over genres
> (Harry Potter & British comedy & European soft-core 70's porn).
> Evaluating the worth of user X's ratings is also important. If you
> want to interpret the ratings in an absolute number system, you want
> to map the incoming ratings because they may average at 7.
>
> The code in Mahout doesn't address these issues.
>
> On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> <[email protected]> wrote:
>> Hi,
>>
>> I was wondering how people evaluate the quality of recommendations
>> other than RMSE and such in the eval package.
>> For example, what are some good ways to measure/evaluate the
>> quality of recommendations based on simply observing users' usage
>> of recommendations? Here are 2 ideas.
>>
>> * If you have a mechanism to capture a user's rating of the watched
>> item, that gives you (in)direct feedback about the quality of the
>> recommendation. When evaluating and comparing, you probably also
>> want to take into account the ordinal of the recommended item in
>> the list of recommended items. If a person chooses the 1st
>> recommendation and gives it a score of 10 (best), that's different
>> from a person choosing the 7th recommendation and giving it a score
>> of 10. Likewise if a person chooses the 1st recommendation and
>> gives it a rating of 1.0 (worst) vs. choosing the 10th
>> recommendation and rating it 1.0.
>>
>> * Even if you don't have a mechanism to capture rating feedback
>> from viewers, you can evaluate and compare. You can do that by
>> purely looking at ordinals of items selected from recommendations.
>> If a person chooses something closer to "the top" of the
>> recommendation list, the recommendations can be considered better
>> than if the user chooses something closer to "the bottom". This
>> idea is similar to MRR in search -
>> http://en.wikipedia.org/wiki/Mean_reciprocal_rank .
>>
>> * The above ideas assume recommendations are not shuffled, meaning
>> that their order represents their real recommendation score-based
>> order.
>>
>> I'm wondering:
>> A) if these ways of measuring/evaluating the quality of
>> recommendations are good/bad/flawed
>> B) if there are other, better ways of doing this
>>
>> Thanks,
>> Otis
>> ----
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>
> --
> Lance Norskog
> [email protected]
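P.S. The time-based hold-out test I described (train on older data, check agreement against the most recent n days) is simple to sketch generically. This is illustrative Python, not Mahout's evaluator API; the event log and the popularity-based stand-in "recommender" are made up for the example:

```python
from collections import Counter

# Hypothetical interaction log: (user, item, timestamp).
events = [
    ("u1", "A", 1), ("u1", "B", 2), ("u2", "A", 1), ("u2", "C", 3),
    ("u3", "B", 2), ("u3", "A", 4), ("u1", "C", 9), ("u2", "B", 9),
]

def time_split(events, cutoff):
    """Train on everything before `cutoff`; hold out the rest."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test

def top_n(train, n=2):
    """Stand-in recommender: just the n globally most popular items."""
    counts = Counter(item for _, item, _ in train)
    return [item for item, _ in counts.most_common(n)]

train, test = time_split(events, cutoff=9)
recs = top_n(train)
# Fraction of held-out interactions the recommender "predicted".
hits = sum(1 for _, item, _ in test if item in recs)
precision = hits / len(test)
print(recs, precision)  # ['A', 'B'] 0.5
```

As noted above, expect this kind of precision to be low in absolute terms; treat it as a relative comparison between two implementations, not a meaningful number on its own.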
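The MRR idea from Otis's post is simple enough to state directly. A minimal sketch, where the clicked ordinals are hypothetical:

```python
def mean_reciprocal_rank(picked_ordinals):
    """picked_ordinals: 1-based position of the recommendation each
    user actually clicked."""
    return sum(1.0 / r for r in picked_ordinals) / len(picked_ordinals)

# Three users clicked the 1st, 3rd, and 10th recommendation.
mrr = mean_reciprocal_rank([1, 3, 10])
print(round(mrr, 3))  # (1 + 1/3 + 1/10) / 3
```

Note the caveat from the thread: this only makes sense if the displayed order reflects the recommender's real score-based order, i.e. recommendations are not shuffled.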
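And for the A/B test, the comparison boils down to a difference in click/conversion rates between the two arms. A sketch using a standard two-proportion z-test; all counts here are invented for illustration:

```python
import math

def z_score(c_a, n_a, c_b, n_b):
    """Two-proportion z-test for the CTR difference between arm A
    (clicks c_a out of n_a impressions) and arm B (c_b of n_b)."""
    p_a, p_b = c_a / n_a, c_b / n_b
    p = (c_a + c_b) / (n_a + n_b)                    # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical numbers: arm A = no recs shown, arm B = recs shown.
z = z_score(200, 10000, 260, 10000)
print(round(z, 2))  # |z| > 1.96 ~ significant at the 5% level
```

The hard part isn't the arithmetic; it's running both arms in the field long enough to collect these counts, which is why this is the most expensive but also the most direct measure.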
