2010/3/31 Christoph Hermann <[email protected]>:
> Hello,
>
> in the book Mahout in Action there are a few lines of code to evaluate
> boolean preferences with precision and recall, but I can vary the threshold
> as much as I want; it won't change anything. Only changing "at" (the number
> of recommendations) seems to change the results.

Yes, the threshold will not matter at all for 'boolean' data (as long as
it's >= 1); more on this below.

> My problem is that I get P = R = 0.0033, which I would think is *very*
> low, with relevanceThreshold = 1 and at = 10.

I am not surprised by low precision. It doesn't necessarily mean the
recommender is bad (though it could!). I think a precision-recall test
is somewhat flawed for recommenders: it measures how well the
recommender returns things the user has already seen, which are not
necessarily the best recommendations. That is, a recommender is
penalized in this test if it recommends something the user *would*
like but hasn't rated.
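To make that concrete, here is a toy sketch of precision/recall at N (plain Python, not the Mahout evaluator; the item IDs are made up):

```python
def precision_recall_at(recommended, relevant, at):
    """Precision@at and recall@at for one user.

    recommended: ranked list of recommended item IDs
    relevant: set of held-out item IDs the user actually rated
    """
    top = recommended[:at]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A recommendation the user never happened to rate counts as a miss,
# even if the user would have liked it:
p, r = precision_recall_at(["A", "B", "C"], {"B"}, at=3)
# p = 1/3 even if "A" and "C" would have been liked; r = 1.0
```

So a recommender that surfaces genuinely good but unrated items gets a low score by construction.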


> As far as I can see, GenericRecommenderIRStatsEvaluator gets a list of
> preferences for each user and compares these to some threshold.
> The BooleanPreference always returns 1 as its value, so this doesn't
> make much sense to me in this context, since it would return all the
> elements. One could completely skip this comparison with a boolean data
> model, right?

Yes, it will select all items as relevant, since there is no way to
judge which are liked more than others. It will take the first "at"
items as the set of relevant ones, which are effectively randomly
chosen.
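For instance (an illustrative sketch, not the evaluator's actual code), with boolean preferences every held-out value is 1, so any threshold <= 1 selects everything, and only the truncation to "at" has any effect:

```python
# Hypothetical boolean preference values for one user's held-out items
prefs = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}

def relevant_items(prefs, threshold, at):
    # Keep items whose value meets the threshold, then truncate to 'at'.
    selected = [item for item, value in prefs.items() if value >= threshold]
    return selected[:at]

# Any threshold <= 1 selects every item; which ones survive the
# truncation to 'at' is effectively arbitrary.
full = relevant_items(prefs, threshold=1.0, at=10)   # all four items
truncated = relevant_items(prefs, threshold=1.0, at=2)  # arbitrary pair
```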

You could skip the comparison, sure, but I didn't think it worth
complicating the code here to make that optimization, which wouldn't
do much for overall speed.


> Better would be to have two DataModels which are divided at a certain
> point in time and then we make recommendations based on the older one
> and check if these occur in the newer one, correct?
> This way we would have a way to tell which ones are "good"
> recommendations and which ones are not.

I agree that seems somewhat more coherent: at least the training model
is then a set of preferences that actually existed, together, at some
point. The framework does not carry timing information in general,
though, which is why that approach doesn't appear.

You could modify the code to use this info for your purposes if you like.
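A minimal sketch of that idea (plain Python, not the Mahout framework; the (user, item, timestamp) tuple format is an assumption):

```python
# Each preference is (user, item, timestamp). Split at a cutoff time,
# train on the older half, and check whether recommendations built from
# the training data show up in the newer half.

def split_by_time(prefs, cutoff):
    train = [(u, i) for (u, i, t) in prefs if t < cutoff]
    test = [(u, i) for (u, i, t) in prefs if t >= cutoff]
    return train, test

prefs = [
    ("u1", "A", 100), ("u1", "B", 150),
    ("u1", "C", 200), ("u2", "A", 210),
]
train, test = split_by_time(prefs, cutoff=200)
# train = [("u1", "A"), ("u1", "B")]; test holds the later events.
# A recommender built on `train` would then be scored by how many of
# its top-N items for u1 appear among u1's items in `test`.
```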

But I think it still has the same basic flaw, which may still give you
low and not very useful results: it only judges how well the recommender
recommends the items the user went on to encounter. While those are
probably good recommendations, they're not necessarily the best ones.

As an evaluation, it's still better than nothing, though I think it's
hard to get a meaningful result this way.


Hmm, I should write about this in the chapter more, eh.
