But modeling a user only by his/her low ratings can be problematic, since people are generally (I believe) more precise in their high ratings. Another problem is that recommender algorithms in general first mean-normalize the ratings for each user. Suppose that we have the following ratings of 3 people (A, B, and C) on 5 items.
A's ratings: 1 2 3 4 5
B's ratings: 1 3 5 2 4
C's ratings: 1 2 3 4 5

Suppose that A is the test user. Now if we put only the low ratings of A (1, 2, and 3) into the training set and then mean-normalize the ratings, A will appear more similar to B than to C, which is not true.

________________________________
From: Sean Owen <sro...@gmail.com>
To: Mahout User List <user@mahout.apache.org>; Ahmet Ylmaz <ahmetyilmazefe...@yahoo.com>
Sent: Saturday, February 16, 2013 8:41 PM
Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator

No, this is not a problem. Yes, it builds a model for each user, which takes a long time. It's accurate, but time-consuming. It's meant for small data. You could write your own test that holds out data for all test users at once. That's what I did when I rewrote a lot of this, just because it was more useful to have larger tests.

There are several ways to choose the test data. One common way is by time, but there is no time information here by default. The problem is that, for example, recent ratings may be low -- or at least not high ratings. But the evaluation is of course asking the recommender for items that are predicted to be highly rated. Random selection has the same problem. Choosing by rating at least makes the test coherent. It does bias the training set, but the test set is supposed to be small.

There is no way to actually know, a priori, what the top recommendations are. You have no information with which to evaluate most recommendations. This makes a precision/recall test fairly uninformative in practice. Still, it's better than nothing and commonly understood. Precision/recall won't be high on tests like this, for exactly that reason, but I don't get values that low for the MovieLens data with any normal algorithm. You may, though, if you choose an algorithm or parameters that don't work well.

On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ahmetyilmazefe...@yahoo.com> wrote:
> Hi,
>
> I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> code.
> I think that there are two important problems here.
>
> According to my understanding, the experimental protocol used in this code
> is something like this:
>
> It takes away a certain percentage of users as test users.
> For each test user it builds a training set consisting of the ratings
> given by all other users + the ratings of the test user which are below
> the relevanceThreshold.
> It then builds a model, makes a recommendation to the test user, and
> finds the intersection between this recommendation list and the items
> which are rated above the relevanceThreshold by the test user.
> It then calculates the precision and recall in the usual way.
>
> Problems:
> 1. (mild) It builds a model for every test user, which can take a lot of
> time.
>
> 2. (severe) Only the ratings (of the test user) which are below the
> relevanceThreshold are put into the training set. This means that the
> algorithm only knows the preferences of the test user about the items
> which s/he doesn't like. This is not a good representation of the user's
> ratings.
>
> Moreover, when I ran this evaluator on the MovieLens 1M data, the
> precision and recall turned out to be, respectively,
>
> 0.011534185658699288
> 0.007905982905982885
>
> and the run took about 13 minutes on my Intel Core i3. (I used user-based
> recommendation with k=2.)
>
> Although I know that it is not OK to judge the performance of a
> recommendation algorithm by looking at these absolute precision and
> recall values, these numbers still seem too low to me, which might be the
> result of the second problem I mentioned above.
>
> Am I missing something?
>
> Thanks
> Ahmet
>
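The mean-normalization concern raised at the top of the thread can be checked numerically. The following is a minimal sketch, not Mahout code: the helper names (`centered`, `cosine_on_common`) are my own, and it assumes a Pearson-style similarity, i.e. each user's ratings are centered by his/her own mean and then compared by cosine over co-rated items. On the full data A and C are identical, yet once only A's low ratings (1, 2, 3) enter the training set, A correlates perfectly with B.

```python
# Sketch (not Mahout's implementation): shows how training on only a test
# user's low ratings distorts mean-normalized similarity.

def centered(ratings):
    """Subtract the user's mean over the items he/she actually rated."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: r - mean for item, r in ratings.items()}

def cosine_on_common(u, v):
    """Cosine similarity over co-rated items of two centered rating dicts."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sum(u[i] ** 2 for i in common) ** 0.5
    norm_v = sum(v[i] ** 2 for i in common) ** 0.5
    return dot / (norm_u * norm_v)

# A enters the training set with only the low ratings (items 1-3).
a_train = centered({1: 1, 2: 2, 3: 3})
b = centered({1: 1, 2: 3, 3: 5, 4: 2, 5: 4})
c = centered({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})  # identical to A's full profile

sim_ab = cosine_on_common(a_train, b)  # ~1.0: A looks perfectly aligned with B
sim_ac = cosine_on_common(a_train, c)  # ~0.63: A looks less similar to its twin C
print(sim_ab, sim_ac)
```

With A's full profile, A and C would of course be maximally similar; the truncated training profile flips the ranking, which is the distortion being described.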