No, this is not a problem.

Yes, it builds a model for each test user, which takes a long time. It's
accurate, but time-consuming; it's meant for small data. You could write
your own test that holds out data for all test users at once and builds a
single model. That's what I did when I rewrote a lot of this, simply
because it was more useful to be able to run larger tests.
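
For what it's worth, a minimal sketch of that batch approach (not the
evaluator's actual code) might look like the following. It assumes the
Taste API; the input file name, the 4.0 relevance threshold, and the
neighborhood size of 50 are illustrative choices:

  import java.io.File;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
  import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
  import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
  import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.model.Preference;
  import org.apache.mahout.cf.taste.model.PreferenceArray;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class BatchHoldoutEval {
    public static void main(String[] args) throws Exception {
      DataModel full = new FileDataModel(new File("ratings.csv")); // illustrative path
      double relevanceThreshold = 4.0; // illustrative; choose per data set
      int at = 10;

      // One pass over all users: ratings below the threshold stay in the
      // training data, ratings at or above it become the held-out
      // "relevant" items. Every user is a test user here for simplicity.
      FastByIDMap<PreferenceArray> train = new FastByIDMap<PreferenceArray>();
      Map<Long, Set<Long>> relevant = new HashMap<Long, Set<Long>>();
      LongPrimitiveIterator users = full.getUserIDs();
      while (users.hasNext()) {
        long userID = users.nextLong();
        List<Preference> keep = new ArrayList<Preference>();
        Set<Long> held = new HashSet<Long>();
        for (Preference p : full.getPreferencesFromUser(userID)) {
          if (p.getValue() >= relevanceThreshold) {
            held.add(p.getItemID());
          } else {
            keep.add(p);
          }
        }
        if (!keep.isEmpty()) {
          train.put(userID, new GenericUserPreferenceArray(keep));
          if (!held.isEmpty()) {
            relevant.put(userID, held);
          }
        }
      }

      // Build one model over the whole training split and score everyone.
      DataModel trainModel = new GenericDataModel(train);
      UserSimilarity similarity = new PearsonCorrelationSimilarity(trainModel);
      Recommender rec = new GenericUserBasedRecommender(trainModel,
          new NearestNUserNeighborhood(50, similarity, trainModel), similarity);

      // Micro-averaged precision/recall over all test users.
      double hits = 0.0;
      double recommendedCount = 0.0;
      double relevantCount = 0.0;
      for (Map.Entry<Long, Set<Long>> e : relevant.entrySet()) {
        List<RecommendedItem> recs = rec.recommend(e.getKey(), at);
        for (RecommendedItem item : recs) {
          if (e.getValue().contains(item.getItemID())) {
            hits++;
          }
        }
        recommendedCount += recs.size();
        relevantCount += e.getValue().size();
      }
      System.out.println("precision: " + (hits / recommendedCount));
      System.out.println("recall: " + (hits / relevantCount));
    }
  }

Sampling only a fraction of the users as test users, as the real evaluator
does, is a small change to the loop.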

There are several ways to choose the test data. One common way is by
time, but there is no time information here by default. The problem with
time is that, for example, a user's most recent ratings may be low, or at
least not high, while the evaluation is of course asking the recommender
for items that are predicted to be highly rated. Random selection has the
same problem. Choosing by rating at least makes the test coherent.

It does bias the training set, but the test set is supposed to be small.

There is no way to actually know, a priori, what the top recommendations
are: you have no rating information with which to evaluate most of the
items a recommender might return. This makes a precision/recall test
fairly uninformative in practice. Still, it's better than nothing and
commonly understood.
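
For reference, with the held-out above-threshold items as the ground
truth, the quantities being computed are:

  precision@N = |top-N recommendations ∩ held-out relevant items| / |top-N recommendations|
  recall@N    = |top-N recommendations ∩ held-out relevant items| / |held-out relevant items|

Only held-out items can count as hits, so a recommendation the user would
actually like, but never rated, scores as a miss.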

So while precision/recall won't be high on tests like this for exactly
that reason, I don't get values this low for the MovieLens data with any
normal algorithm. You might, though, if you choose an algorithm or
parameters that don't work well; a user neighborhood of k=2 is very
small, for example.
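
For example, here is a sketch of running the standard evaluator with a
larger neighborhood; the neighborhood size of 50 and the 5% user sample
are illustrative choices, and CHOOSE_THRESHOLD lets the evaluator infer a
per-user relevance threshold:

  import java.io.File;

  import org.apache.mahout.cf.taste.common.TasteException;
  import org.apache.mahout.cf.taste.eval.IRStatistics;
  import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
  import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
  import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class IRStatsExample {
    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("ratings.csv")); // illustrative path
      RecommenderBuilder builder = new RecommenderBuilder() {
        @Override
        public Recommender buildRecommender(DataModel dataModel) throws TasteException {
          UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
          UserNeighborhood neighborhood =
              new NearestNUserNeighborhood(50, similarity, dataModel); // 50, not 2
          return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
        }
      };
      RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
      IRStatistics stats = evaluator.evaluate(
          builder, null, model, null, 10,
          GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, // infer threshold per user
          0.05);                                               // evaluate 5% of users
      System.out.println("precision: " + stats.getPrecision());
      System.out.println("recall: " + stats.getRecall());
    }
  }

Even so, for the reasons above, don't expect these numbers to look large
in absolute terms.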

On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ahmetyilmazefe...@yahoo.com> wrote:

> Hi,
>
> I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> code. I think that there are two important problems here.
>
> According to my understanding the experimental protocol used in this code
> is something like this:
>
> It takes away a certain percentage of users as test users.
> For each test user it builds a training set consisting of ratings given
> by all other users + the ratings of the test user which are below the
> relevanceThreshold.
> It then builds a model, makes recommendations to the test user, and
> finds the intersection between this recommendation list and the items
> which are rated above the relevanceThreshold by the test user.
> It then calculates the precision and recall in the usual way.
>
> Problems:
> 1. (mild) It builds a model for every test user, which can take a lot
> of time.
>
> 2. (severe) Only the ratings (of the test user) which are below the
> relevanceThreshold are put into the training set. This means that the
> algorithm only knows the preferences of the test user for items which
> s/he doesn't like. This is not a good representation of the user's
> ratings.
>
> Moreover, when I ran this evaluator on the MovieLens 1M data, the
> precision and recall turned out to be, respectively,
>
> 0.011534185658699288
> 0.007905982905982885
>
> and the run took about 13 minutes on my Intel Core i3. (I used
> user-based recommendation with k=2.)
>
>
> Although I know that it is not OK to judge the performance of a
> recommendation algorithm by looking at these absolute precision and
> recall values, these numbers still seem too low to me, which might be
> the result of the second problem I mentioned above.
>
> Am I missing something?
>
> Thanks
> Ahmet
>
