[ 
https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158645#comment-13158645
 ] 

Paulo Villegas commented on MAHOUT-898:
---------------------------------------

Hi, thanks for the long reply!

I did test both approaches on a dataset: I took the MovieLens 1M dataset, 
did the usual training/test split and measured the output. 
Interestingly, my proposed modification actually increased the prediction error 
(MAE from 0.78 to 1.01, for one of the runs). But the precision/recall measures 
that I also took give a totally different picture. For instance, Precision@10 
is 0.05% (i.e. negligible) for the original version, while the modified version 
gets a value of 5%. That is two orders of magnitude greater. Recall shows the 
same pattern (0.04% against 1.6% for recall@10). You can really see that in the 
recommendation lists: the modified version usually produces recommendations 
that are much more recognizable and "semantically" related to the training set 
(though as I said, a degree of surprise is also good).
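
For reference, precision@k and recall@k as quoted above can be sketched as 
follows. This is only a minimal illustration of the usual definitions, not the 
evaluation code used for these runs; the class and method names are 
hypothetical and are not part of Mahout.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Mahout.
class TopKMetricsSketch {

    // Counts how many of the first k recommended items are in the held-out test set.
    static int hitsAtK(List<Long> recommended, Set<Long> heldOut, int k) {
        int hits = 0;
        int limit = Math.min(k, recommended.size());
        for (int i = 0; i < limit; i++) {
            if (heldOut.contains(recommended.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    // Fraction of the k recommendation slots filled with held-out items.
    static double precisionAtK(List<Long> recommended, Set<Long> heldOut, int k) {
        return k == 0 ? 0.0 : (double) hitsAtK(recommended, heldOut, k) / k;
    }

    // Fraction of the held-out items that show up in the first k recommendations.
    static double recallAtK(List<Long> recommended, Set<Long> heldOut, int k) {
        return heldOut.isEmpty() ? 0.0 : (double) hitsAtK(recommended, heldOut, k) / heldOut.size();
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(1L, 2L, 3L, 4L);
        Set<Long> heldOut = new HashSet<>(Arrays.asList(2L, 4L, 9L));
        System.out.println("precision@4 = " + precisionAtK(recommended, heldOut, 4));
        System.out.println("recall@4    = " + recallAtK(recommended, heldOut, 4));
    }
}
```

Under these definitions, a recommender whose top of the list is filled with 
items that are not in the test set scores close to zero on both measures, 
whatever its MAE is.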

What I believe is happening is that the original version is pushing items to 
the top (capped to the maximum) based on negative correlations, and they 
completely block the items in the test set from being recommended, hence 
the low precision/recall. The modified version, which works much better in 
this context, increases prediction error because it tends to produce lower 
prediction values; given that the rating statistics are biased towards higher 
values (there are fewer low ratings), this increases the overall error. But 
this is just an unconfirmed guess.
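
To make the cancellation effect concrete, here is a small standalone sketch of 
the weighted-average estimate with both denominators (signed sum versus sum of 
absolute values), following the issue description. It does not use the Mahout 
classes; the class name, the method names, the 1..5 rating range and the 
example similarities are all assumptions chosen for illustration.

```java
// Hypothetical standalone sketch; not the Mahout implementation.
class EstimateSketch {

    // Weighted average of preferences; the denominator is either the signed
    // sum of similarities or the sum of their absolute values.
    static double rawEstimate(double[] sims, double[] prefs, boolean absDenominator) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < sims.length; i++) {
            numerator += sims[i] * prefs[i];
            denominator += absDenominator ? Math.abs(sims[i]) : sims[i];
        }
        return numerator / denominator;
    }

    // Clamps an estimate to the allowed rating range.
    static double cap(double estimate, double min, double max) {
        return Math.max(min, Math.min(max, estimate));
    }

    public static void main(String[] args) {
        // Assumed example: one positively and one negatively correlated item.
        double[] sims = {0.9, -0.8};
        double[] prefs = {4.0, 3.0};

        double signed = rawEstimate(sims, prefs, false);   // denominator close to 0.1
        double absolute = rawEstimate(sims, prefs, true);  // denominator 1.7

        System.out.println("signed sum:   raw=" + signed + " capped=" + cap(signed, 1.0, 5.0));
        System.out.println("absolute sum: raw=" + absolute + " capped=" + cap(absolute, 1.0, 5.0));
    }
}
```

With these assumed inputs the signed denominator nearly cancels, so the raw 
estimate is far above the scale and is capped to the maximum, while the 
absolute-value denominator gives a raw estimate below the scale that is capped 
to the minimum. That matches both effects described above: maximum-valued 
predictions in the original version and lower predictions in the modified one.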

In addition to the cases you mention, one potential caveat in my proposed 
modification is the asymmetry in how the negative correlations affect positive 
and negative ratings. Negative correlations imply the behaviour should be 
reversed, and this works to convert high ratings into low ratings, but not 
vice versa (low ratings get converted into still lower ones). I believe this is 
something that mean-centering or the like could solve (in practical terms, if 
ratings were from -2 to 2, it would work more naturally). This is something I 
intend to do, but the modification to the software is not as straightforward as 
the abs.
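
A rough sketch of what the mean-centering idea could look like, again 
standalone and not based on the Mahout classes. The names are hypothetical, and 
using the scale midpoint (3 on a 1..5 scale) as the centre is an assumption; a 
per-user or per-item mean would be another option.

```java
// Hypothetical standalone sketch; not the Mahout implementation.
class MeanCenteringSketch {

    // Estimate on the raw scale, normalised by the sum of absolute similarities.
    static double uncentered(double[] sims, double[] prefs) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < sims.length; i++) {
            numerator += sims[i] * prefs[i];
            denominator += Math.abs(sims[i]);
        }
        return numerator / denominator;
    }

    // Same estimate, but ratings are shifted by 'center' before weighting and
    // shifted back afterwards, so a negative similarity mirrors a rating
    // around the centre instead of around zero.
    static double centered(double[] sims, double[] prefs, double center) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < sims.length; i++) {
            numerator += sims[i] * (prefs[i] - center);
            denominator += Math.abs(sims[i]);
        }
        return center + numerator / denominator;
    }

    public static void main(String[] args) {
        double[] negative = {-1.0};  // a single perfectly anti-correlated item
        double center = 3.0;         // assumed midpoint of a 1..5 scale

        for (double rating : new double[] {5.0, 1.0}) {
            double[] prefs = {rating};
            System.out.println("rating " + rating
                + ": uncentered=" + uncentered(negative, prefs)
                + " centered=" + centered(negative, prefs, center));
        }
    }
}
```

In the uncentered form both a high and a low rating end up below the bottom of 
the scale, whereas in the centred form a 5 maps to a 1 and a 1 maps to a 5, 
which is the symmetric behaviour a negative correlation suggests.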

Incidentally, we also tried log-likelihood as the similarity metric (and a few 
others); we settled on Pearson because it worked best. This was before I 
measured precision/recall, so I'll probably repeat the experiments now to get 
those metrics with log-likelihood and see what comes up.

                
> Error in formula for preference estimation in GenericItemBasedRecommender
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-898
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-898
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>         Environment: mahout-core
>            Reporter: Paulo Villegas
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.6
>
>         Attachments: GenericItemBasedRecommender.diff
>
>
> The formula to estimate the preference for an item in the Taste item-based 
> recommender normalizes by the sum of similarities for items used in 
> estimation. But the terms in the sum taken to normalize should be in absolute 
> value, since they can be negative (e.g. when using Pearson correlation, 
> similarity is in [-1,1]). Now they are not, and as a result when there are 
> negative and positive values they cancel out, giving a small denominator and 
> incorrectly boosting the preference for the item (symptom: it is easy for a 
> predicted preference to take the maximum value, since the quotient becomes 
> large and it is capped afterwards).
> The patch is rather trivial (a one-liner, actually) for 
> src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java
> Note: the same error and suggested fix apply to GenericUserBasedRecommender.

        

Reply via email to