[
https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158645#comment-13158645
]
Paulo Villegas commented on MAHOUT-898:
---------------------------------------
Hi, thanks for the long reply!
I did test both approaches on a dataset: I took the MovieLens 1M dataset, did
the usual training/test split and measured the output.
Interestingly, my proposed modification actually increased the prediction error
(MAE from 0.78 to 1.01, for one of the runs). But the precision/recall measures
that I also took give a totally different picture. For instance, Precision@10
is 0.05% (i.e. negligible) for the original version, while the modified version
gets a value of 5%. That is two orders of magnitude greater. Recall shows the
same pattern (0.04% against 1.6% for recall@10). You can really see that in the
recommendation lists: the modified version usually produces recommendations
that are much more recognizable and "semantically" related to the training set
(though as I said, a degree of surprise is also good).
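For reference, one common way to compute precision@k and recall@k for a single user is sketched below. This is only an illustration: it assumes that "relevant" means the user's held-out test items, and the function name and item ids are hypothetical. It is not necessarily how the figures above were produced.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item ids, best first.
    relevant: set of held-out test items for that user (an assumption
    about how relevance is defined).
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k penalises lists shorter than k; some definitions
    # divide by len(top_k) instead.
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical item ids, for illustration only.
recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"c", "x", "y", "z"}
precision, recall = precision_recall_at_k(recommended, relevant, k=10)
print(precision, recall)
```

Averaging these per-user values over all test users gives the overall figures.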
What I believe is happening is that the original version pushes items to the
top (capped at the maximum) based on negative correlations, and these items
block the items in the test set from being recommended, hence the low
precision/recall. The modified version, which works much better in this
context, increases prediction error because it tends to produce lower
prediction values; given that the rating statistics are biased towards higher
values (there are fewer low ratings), this increases the overall error. But
this is just an unconfirmed guess.
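A small sketch of the effect described above, written in Python for brevity. It is not Mahout's actual code: the function name, the 1 to 5 rating range and the sample numbers are assumptions chosen for illustration.

```python
def estimate(similarities, ratings, use_abs, min_pref=1.0, max_pref=5.0):
    """Similarity-weighted average of ratings, capped to [min_pref, max_pref].

    use_abs=False normalises by the signed sum of similarities,
    use_abs=True normalises by the sum of their absolute values.
    """
    numerator = sum(s * r for s, r in zip(similarities, ratings))
    if use_abs:
        denominator = sum(abs(s) for s in similarities)
    else:
        denominator = sum(similarities)
    if denominator == 0:
        return None  # no usable neighbours
    raw = numerator / denominator
    return max(min_pref, min(max_pref, raw))


# One positively and one negatively correlated neighbour (made-up values).
sims = [0.9, -0.8]
ratings = [4.0, 2.0]

original = estimate(sims, ratings, use_abs=False)  # signed denominator ~0.1
modified = estimate(sims, ratings, use_abs=True)   # absolute denominator 1.7
print(original, modified)
```

With the signed sum the similarities nearly cancel, the quotient is about 20 and gets capped to the maximum rating. With absolute values the estimate stays near the bottom of the scale, which is consistent with the lower prediction values mentioned above.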
In addition to the cases you mention, one potential caveat in my proposed
modification is the asymmetry in how negative correlations affect positive
(high) and negative (low) ratings. A negative correlation implies the behaviour
should be reversed, and this works to convert high ratings into low ratings,
but not vice versa (low ratings get converted into still lower ones). I believe
this is something that mean-centering or the like could solve (in practical
terms, if ratings ran from -2 to 2, it would work more naturally). This is
something I intend to do, but the modification to the software is not as
straightforward as the abs fix.
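The asymmetry, and how centering might remove it, can be sketched as follows. This is a hypothetical illustration, not the planned patch: it assumes a 1 to 5 scale and centers on the scale midpoint (3.0), which mirrors the "-2 to 2" remark; centering on a per-user or per-item mean would be an alternative.

```python
def cap(value, min_pref=1.0, max_pref=5.0):
    """Clamp an estimate to the allowed rating range."""
    return max(min_pref, min(max_pref, value))


def estimate_abs(similarities, ratings):
    """Weighted average of raw ratings, normalised by absolute similarities."""
    denominator = sum(abs(s) for s in similarities)
    if denominator == 0:
        return None
    return cap(sum(s * r for s, r in zip(similarities, ratings)) / denominator)


def estimate_centered(similarities, ratings, center=3.0):
    """Same weighting, applied to deviations from `center` (an assumed midpoint)."""
    denominator = sum(abs(s) for s in similarities)
    if denominator == 0:
        return None
    offset = sum(s * (r - center) for s, r in zip(similarities, ratings))
    return cap(center + offset / denominator)


# A single perfectly anti-correlated neighbour, rated high and then low.
high_raw = estimate_abs([-1.0], [5.0])            # high becomes low
low_raw = estimate_abs([-1.0], [1.0])             # low stays low
high_centered = estimate_centered([-1.0], [5.0])  # high becomes low
low_centered = estimate_centered([-1.0], [1.0])   # low becomes high
print(high_raw, low_raw, high_centered, low_centered)
```

On raw ratings both cases end up at the minimum, whereas on centered ratings the negative similarity mirrors the rating around the midpoint.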
Incidentally, we also tried log-likelihood as the similarity metric (and a few
other ones); we settled on Pearson because it worked best. This was before I
measured precision/recall, so I'll probably repeat the experiments now to get
those metrics with log-likelihood and see what comes up.
> Error in formula for preference estimation in GenericItemBasedRecommender
> -------------------------------------------------------------------------
>
> Key: MAHOUT-898
> URL: https://issues.apache.org/jira/browse/MAHOUT-898
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Environment: mahout-core
> Reporter: Paulo Villegas
> Assignee: Sean Owen
> Priority: Minor
> Labels: patch
> Fix For: 0.6
>
> Attachments: GenericItemBasedRecommender.diff
>
>
> The formula to estimate the preference for an item in the Taste item-based
> recommender normalizes by the sum of similarities for items used in
> estimation. But the terms in the sum taken to normalize should be in absolute
> value, since they can be negative (e.g. when using Pearson correlation,
> similarity is in [-1,1]). Currently they are not, and as a result, when there
> are negative and positive values, they cancel out, giving a small denominator
> and incorrectly boosting the preference for the item (symptom: it is easy for
> a predicted preference to take the maximum value, since the quotient becomes
> large and it is capped afterwards).
> The patch is rather trivial (a one-liner, actually) for
> src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java
> Note: the same error, and the same suggested fix, apply to GenericUserBasedRecommender.