[ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158101#comment-13158101 ]

Sean Owen commented on MAHOUT-898:
----------------------------------

(Pearson is often mentioned in early literature but it's hardly the fastest or 
generally best metric -- log-likelihood is a better default. But that's a 
separate question.)

Now, of course, if negative similarity rarely comes up, it doesn't matter much 
what we do with it, and it would not be a big deal to change. I think the key 
questions are: what is least surprising, and what is most effective? I tend to 
want to implement the simple, logical "base case" and leave hooks to modify it. 
So what's the simple, logical thing here?

Stick to a 1-5 rating range for simplicity. Say you've rated item A a 4, and it 
has similarity s to item B. (Ignore any other items.) I think it would be 
surprising if the weighted average here were anything but 4, regardless of s -- 
and indeed it is 4 for all positive s. But what if s is negative? Your change 
would mean any such item has an estimated rating of -4 -- or, capped to the 
range, 1. This is a very dissimilar item, so maybe 1 makes more sense as an 
estimate than 4. But it's going to be 1 for any such item regardless of the 
rating, not just the similarity. That somehow feels funny.
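The single-item case above can be sketched numerically. (A hedged illustration 
only -- `estimate` and `cap` are hypothetical helpers, not Mahout's actual 
code, though the arithmetic mirrors the formula under discussion.)

```python
def estimate(prefs, sims, use_abs_denominator):
    """Weighted-average preference estimate over (rating, similarity) pairs.

    use_abs_denominator=False mirrors the current behavior;
    True mirrors the proposed change (normalize by |similarity|).
    """
    num = sum(r * s for r, s in zip(prefs, sims))
    den = sum(abs(s) if use_abs_denominator else s for s in sims)
    return num / den

def cap(x, lo=1.0, hi=5.0):
    """Clamp an estimate into the 1-5 rating range."""
    return max(lo, min(hi, x))

# One item rated 4, similarity s to the target item:
print(estimate([4], [0.5], False))   # 4.0 -- weighted average is 4 for any positive s
print(estimate([4], [-0.5], False))  # 4.0 -- current code: still 4, since s/s = 1
print(estimate([4], [-0.5], True))   # -4.0 -- proposed change; cap() then yields 1.0
```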

Really, a weight represents the strength of a vote for a certain answer in the 
weighted average. A higher weight pushes the answer towards the value it 
weights; a 0 weight does nothing. A negative weight is therefore a vote for the 
answer to be far from the value it weights: a weight of 1 on value X is a vote 
to make the answer exactly X; -1 is a vote to make it infinitely far from X.

So say you have a rating of 3 with similarity -1, and a 4 with similarity 1. 
Right now the implementation would estimate "infinity" and cap it to 5. This 
change would cause it to estimate 0.5 and cap it to 1. 5 seems right-er than 1 
at balancing being "exactly 4" and "extremely far from 3". I *think* you'd find 
other scenarios work out this way. (This assumes you accept the premise of what 
a negative weight should mean.)
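Working that two-item scenario through both formulas (again a hedged sketch, 
not Mahout's implementation):

```python
# Rating 3 with similarity -1, rating 4 with similarity 1.
prefs, sims = [3.0, 4.0], [-1.0, 1.0]
num = sum(r * s for r, s in zip(prefs, sims))       # (-1*3) + (1*4) = 1.0

den_current = sum(sims)                             # -1 + 1 = 0.0
den_proposed = sum(abs(s) for s in sims)            # 1 + 1 = 2.0

cap = lambda x: max(1.0, min(5.0, x))               # clamp into the 1-5 range

# Current behavior: divide by zero -> "infinity", capped to 5.
print(cap(num / den_current if den_current else float('inf')))  # 5.0
# Proposed change: 1 / 2 = 0.5, capped to 1.
print(cap(num / den_proposed))                                  # 1.0
```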


Of course, your point stands that this often leads to behavior that's 
intuitively undesirable. While at the moment I still feel the handling of 
negative weights is as logical as it can be, I think negative weights 
themselves are problematic. I suppose I'd say "don't use them" is the real 
solution. You could modify or wrap Pearson to add 1 to the similarity value 
(accepting that this has its own logic issues, but it may be better in 
practice). Or, probably better, use another metric.
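The "add 1" workaround could look something like this (a hypothetical wrapper 
for illustration only -- the function names are made up, and in Mahout itself 
you'd wrap a similarity implementation rather than a bare function):

```python
def shifted(similarity_fn):
    """Wrap a similarity function so its output is shifted up by 1,
    mapping a [-1, 1] Pearson range into [0, 2] -- no negative weights."""
    def wrapper(item1, item2):
        return similarity_fn(item1, item2) + 1.0
    return wrapper

# Stand-in similarity: pretend these two items anti-correlate.
pearson = lambda a, b: -0.5
sim = shifted(pearson)
print(sim('A', 'B'))   # 0.5 -- a weak positive vote instead of a negative one
```

As noted above, this has its own logic issues: a perfectly anti-correlated 
item now carries weight 0 rather than pushing the estimate away, so it is a 
pragmatic trade-off rather than a principled fix.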


Did you happen to test both ways to see whether one consistently makes better 
recommendations on a data set? Not suggesting you need to -- just curious. It 
would be an interesting empirical test.


I think it's an interesting question and good to open it up again. I liked 
Tom/Tamas's logic from last time, which I've remembered and cribbed above. 
What do you think?
                
> Error in formula for preference estimation in GenericItemBasedRecommender
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-898
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-898
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>         Environment: mahout-core
>            Reporter: Paulo Villegas
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.6
>
>         Attachments: GenericItemBasedRecommender.diff
>
>
> The formula to estimate the preference for an item in the Taste item-based 
> recommender normalizes by the sum of similarities for items used in 
> estimation. But the terms in the sum taken to normalize should be in absolute 
> value, since they can be negative (e.g. when using Pearson correlation, 
> similarity is in [-1,1]). Now they are not, and as a result when there are 
> negative and positive values they cancel out, giving a small denominator and 
> incorrectly boosting the preference for the item (symptom: it is easy for a 
> predicted preference to take the maximum value, since the quotient becomes 
> large and it is capped afterwards)
> The patch is rather trivial (a one-liner, actually) for 
> src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java
> Note: the same error & suggested fix happens in GenericUserBasedRecommender

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
