[ 
https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-430.
------------------------------

      Assignee: Sean Owen
    Resolution: Not A Problem

But once you fall off the end of one, the remaining data points in the other 
array are irrelevant, since they have no counterpart in the one that ended. 
These metrics are only computed over the item/users that are common to both.

You can write a test to double-check but after looking at it again I'm certain 
it's correct.

It's probably the fastest implementation possible, and because of that, not the 
shortest. I don't think it's too messy, but have a crack at improving it while 
not comprising speed if you like. This is the best I could do after years of 
looking at this method.

The duplication is unfortunate but hard to refactor because they're doing 
slightly different things actually. Again open to suggestions but I have stared 
at this for a long time and, given the priority for speed, not been able to do 
better.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in 
> AbstractSimilarity, both compute metrics over each User's/Tool's 
> PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through 
> each PreferenceArray in a single loop, comparing indexes to make sure we 
> don't fall off the end. Eventually, we get to the end of an array, which is 
> caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. 
> Specifically, for the other array, the one that we *didn't* fall off the end 
> of, the metrics don't reflect the preferences we have not yet visited. In the 
> example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to 
> do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably 
> sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to