AbstractSimilarity improperly computes vector metrics
-----------------------------------------------------

                 Key: MAHOUT-430
                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
    Affects Versions: 0.4
            Reporter: Emerson Murphy-HIll


Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, 
both compute metrics over each User's/Tool's PreferenceArrays, metrics like 
'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single 
loop, comparing indexes to make sure we don't fall off the end. Eventually, we 
get to the end of an array, which is caught here:

if (compare <= 0) {
  if (++xPrefIndex >= xLength) {
    break;
  }
...

The problem is, the metrics may not be correct when the break occurs. 
Specifically, for the other array, the one that we *didn't* fall off the end 
of, the metrics don't reflect the preferences we have not yet visited. In the 
example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do 
something like this:

if (compare <= 0) {
  if (++xPrefIndex >= xLength) {
    sumY2 += squareSumRest(yPrefs,yPrefIndex);
    break;
  }
...

private double squareSumRest(Preference[] preferences, int startingFrom) {
  double squareSum = 0;
  for(int i = startingFrom; i < preferences.length; i++){
    double val = preferences[i].getValue();
    squareSum += val*val;
  }
  return squareSum;
}

I believe that the problem affects the sumX and sumY variables (and probably 
sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.

A couple of comments about these two methods:
1) They're really hard to reason about. Isn't there a simpler implementation?
2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to