[ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883130#action_12883130 ]
Emerson Murphy-Hill commented on MAHOUT-430:
--------------------------------------------
You are right that the remaining data points are irrelevant for sumXY. But I
believe that they *are* relevant for sumX and sumY, because these values
should be computed over the entirety of both feature arrays.
This may not be the case for some similarity metrics, but it is for
UncenteredCosineSimilarity, which is what I'm working with. For this metric,
the full sumX and sumY are necessary, for instance, for computing the magnitude
of each preference array (the ||A|| and ||B|| in
http://en.wikipedia.org/wiki/Cosine_similarity).
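
To make the point concrete, here is a minimal sketch of the textbook cosine formula over two sparse vectors. It is not Mahout's code: the class name UncenteredCosineSketch, the cosine method, and the Map-based representation are all made up for illustration. Note that ||A|| and ||B|| are built from sums of squares (what the issue description calls sumX2 and sumY2), accumulated over every entry of each vector, while the dot product only picks up items present in both.

```java
import java.util.Map;

// Hypothetical illustration only; none of these names come from Mahout.
final class UncenteredCosineSketch {

    /**
     * Textbook cosine similarity between two sparse vectors keyed by item ID.
     * The dot product uses only items present in both maps; each norm uses
     * every entry of its own map.
     */
    static double cosine(Map<Long, Double> x, Map<Long, Double> y) {
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;

        for (Map.Entry<Long, Double> entry : x.entrySet()) {
            double xValue = entry.getValue();
            sumX2 += xValue * xValue;            // every x entry counts toward ||A||
            Double yValue = y.get(entry.getKey());
            if (yValue != null) {
                sumXY += xValue * yValue;        // only shared items count here
            }
        }
        for (double yValue : y.values()) {
            sumY2 += yValue * yValue;            // every y entry counts toward ||B||
        }

        if (sumX2 == 0.0 || sumY2 == 0.0) {
            return Double.NaN;                   // undefined for a zero vector
        }
        return sumXY / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
    }
}
```

For example, with x = {1: 3.0, 2: 4.0} and y = {1: 3.0}, this gives 9 / (5 * 3) = 0.6. If the norms were restricted to the single shared item, the result would be 9 / (3 * 3) = 1.0 instead.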
> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
> Key: MAHOUT-430
> URL: https://issues.apache.org/jira/browse/MAHOUT-430
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Emerson Murphy-Hill
> Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in
> AbstractSimilarity, both compute metrics over each User's/Tool's
> PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through
> each PreferenceArray in a single loop, comparing indexes to make sure we
> don't fall off the end. Eventually, we get to the end of an array, which is
> caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
>   ...
> The problem is, the metrics may not be correct when the break occurs.
> Specifically, for the other array, the one that we *didn't* fall off the end
> of, the metrics don't reflect the preferences we have not yet visited. In the
> example above, if yPrefIndex < yLength, then sumY2 is too low. One fix is to
> do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs, yPrefIndex);
>     break;
>   }
>   ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for (int i = startingFrom; i < preferences.length; i++) {
>     double val = preferences[i].getValue();
>     squareSum += val * val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably
> sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?
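
The fix proposed in the description can be tried out in isolation with a sketch like the one below. It is a simplified stand-in, not the AbstractSimilarity code: the class MergeTailSketch, the sums method, and the parallel-array representation (item IDs in ascending order plus matching values) are assumptions made for the example. Only squareSumRest mirrors the helper from the description, adapted to take a double[] rather than Preference[].

```java
// Hypothetical illustration only; this is not the AbstractSimilarity code.
final class MergeTailSketch {

    /**
     * Walks two sparse vectors (IDs ascending, values in parallel arrays) in a
     * single merge loop and returns {sumXY, sumX2, sumY2}. Entries left over
     * after one array runs out are folded in afterwards.
     */
    static double[] sums(long[] xIds, double[] xVals, long[] yIds, double[] yVals) {
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        int xIndex = 0;
        int yIndex = 0;

        while (xIndex < xIds.length && yIndex < yIds.length) {
            if (xIds[xIndex] == yIds[yIndex]) {
                // Shared item: contributes to the dot product and both norms.
                sumXY += xVals[xIndex] * yVals[yIndex];
                sumX2 += xVals[xIndex] * xVals[xIndex];
                sumY2 += yVals[yIndex] * yVals[yIndex];
                xIndex++;
                yIndex++;
            } else if (xIds[xIndex] < yIds[yIndex]) {
                sumX2 += xVals[xIndex] * xVals[xIndex];
                xIndex++;
            } else {
                sumY2 += yVals[yIndex] * yVals[yIndex];
                yIndex++;
            }
        }

        // At most one of these tails is non-empty; without this step the
        // unvisited entries would be missing from the sums of squares.
        sumX2 += squareSumRest(xVals, xIndex);
        sumY2 += squareSumRest(yVals, yIndex);
        return new double[] {sumXY, sumX2, sumY2};
    }

    /** Sum of squares of values[startingFrom..end]. */
    static double squareSumRest(double[] values, int startingFrom) {
        double squareSum = 0.0;
        for (int i = startingFrom; i < values.length; i++) {
            squareSum += values[i] * values[i];
        }
        return squareSum;
    }
}
```

With x = IDs {1, 2}, values {3, 4} and y = IDs {2, 5, 7}, values {1, 2, 2}, the loop ends as soon as x is exhausted, having seen only the first y entry. The tail step then adds 2*2 + 2*2 to sumY2, giving {4, 25, 9}; without it, sumY2 would stay at 1.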