[ https://issues.apache.org/jira/browse/MAHOUT-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132507#comment-13132507 ]
Sean Owen commented on MAHOUT-847:
----------------------------------

The problem with caching sqrt(n) is that every pair of users / vectors may overlap in a different number of dimensions. It's not fixed (or if it were, I wouldn't be correcting for it as it wouldn't change ordering).

Well, the real Euclidean distance is in there alright. It is not a similarity metric by itself (it gets bigger with less similarity). So there's a transformation to a similarity value, and that's the made-up bit I'm changing here. I think "EuclideanDistance"+"Similarity" is therefore accurate.

> Improve Euclidean distance similarity calculation
> -------------------------------------------------
>
>                 Key: MAHOUT-847
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-847
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: distance, euclidean, similarity, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-847.patch
>
>
> In the non-distributed recommender world, the Euclidean distance similarity
> is calculated as n/(1+d), where d is distance and n is dimension. 1/(1+d) is
> a valid mapping from distance [0,infinity) to similarity (0,1]. n is there to
> "correct" for the fact that things are farther apart in higher dimensions. It
> would be right-er, after some discussion, to use a factor of sqrt(n), and
> apply directly to the distance; 1/(1+d/sqrt(n)).
> I propose fixing the calculation accordingly.
> In the distributed similarity, the formula is 1-1/(1+d), which is the wrong
> way around. That will be fixed. I'd apply the same heuristic, except that at
> the moment we don't have access to the value of n at that point. I don't like
> the inconsistency but it's minor; would rather get this change in now, which
> definitely improves things.
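
For illustration only, the three mappings discussed in this issue can be sketched as below. This is not the code from MAHOUT-847.patch and not Mahout's API: the class name EuclideanSimilaritySketch, its method names, and the use of a Map from item ID to preference value are all assumptions made for the example. The similarity() method also shows why sqrt(n) cannot be cached: n is the number of dimensions in which a particular pair overlaps, so it has to be counted per pair.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and data layout are assumptions, not Mahout code.
final class EuclideanSimilaritySketch {

    // Mapping described as the current non-distributed one: n / (1 + d).
    static double oldSimilarity(double d, int n) {
        return n / (1.0 + d);
    }

    // Proposed mapping: scale the distance by sqrt(n), then map to (0,1].
    static double proposedSimilarity(double d, int n) {
        return 1.0 / (1.0 + d / Math.sqrt(n));
    }

    // Distributed formula as described in the issue: 1 - 1/(1 + d).
    // It grows with distance, which is the "wrong way around" for a similarity.
    static double distributedAsDescribed(double d) {
        return 1.0 - 1.0 / (1.0 + d);
    }

    // Euclidean distance over the dimensions both vectors share, mapped with
    // the proposed formula. n is counted per pair, so it differs between pairs.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sumOfSquares = 0.0;
        int n = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                double diff = entry.getValue() - other;
                sumOfSquares += diff * diff;
                n++;
            }
        }
        if (n == 0) {
            return Double.NaN; // no overlap, similarity is undefined here
        }
        return proposedSimilarity(Math.sqrt(sumOfSquares), n);
    }

    public static void main(String[] args) {
        Map<Long, Double> userA = new HashMap<>();
        userA.put(1L, 1.0);
        userA.put(2L, 5.0);
        userA.put(3L, 2.0);
        Map<Long, Double> userB = new HashMap<>();
        userB.put(1L, 4.0);
        userB.put(2L, 1.0);
        userB.put(9L, 3.0);
        // Only items 1 and 2 overlap, so n = 2 for this pair.
        System.out.println(similarity(userA, userB));
    }
}
```

With this sketch, identical vectors (d = 0) map to a similarity of 1 under the proposed formula regardless of n, while the distributed formula as described maps them to 0.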