The entire reference to similarity harks back to the original formulation of the MovieLens and Firefox recommenders, which looked for similarity of rating patterns. That made some sense then, but it becomes a bit of a tortured turn of phrase when other formulations of recommendation are used.
There are currently two general approaches that seem to generate reasonable recommendation results in practice: LLR-based sparsification of cooccurrence and cross-occurrence matrices, and matrix completion techniques, typically implemented as some form of factorization. The enormous number of options that Mahout's map-reduce recommender implements has little practical utility; it is more an artifact of a desire to implement most of the research algorithms in a single framework.

The concept of distance can be useful in matrix factorization, since it allows efficient algorithms to be derived. But with the sparsification problem, the concepts of similarity and distance break down, because with cooccurrence we don't just have two answers. Instead, we have three: anomalous cooccurrence, non-anomalous cooccurrence, and insufficient data. For the purposes of sparsification, we lump non-anomalous cooccurrence and insufficient data together, but this lumping has the side effect that the score we get is not a useful measure of association, distance, or similarity. Instead, we just record that anomalously cooccurrent pairs are anomalous (a binary decision) and leave the weighting of them until later.

If you are strict about treating cooccurrence measures as distances, you end up with measures of the strength of association. These measures will separate anomalous cooccurrence from non-anomalous cooccurrence, but they will smear the insufficient-data cases into both options. Since most pairs have insufficient data, this is a relatively disastrous thing to do, causing massive numbers of false positives that swamp the valid pairs. The virtue of LLR is that it does not do this, but there is a corollary vice in that the resulting score is not useful as a distance.

Regarding the question about similarities and distances being used essentially synonymously, this is relatively common because of the fairly strict anti-correlation between them.
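To make the three-way distinction concrete, here is a minimal sketch in Python of the raw-count LLR (G^2) test for a single item pair. It follows the entropy formulation used in Mahout's LogLikelihood class, but the names and the 3.84 threshold below are illustrative choices of mine, not Mahout's API:

```python
from math import log

def x_log_x(k):
    # Raw-count building block; 0 * log(0) is taken as 0.
    return k * log(k) if k > 0 else 0.0

def entropy(*counts):
    # Unnormalized "entropy" as in Mahout's LogLikelihood:
    # xLogX(total) - sum(xLogX(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 (log-likelihood ratio) for a 2x2 cooccurrence table.

    k11 = A and B together, k12 = A without B,
    k21 = B without A,      k22 = neither.
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values from floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

# The three kinds of pairs described above:
strong = llr(100, 1000, 1000, 100000)  # anomalous cooccurrence: large score
indep  = llr(10, 90, 90, 810)          # exactly independent: score ~ 0
sparse = llr(1, 20, 20, 1000)          # one cooccurrence: too little data

# Sparsification keeps only the anomalous pairs as a binary indicator;
# 3.84 (chi-squared 95% cutoff, 1 dof) is one illustrative threshold.
THRESHOLD = 3.84
kept = [score > THRESHOLD for score in (strong, indep, sparse)]
# kept == [True, False, False]
```

Note how the independent pair and the sparse pair both land below the threshold: the score separates anomalous pairs from everything else but deliberately lumps the other two cases together, which is exactly why it records only the binary indicator and leaves weighting until later.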
Yes, there is a sign change, but they still represent basically the same thing. Elevation and depth are similarly very closely related: somebody might refer to the elevation of an underwater mountain range above its base or to its depth below the surface. These expressions refer to the same z-axis measurement with different origins.

On Wed, Aug 6, 2014 at 5:21 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> So, compared to original paper [1], similarity is now hardcoded and always
> LLR? Do we have any plans to parameterize that further? Is there any reason
> to parameterize it?
>
> Also, reading the paper, i am a bit wondering -- similarity and distance
> are functions that usually are moving into different directions (i.e.
> cosine similarity and angular distance) but in the paper distance scores
> are also considered similarities? How's that?
>
> I suppose in that context LLR is considered a distance (higher scores mean
> more `distant` items, co-occurring by chance only)?
>
> [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf
>
> -d