The entire reference to similarity harks back to the original formulation
of the MovieLens and Firefox recommenders which looked for similarity of
rating patterns.  That made some sense then, but it is a bit of a tortured
turn of phrase when other formulations of recommendation are used.

There are currently two general approaches that seem to produce
reasonable recommendation results in practice: LLR-based sparsification of
cooccurrence and cross-occurrence matrices, and matrix completion techniques,
typically implemented as some form of factorization.  The enormous number
of options that Mahout's map-reduce recommender implements has little
practical utility and is more an artifact of the desire to implement most
of the research algorithms in a single framework.

The concept of distance can be useful in matrix factorization since it
allows efficient algorithms to be derived.  But in the sparsification
setting, the concepts of similarity and distance break down because with
cooccurrence we don't just have two answers.  Instead, we have three:
anomalous cooccurrence, non-anomalous cooccurrence and insufficient data.
For the purposes of sparsification, we lump non-anomalous cooccurrence and
insufficient data together, but this lumping has the side effect that the
score that we get is not a useful measure of association, distance or
similarity.  Instead, we simply record that anomalously cooccurring pairs
are anomalous (a binary decision) and leave the weighting of them until
later.
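To make that binary decision concrete, here is a quick sketch in Python
(not Mahout's actual implementation, which is in Java, and the threshold
value here is purely illustrative, not a recommendation):

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 cooccurrence contingency table.

    k11 = times both events occurred together
    k12 = times event A occurred without B
    k21 = times event B occurred without A
    k22 = times neither occurred
    """
    def entropy(*counts):
        n = sum(counts)
        return -sum(c / n * math.log(c / n) for c in counts if c > 0)

    n = k11 + k12 + k21 + k22
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # 2 * N * mutual information; a large value means the observed
    # cooccurrence count is anomalous under an independence assumption.
    return 2 * n * (row_entropy + col_entropy - mat_entropy)


def is_anomalous(k11, k12, k21, k22, threshold=10.0):
    # The binary sparsification decision: keep the pair or drop it.
    return llr(k11, k12, k21, k22) > threshold
```

Independent events score near zero and get dropped; anomalously frequent
cooccurrences score high and survive into the sparsified matrix.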

If you insist on treating cooccurrence measures as distances, you end up
with measures of the strength of association.  These measures will
separate anomalous cooccurrence from non-anomalous cooccurrence, but they
will smear the insufficient-data cases into both categories.  Since most
pairs have insufficient data, this is a relatively disastrous thing to do,
causing massive numbers of false positives that swamp the valid pairs.  The
virtue of LLR is that it does not do this, but there is a corollary vice in
that the resulting score is not useful as a distance.
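A tiny illustration of that smearing, with contrived counts (Jaccard
stands in here for any strength-of-association measure): two items seen
only once each, together, get a perfect Jaccard score, while LLR does not
even clear the conventional chi-squared cutoff of about 3.84.

```python
import math


def entropy(*counts):
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


def llr(k11, k12, k21, k22):
    # 2 * N * mutual information of the 2x2 contingency table
    n = k11 + k12 + k21 + k22
    return 2 * n * (entropy(k11 + k12, k21 + k22)
                    + entropy(k11 + k21, k12 + k22)
                    - entropy(k11, k12, k21, k22))


def jaccard(k11, k12, k21):
    # |A intersect B| / |A union B| over the occurrence sets
    return k11 / (k11 + k12 + k21)


# One cooccurrence of two items seen once each, in a corpus of 3 events:
# Jaccard calls this a perfect match; LLR says "insufficient data".
print(jaccard(1, 0, 0))        # 1.0
print(llr(1, 0, 0, 2))         # about 3.8, below the ~3.84 cutoff
print(llr(100, 0, 0, 200))     # same proportions, 100x the data: clearly anomalous
```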


Regarding the question about similarities and distances being used
essentially synonymously: this is relatively common because of the fairly
strict anti-correlation between them.  Yes, there is a sign change, but
they still represent basically the same thing.  Elevation and depth
are similarly very closely related, and somebody might refer to the
elevation of an underwater mountain range above its base or to its depth
below the surface.  These expressions refer to the same z-axis
measurement with different origins.
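The cosine/angular pair from the question shows that anti-correlation
directly: as cosine similarity rises, angular distance falls, but both
rank pairs identically.  A quick sketch (not library code):

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def angular_distance(a, b):
    # Same measurement, opposite direction: arccos of the similarity.
    return math.acos(max(-1.0, min(1.0, cosine_similarity(a, b))))


a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
# b is more similar to a than c is, so its angular distance is smaller.
print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
print(angular_distance(a, b) < angular_distance(a, c))    # True
```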




On Wed, Aug 6, 2014 at 5:21 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> So, compared to original paper [1], similarity is now hardcoded and always
> LLR? Do we have any plans to parameterize that further? Is there any reason
> to parameterize it?
>
>
> Also, reading the paper, i am a bit wondering -- similarity and distance
> are functions that usually are moving into different directions (i.e.
> cosine similarity and angular distance) but in the paper distance scores
> are also considered similarities? How's that?
>
> I suppose in that context LLR is considered a distance (higher scores mean
> more `distant` items, co-occurring by chance only)?
>
> [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf
>
> -d
>
