Taking your comments out of order.

On Tue, Sep 8, 2009 at 2:41 PM, Sean Owen <[email protected]> wrote:

>  And yeah my world view is not centered around big off-line
> computations. In that world, yes I bet this is a good way to compute
> all recommendations at once. I still maintain that interesting use
> cases involve roughly real-time updates and recommendations.
>

This distinction can be described in the notation I use by just moving
parentheses.  The big off-line version of recommendation is (A' A) h while
the more on-line version is A' (A h).  The overall shape of the computation
is the same, but the first form allows the majority of the computation to be
done off-line.
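
To make that concrete, here is a toy sketch in Python/numpy (sizes and
names made up, dense matrices purely for illustration) of how the two
parenthesizations do the same arithmetic while splitting the work very
differently:

    import numpy as np

    np.random.seed(42)

    # Toy user x item interaction matrix A and one user's history vector h.
    A = (np.random.rand(1000, 50) < 0.05).astype(float)   # 1000 users, 50 items
    h = (np.random.rand(50) < 0.1).astype(float)          # one user's history

    # Off-line form (A'A) h: the item-item cooccurrence matrix A'A is computed
    # once, ahead of time, and shared by every user's requests.
    cooccurrence = np.dot(A.T, A)
    scores_offline = np.dot(cooccurrence, h)

    # On-line form A' (A h): the same arithmetic, but both products happen at
    # request time against the raw interaction data.
    scores_online = np.dot(A.T, np.dot(A, h))

    assert np.allclose(scores_offline, scores_online)

The per-request work in the first form is just the final product against
the precomputed matrix.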

My focus in the past has always been to produce recs really fast (sometimes
hundreds per second).  As is generally the case, doing as much of the computation as possible
ahead of time is a great way to get fast response.


> I think reducing it to a matrix multiplication problem is tidy, and
> works I am sure.
>

This approach has been good for me for the last decade of building rec
engines.

> I don't know if, practically, it affords the tweaks
> and hooks and modifications that let some of these stock algos
> actually produce good results. It can probably be made to work too.
>

As I tried to say in my original comment, I am only describing the shape or
pattern for the computation.  There are lots of tweaks that can be added
without changing this shape.  Also, there are lots of ways of implementing
the general pattern.  One of my favorites is to use a text retrieval engine
for the matrix-vector multiplication part and a map-reduce engine for the
matrix-matrix part.
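
Roughly, the division of labor looks like this (plain Python standing in
for the retrieval engine; in practice the index would be something like
Lucene with one document per item, built from the sparsified rows of A'A
that the map-reduce job produces):

    from collections import defaultdict

    # Stand-in for the text index: each item maps to its (sparsified) row of
    # A'A as a list of (related_item, weight) "terms".  These rows are the
    # output of the off-line matrix-matrix job.
    item_index = {
        "item_a": [("item_b", 2.0), ("item_c", 0.5)],
        "item_b": [("item_a", 2.0), ("item_d", 1.0)],
        "item_c": [("item_a", 0.5)],
    }

    def recommend(history, index, top_n=10):
        # The matrix-vector product (A'A) h becomes a query: sum the indexed
        # rows for the items in the user's history.
        scores = defaultdict(float)
        for item in history:
            for related, weight in index.get(item, []):
                scores[related] += weight
        ranked = [(i, s) for i, s in scores.items() if i not in history]
        return sorted(ranked, key=lambda x: -x[1])[:top_n]

    print(recommend({"item_a"}, item_index))   # [('item_b', 2.0), ('item_c', 0.5)]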

I tend to view requirements for lots of hooks and modifications as an
indication that the underlying algorithms are not phrased well.  That isn't
always true, but it often is.  Methods that depend on normal distribution
assumptions often have pathological performance for small counts.  This
leads to lots of "if (n > 5)" sorts of hacks to avoid problematic cases.
These methods include most chi-squared techniques, correlation-based
recommendation engines and LSA.  The right answer is to
avoid the original mistake and use log-likelihood ratio tests, probabilistic
cooccurrence measures and LDA respectively.
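
For reference, the log-likelihood ratio (G^2) score for a 2x2 cooccurrence
table is only a few lines of code and behaves sensibly even at tiny counts.
A minimal Python version of the usual entropy formulation:

    from math import log

    def x_log_x(x):
        return x * log(x) if x > 0 else 0.0

    def entropy(*counts):
        return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

    def llr(k11, k12, k21, k22):
        # k11 = users with both items, k12 = item A only,
        # k21 = item B only, k22 = neither.
        row = entropy(k11 + k12, k21 + k22)
        col = entropy(k11 + k21, k12 + k22)
        mat = entropy(k11, k12, k21, k22)
        return max(0.0, 2.0 * (row + col - mat))

    print(llr(1, 99, 99, 9801))    # cooccurrence exactly at chance level -> ~0
    print(llr(10, 90, 90, 9810))   # well above chance -> roughly 30

No "if (n > 5)" guard is needed anywhere.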

As another example, in text retrieval, stop words are often used because
results for many retrieval engines are sooo bad without them.  Really,
though, this indicates a failure of the scoring algorithms, the phrase
recognition algorithms and possibly the index format.  All of these can be
fixed.  Whether it is worth fixing is another question, but the need for
stop lists is still a valid sign that the fundamentals are somewhat broken.

> My gut is that it is not the most efficient way to do it -- if only
> because it definitely is not efficient to compute this to produce
> *one* set of recs, or, to answer many of the questions one asks of a
> recommender, which is not just to produce recs.
>

Efficiency is a tricky thing here.  Because the off-line and on-line forms
produce the same result, and because the majority of the work is in the
off-line part, it is a definite win for response time to do that work off-line.

But there is much more to it than that.  The off-line form of the
computation can be done using very clean, very fast sequential accesses to
the data.  The on-line form requires lots of random accesses to data.  That
difference makes the computation whalingly faster to do off-line.  In my
experience, "whalingly" can be three orders of magnitude, which is a huge
factor.

Even more important, however, is that there are lots of streamlinings that
can be done if you have global information, and these can save many more
orders of magnitude in speed.  For instance, suppose that you have some item
that 50 million people have interacted with.  Do you think that you will
have any noticeable accuracy loss if you consider only 1 million of those
interactions?  Or even if you only consider a few thousand of them?  The
correct answer is that you can randomly downsample to about a thousand
interactions with no perceptible difference in accuracy.  This has a huge
impact because the cost of computing the cooccurrence counts between two
items goes up with the product of the prevalence of each item.  If you
decrease each by a factor of a thousand, you have decreased the necessary
computation by a million.  Unfortunately, the overall prevalence is a global
property that is much easier to deal with off-line.
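
A sketch of that per-item downsampling (names and the exact cap are
illustrative; a step like this belongs in the off-line Pig/Hadoop part of
the pipeline, since it needs the global per-item counts):

    import random
    from collections import defaultdict

    MAX_PER_ITEM = 1000   # illustrative cap; roughly "about a thousand"

    def downsample(interactions, max_per_item=MAX_PER_ITEM, seed=0):
        # Reservoir-sample each item's interactions down to max_per_item.
        # Since cooccurrence counting cost grows with the product of two
        # items' prevalences, capping both sides cuts the work enormously.
        rng = random.Random(seed)
        kept = defaultdict(list)
        seen = defaultdict(int)
        for user, item in interactions:          # stream of (user, item) pairs
            seen[item] += 1
            if len(kept[item]) < max_per_item:
                kept[item].append((user, item))
            else:
                j = rng.randrange(seen[item])    # classic reservoir sampling
                if j < max_per_item:
                    kept[item][j] = (user, item)
        return [pair for pairs in kept.values() for pair in pairs]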

Similarly, sparsification by statistical means is trivial off-line and
difficult on-line.  This can, again, save you orders of magnitude in cost at
final recommendation time.
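
As a hedged illustration of that sparsification, reusing the llr() sketch
above and a made-up threshold: keep only the entries of an item's
cooccurrence row that clear the score.

    def sparsify_row(row_counts, item_count, other_counts, total_users,
                     threshold=10.0):
        # row_counts:   {other_item: cooccurrence count with this item}
        # item_count:   how many users touched this item
        # other_counts: {other_item: how many users touched it}
        kept = {}
        for other, k11 in row_counts.items():
            k12 = item_count - k11                # this item without the other
            k21 = other_counts[other] - k11       # the other without this item
            k22 = total_users - k11 - k12 - k21   # neither
            if llr(k11, k12, k21, k22) > threshold:
                kept[other] = k11
        return kept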

As a concrete example of just how massive these speedups are, I worked on
one system that had about 100-200 million interactions per day with about 10
million items.  We used 7 months of history to generate recommendations.
That works out to about 7 x 30 x 100-200 M = 20-40 billion observations.  Using
the techniques I have described, we were able to build a system that did well
over 100 recommendations per second.  The off-line computation could be done
relatively easily in about 10 hours and would take less with more recent
versions of Pig and Hadoop.  Total hardware included a tiny Hadoop cluster
with 10-20 cores and about 40 GB of RAM for the off-line portion and a
few real-time servers.  With more modern machines, this entire operation
could have been done with 2-4 1U servers.  The recommendation quality was
excellent.  The system captured changes in recommendations much faster than
they actually occurred.

That may not have been the *most* efficient way to solve the problem, but it
was efficient enough to be hard to beat.

-- 
Ted Dunning, CTO
DeepDyve
