In the interest of getting some empirical data out about various architectures:
On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> ...
>> You use the user history vector as a query?
>
> The most recent suffix of the history vector. How much is used varies by
> the purpose.

We did some experiments with this using a year+ of e-com data. We measured precision using different amounts of the history vector, in 3-month increments. Precision increased throughout the year. At about 9 months the effects of what appears to be item/product/catalog/new-model churn began to become significant, and so precision started to level off. Note that we did *not* filter the recs, so items no longer in the current catalog were still counted when precision was measured. We'd expect filtering them out to improve results when using older data. In our case we never found a good truncation point, though it looked like we were reaching one when the data ran out. Even the last 3 months of data produced a 4.5% better precision score.

>> ...
>> Seems like you'd rule out browser based storage because you need the
>> history to train your next model. At least it would be in addition to a
>> server based storage of history.
>
> Yes. In addition to.
>
>> The user history matrix will be quite a bit larger than the user
>> recommendation matrix, maybe an order or two larger.
>
> I don't think so. And it doesn't matter since this is reduced to
> significant cooccurrence and that is typically quite small compared to a
> list of recommendations for all users.
>
>> I have 20 recs for me stored but I've purchased 100's of items, and have
>> viewed 1000's.
>
> 20 recs is not sufficient. Typically you need 300 for any given context
> and you need to recompute those very frequently. If you use geo-specific
> recommendations, you may need thousands of recommendations to have enough
> geo-dispersion. The search engine approach can handle all of that on the
> fly.
>
> Also, the cached recs are user x (20-300) non-zeros. The sparsified
> item-item cooccurrence matrix is item x 50. Moreover, search engines are
> very good at compression. If users >> items, then item x 50 is much
> smaller, especially after high quality compression (6:1 is a common
> compression ratio).

The end application designed by the e-com customer required fewer than 10 recs for any given context, so 20 gave us room for runtime context-type boosting. Given that precision kept increasing over a full year of user history, and that we only needed to return 20 recs per user and per item, the history matrix was nearly two orders of magnitude larger than the recs matrix (rough arithmetic below). This was with about 5M users and 500K items over a year.

The issue I was asking about was how to store and retrieve history vectors for queries. In our case it looks like some kind of scalable persistence store would be required, and since pre-calculated recs are indeed much smaller...

I fully believe your description of how well search engines store their index. The cooccurrence matrix is already sparsified by a similarity metric, and any compression that Solr does will help keep the index small. In any case Solr supports sharding, so it can scale past one machine anyway.
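To put rough numbers on that size comparison (my back-of-the-envelope; reading "purchased 100's, viewed 1000's" as ~1,000 interactions per user is an assumption):

    recs matrix:     5M users x 20 recs              = 100M non-zeros
    history matrix:  5M users x ~1,000 interactions  = ~5B non-zeros
    ratio:           roughly 50:1, approaching two orders of magnitude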
>> Given that you have to have the entire user history vector to do the query,
>> and given that this is still a lookup from an even larger matrix than the
>> recs/user matrix, and given that you have to do the lookup before the Solr
>> query, it can't be faster than just looking up pre-calculated recs.
>
> None of this applies. There is an item x 50 sized search index. There is
> a recent history that is available without a lookup. All that is required
> is a single Solr query, and that can handle multiple kinds of history and
> geo-location and user search terms all in a single step.

Yes, with a search engine the index is very small, but the history vectors are not. Actually, I wonder how well Solr would handle a very large query. Is truncation of the history vector perhaps required for that reason? (There is a sketch of what such a query might look like at the end of this mail.)

>> Something here may be "orders of magnitude" faster, but it isn't the total
>> elapsed time to return recs at runtime, right?
>
> Actually, it is. Round trip of less than 10ms is common. Precalculation
> goes away. Export of recs nearly goes away. Currency of recommendations
> is much higher.

This is certainly great performance, no doubt. Using a 12-node Cassandra ring (each machine had 16G of memory) spread across two geo-locations, we got 24,000 tps, degrading to a worst case of 5,000 tps. The average response time for the entire system (which included two internal service layers and one query to Cassandra) was 5-10ms per request.

>> Maybe what you are saying is the time to pre-calculate the recs is 0 since
>> they are calculated at runtime, but you still have to create the
>> cooccurrence matrix, so you still need something like Mahout on Hadoop to
>> produce a model, and you still need to index the model with Solr, and you
>> still need to look up user history at runtime. Indexing with Solr is faster
>> than loading a db (8 hours? They are doing something wrong) but the query
>> side will be slower unless I've missed something.
>
> I am pretty sure you have. The customers are definitely not dopes. The
> problem is that precalculated recs are much, much bigger due to geo
> constraints.

Not sure about geo-constraints; we did not consider these.

There were cases where Cassandra would inexplicably bog down and produce bizarrely slow tps, but they were rare and the subject of eradication efforts. The people responsible thought they were related to some internal journal cleanup or compaction and were trying to fix the problem. These are the bane of any large complex system, to be sure. Still, the system's performance was very high.

I personally wrote the pre-calculated recs to a dev Cassandra instance: 5M users and 500K items, so 5.5M rows including all created recs (20+) from a Mahout item-based recommender, in 3 hours. Dismally slow, I know, but the writing was done from a single process reading from HDFS and writing to a single-machine Cassandra 'cluster'. The obvious speedups would be to load the db using Hadoop and to use a high-performance multi-machine ring.

Since recs are pre-calculated and we know how large they are (5.5M x 20 = 110M values), it would be simple enough to put them into memory if a further speedup were required. But in our case we had to get some user-specific things from Cassandra anyway (user login name, profile info, etc.), so the query by user had to be made regardless. Whatever was stored under the user's key was virtually free (see the lookup sketch at the end of this mail).

Anyway, that's just one case study; hopefully it will help someone decide on an architecture based on their own resources and requirements.
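P.S. To make the search-engine approach concrete, here is a minimal SolrJ sketch of the kind of single query described above: the most recent suffix of the user's history becomes the query against an index with one document per item, whose "indicators" field holds that item's ~50 significantly cooccurring item ids. The core name, field names, and geo filter are my assumptions for illustration, not anything from an actual deployment.

    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class HistoryQuerySketch {
      // One Solr doc per item; "indicators" holds cooccurring item ids.
      private final SolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/items");

      public void printRecs(List<String> recentHistory, String geoFq)
          throws SolrServerException {
        // The recent history suffix is the query; items whose indicator
        // lists overlap it most score highest.
        StringBuilder terms = new StringBuilder();
        for (String itemId : recentHistory) {
          if (terms.length() > 0) terms.append(' ');
          terms.append(itemId);
        }
        SolrQuery q = new SolrQuery();
        q.setQuery("indicators:(" + terms + ")");
        if (geoFq != null) {
          q.addFilterQuery(geoFq); // e.g. "region:pnw" (hypothetical field)
        }
        q.setFields("id", "score");
        q.setRows(20); // enough recs to allow context-type boosting later
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("id")
              + " score=" + doc.getFieldValue("score"));
        }
      }
    }

The point being that a user's search terms, geo filter, and different kinds of history are all just additional clauses on the same query, which is why no separate lookup step is needed beyond having the recent history at hand.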
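P.P.S. And for the pre-calculated-recs side, a sketch of the "virtually free" read: since we had to fetch the user's profile row anyway, the rec list can simply live in the same row. This uses the DataStax Java driver and an invented table layout (a user_recs table with a rec_ids list column); our actual system went through internal service layers, so treat this purely as illustration.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class RecsLookupSketch {
      public static void main(String[] args) {
        Cluster cluster =
            Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ecom"); // hypothetical keyspace

        // One row per user: profile fields plus the pre-calculated rec
        // list, so recs ride along with the profile read we make anyway.
        Row row = session.execute(
            "SELECT login_name, rec_ids FROM user_recs"
                + " WHERE user_id = 'u12345'").one();
        if (row != null) {
          System.out.println("user: " + row.getString("login_name"));
          System.out.println("recs: " + row.getList("rec_ids", String.class));
        }
        cluster.shutdown();
      }
    }

And as noted above, at 5.5M rows x 20 ids the whole recs table is small enough to cache in memory if the db read ever became the bottleneck.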