I certainly have questions about the architecture mentioned below, but first let me make sure I understand.
You use the user history vector as a query? This will be a list of item IDs and strength-of-preference values (maybe 1s for purchases). The cooccurrence matrix has columns treated like terms and rows treated like documents, though both are really items. Does Solr support weighted term lists as queries, or do you have to throw out strength-of-preference? I ask because there are cases where the query will have non-1.0 values. When the strength values are all 1, the vector is really just a list of terms (item IDs). This technique seems like using a doc as a query, but you have reduced the doc to the form of a vector of weighted terms. I was unaware that Solr allowed weighted term queries. This is really identical to using Solr for fast doc-similarity queries.

>> Using a cooccurrence matrix means you are doing item similarity since
>> there is no user data in the matrix. Or are you talking about using the
>> user history as the query? In which case you have to remember somewhere all
>> users' history and look it up for the query, no?
>
> Yes. You do. And that is the key to making this orders of magnitude
> faster.
>
> But that is generally fairly trivial to do. One option is to keep it in a
> cookie. Another is to use browser persistent storage. Another is to use a
> memory-based user profile database. Yet another is to use M7 tables on
> MapR or HBase on other Hadoop distributions.

Seems like you'd rule out browser-based storage because you need the history to train your next model; at best it would be in addition to a server-based store of history. Another reason you wouldn't rely only on browser storage is that it will occasionally be destroyed, and users span multiple devices these days too. The user history matrix will be quite a bit larger than the user recommendation matrix, maybe an order of magnitude or two larger. I have 20 recs stored for me, but I've purchased hundreds of items and viewed thousands.
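On the weighted-term question: Lucene/Solr query syntax does support per-term boosts via `term^weight`, so strength-of-preference need not be thrown out. A minimal sketch of turning a history vector into such a query (the field name `indicators` and the item IDs are hypothetical, not from any real schema):

```python
# Sketch: a weighted user-history vector expressed as a boosted Solr query.
# Lucene/Solr query syntax allows "term^weight", so non-1.0 strengths survive.
# The field name "indicators" is a made-up example.

def history_to_solr_query(history, field="indicators"):
    """history: {item_id: strength}. Returns a boosted OR query string."""
    clauses = []
    for item_id, strength in sorted(history.items()):
        if strength == 1.0:
            clauses.append(f"{field}:{item_id}")             # plain term
        else:
            clauses.append(f"{field}:{item_id}^{strength}")  # boosted term
    return " OR ".join(clauses)

q = history_to_solr_query({"item17": 1.0, "item42": 2.5})
print(q)  # -> indicators:item17 OR indicators:item42^2.5
```

When all strengths are 1.0 this degenerates to the plain list-of-terms case, which matches the "doc as a query" reading above.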
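To make the storage point concrete, here's a toy stand-in for the role a server-side store (HBase, M7 tables, a memory-based profile db) would play. It illustrates why browser-only storage isn't enough: the same histories must serve both the runtime lookup and the export that trains the next model. Everything here is an illustrative sketch, not any particular store's API:

```python
# Toy server-side user-history store.  Real deployments would back this
# with HBase / M7 tables; the point is that one store serves two consumers:
# the runtime history lookup and the next training run.
from collections import defaultdict

class HistoryStore:
    def __init__(self):
        self._events = defaultdict(dict)  # user_id -> {item_id: strength}

    def record(self, user_id, item_id, strength=1.0):
        self._events[user_id][item_id] = strength

    def history(self, user_id):
        # Runtime lookup: the extra query that precedes the Solr query.
        return dict(self._events.get(user_id, {}))

    def all_interactions(self):
        # Training export: feeds the next cooccurrence-matrix build.
        for user_id, items in self._events.items():
            for item_id, strength in items.items():
                yield user_id, item_id, strength

store = HistoryStore()
store.record("u1", "item42")
store.record("u1", "item17", 2.5)
print(store.history("u1"))  # {'item42': 1.0, 'item17': 2.5}
```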
Given that you have to have the entire user history vector to do the query, given that this is still a lookup from an even larger matrix than the recs/user matrix, and given that you have to do that lookup before the Solr query, it can't be faster than just looking up pre-calculated recs. In other words, the query to produce the query will be more problematic than the query to produce the result, right? Something here may be "orders of magnitude" faster, but it isn't the total elapsed time to return recs at runtime, right?

Maybe what you are saying is that the time to pre-calculate the recs is zero, since they are calculated at runtime. But you still have to create the cooccurrence matrix, so you still need something like Mahout on Hadoop to produce a model, you still need to index the model with Solr, and you still need to look up user history at runtime. Indexing with Solr is faster than loading a db (8 hours? They are doing something wrong), but the query side will be slower unless I've missed something.

In any case you *have* introduced a realtime rec calculation. It is able to use user history that may be seconds old and not yet reflected in the training data (the cooccurrence matrix), and this is very interesting!

>> This will scale to thousands or tens of thousands of recommendations per
>> second against 10's of millions of items. The number of users doesn't
>> matter.

Yes, no doubt, but the history lookup is still an issue unless I've missed something. The NoSQL queries will scale to tens of thousands of recs against 10s of millions of items, but perhaps with larger, more complex infrastructure? Not sure how Solr scales. Being semi-ignorant of Solr, my intuition says it's doing something to speed things up, like using only part of the data somewhere to do approximations. Have there been any performance comparisons, say precision of one approach vs. the other, or do they return identical results?
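To pin down the realtime point, here's a toy version of the whole loop: an offline cooccurrence matrix (the artifact Mahout on Hadoop would produce, reduced here to raw pair counts with no LLR filtering) multiplied at runtime by a fresh history vector. The fresh vector is exactly what lets seconds-old behavior influence the recs even though the matrix itself is hours old. This is a sketch of the math, not what Solr actually executes — Solr approximates this score with term matching and its own ranking:

```python
# Offline step: cooccurrence counts from user baskets (stand-in for the
# Mahout job; real systems would apply LLR to keep only significant pairs).
# Runtime step: score = history-weighted sum of cooccurrence rows.
from collections import defaultdict
from itertools import combinations

def cooccurrence(user_baskets):
    """user_baskets: iterable of sets of item IDs -> {item: {other: count}}."""
    cooc = defaultdict(lambda: defaultdict(int))
    for basket in user_baskets:
        for a, b in combinations(sorted(basket), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def recommend(history, cooc, top_n=3):
    """history: {item: strength}, possibly seconds old and newer than cooc."""
    scores = defaultdict(float)
    for item, strength in history.items():
        for other, count in cooc.get(item, {}).items():
            if other not in history:  # don't re-recommend items already seen
                scores[other] += strength * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}]
cooc = cooccurrence(baskets)           # stale, built offline
print(recommend({"a": 1.0}, cooc))     # fresh history vector -> ['b', 'c']
```

The history lookup that precedes `recommend` is the step I'm questioning above: it has to happen before any scoring, whether the scoring is done by Solr or by a pre-calculated recs table.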