Hi Pat,

On May 20, 2013, at 9:46am, Pat Ferrel wrote:
> I certainly have questions about this architecture mentioned below but first
> let me make sure I understand.
>
> You use the user history vector as a query? This will be a list of item IDs
> and strength-of-preference values (maybe 1s for purchases). The cooccurrence
> matrix has columns treated like terms and rows treated like documents, though
> both are really items. Does Solr support weighted term lists as queries

Yes - you can "boost" individual terms in the query. And you can use payloads
on terms in the index to adjust their scores as well.

-- Ken

> or do you have to throw out strength-of-preference? I ask because there are
> cases where the query will have non '1.0' values. When the strength values
> are just 1 the vector is really only a list of terms (item IDs).
>
> This technique seems like using a doc as a query, but you have reduced the doc
> to the form of a vector of weighted terms. I was unaware that Solr allowed
> weighted term queries. This is really identical to using Solr for fast doc
> similarity queries.
>
>>> Using a cooccurrence matrix means you are doing item similarity since
>>> there is no user data in the matrix. Or are you talking about using the
>>> user history as the query? In which case you have to remember somewhere all
>>> users' history and look it up for the query, no?
>>>
>>
>> Yes. You do. And that is the key to making this orders of magnitude
>> faster.
>>
>> But that is generally fairly trivial to do. One option is to keep it in a
>> cookie. Another is to use browser persistent storage. Another is to use a
>> memory-based user profile database. Yet another is to use M7 tables on
>> MapR or HBase on other Hadoop distributions.
>>
>
> Seems like you'd rule out browser-based storage because you need the history
> to train your next model. At least it would be in addition to a server-side
> store of history. Another reason you wouldn't rely only on browser storage
> is that it will occasionally be destroyed. Users span multiple devices these
> days too.
>
> The user history matrix will be quite a bit larger than the user
> recommendation matrix, maybe an order or two larger. I have 20 recs for me
> stored, but I've purchased 100's of items and have viewed 1000's.
>
> Given that you have to have the entire user history vector to do the query,
> and given that this is still a lookup from an even larger matrix than the
> recs/user matrix, and given that you have to do the lookup before the Solr
> query, it can't be faster than just looking up pre-calculated recs. In other
> words the query to produce the query will be more problematic than the query
> to produce the result, right?
>
> Something here may be "orders of magnitude" faster, but it isn't the total
> elapsed time to return recs at runtime, right? Maybe what you are saying is
> that the time to pre-calculate the recs is 0 since they are calculated at
> runtime, but you still have to create the cooccurrence matrix, so you still
> need something like Mahout on Hadoop to produce a model, and you still need
> to index the model with Solr, and you still need to look up user history at
> runtime. Indexing with Solr is faster than loading a db (8 hours? They are
> doing something wrong), but the query side will be slower unless I've missed
> something.
>
> In any case you *have* introduced a realtime rec calculation. This is able to
> use user history that may be seconds old and not yet reflected in the
> training data (the cooccurrence matrix), and this is very interesting!
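To make the "boosted terms" bit above concrete, here's a rough SolrJ (4.x-style) sketch of turning a user history vector into a weighted-term query. Everything in it is illustrative - the core name ("items"), the field names ("id" and "indicators"), the URL, and the example item IDs and weights are placeholders, not anything from an actual setup:

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class BoostedHistoryQuery {
        public static void main(String[] args) throws Exception {
            // Each indexed doc is one row of the cooccurrence matrix: the doc id
            // is an item, and the "indicators" field holds the items that
            // co-occur with it.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

            // The user's history vector: item ID -> strength of preference.
            // With purchase-only data these would all be 1.0.
            Map<String, Double> history = new LinkedHashMap<String, Double>();
            history.put("item123", 1.0);
            history.put("item456", 2.5);

            // Build a boosted-term query, e.g. indicators:(item123^1.0 item456^2.5)
            StringBuilder q = new StringBuilder("indicators:(");
            for (Map.Entry<String, Double> e : history.entrySet()) {
                q.append(e.getKey()).append('^').append(e.getValue()).append(' ');
            }
            q.append(')');

            SolrQuery query = new SolrQuery(q.toString());
            query.setRows(20); // top 20 recommendations, already score-ordered

            QueryResponse rsp = solr.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }

Payloads are the other knob: you'd index each indicator term with a payload carrying its cooccurrence strength and plug in a payload-aware similarity, but the query-side boosts above are usually enough to get started.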
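And on the history lookup itself: if the profile store is HBase (or M7 via the HBase API), the runtime cost is one Get per request. Again just a sketch - the table name "user_history", the "h" column family, and the one-column-per-item layout are assumptions for illustration:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.NavigableMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HistoryLookup {
        // Returns the user's history vector (item ID -> weight), which then
        // becomes the boosted-term Solr query sketched above.
        public static Map<String, Double> fetchHistory(String userId) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "user_history");
            try {
                // Row key = user ID, one column per item, cell value = weight.
                Result row = table.get(new Get(Bytes.toBytes(userId)));
                NavigableMap<byte[], byte[]> cols = row.getFamilyMap(Bytes.toBytes("h"));
                Map<String, Double> history = new LinkedHashMap<String, Double>();
                if (cols != null) {
                    for (Map.Entry<byte[], byte[]> e : cols.entrySet()) {
                        history.put(Bytes.toString(e.getKey()), Bytes.toDouble(e.getValue()));
                    }
                }
                return history;
            } finally {
                table.close();
            }
        }
    }

The Get is a single random read of a small row, so the "query to produce the query" stays cheap even as the item catalog grows.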
>
>>>
>>> This will scale to thousands or tens of thousands of recommendations per
>>> second against 10's of millions of items. The number of users doesn't
>>> matter.
>>>
>>
>
> Yes, no doubt, but the history lookup is still an issue unless I've missed
> something. The NoSQL queries will scale to tens of thousands of recs against
> 10s of millions of items, but perhaps with larger, more complex
> infrastructure? Not sure how Solr scales.
>
> Being semi-ignorant of Solr, intuition says that it's doing something to
> speed things up, like using only part of the data somewhere to do
> approximations. Have there been any performance comparisons of, say,
> precision of one approach vs the other, or do they return identical results?

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr