Inline answers.

On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> ...
> You use the user history vector as a query?


The most recent suffix of the history vector.  How much is used varies by
the purpose.


> This will be a list of item IDs and strength-of-preference values (maybe
> 1s for purchases).


Just a list of item x action codes.  No strength needed.  If you have
5-point ratings, then you can have 5 actions for each item.  The weighting
for each action can be learned.
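
As a sketch of what I mean by item x action codes (the "itemID_rating"
encoding and the function below are purely my own illustration, not a
Mahout or Solr convention):

    # Hypothetical encoding: each (item, rating) pair becomes a distinct
    # item-action code, so a 5-point scale yields 5 possible actions per
    # item and the weight of each action can be learned later instead of
    # being passed along as a strength value.
    def item_action_codes(history):
        """history: list of (item_id, rating) tuples, e.g. [("item42", 5)]"""
        return ["%s_r%d" % (item_id, rating) for item_id, rating in history]

    print(item_action_codes([("item42", 5), ("item17", 2)]))
    # ['item42_r5', 'item17_r2']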


> The cooccurrence matrix has columns treated like terms and rows treated
> like documents though both are really items.


Well, they are different.  The rows become fields within documents, one
document per item.  Other fields include the item ID and other metadata.
The contents of the field are the codes associated with the item-action
pairs for each non-null column.  Usually there is only one action, so this
reduces to a single column per item.
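
For concreteness, a minimal sketch of what one such document might look
like when posted to Solr's JSON update handler (the core name "items" and
the field names below are my own assumptions, not from this thread):

    import json, requests

    # One document per item.  The "indicators" field holds the item-action
    # codes for the non-null columns of that item's row in the sparsified
    # cooccurrence matrix; "id" is ordinary item metadata.
    doc = {
        "id": "item42",
        "indicators": ["item17_view", "item23_purchase", "item99_purchase"],
    }
    requests.post(
        "http://localhost:8983/solr/items/update?commit=true",
        data=json.dumps([doc]),
        headers={"Content-Type": "application/json"},
    )

With only one action, the codes collapse to plain item IDs, which is the
single-column-per-item case above.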


> Does Solr support weighted term lists as queries or do you have to throw
> out strength-of-preference?


I prefer to throw it out even though Solr would not require me to do so.
The weights that I want can be encoded in the document index in any case.


> I ask because there are cases where the query will have non '1.0' values.
> When the strength values are just 1 the vector is really only a list of
> terms (item IDs).
>

I don't know of any cases where this is really true.  There are actions
that are categorical; I like to separate them out or reduce them to a
binary case.


>
> This technique seems like using a doc as a query but you have reduced the
> doc to the form of a vector of weighted terms. I was unaware that Solr
> allowed weighted term queries. This is really identical to using Solr for
> fast doc similarity queries.
>

It is really more like an ordinary query.  Typical recommendation queries
are short since they contain only recent history.
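
As a sketch (the core name, field name, and history codes are assumptions
carried over from the examples above), the query is just the recent
history OR'ed against the indicator field:

    import requests

    # Most recent suffix of the user's history, as item-action codes.
    recent = ["item17_view", "item23_purchase"]

    params = {
        "q": "indicators:(%s)" % " ".join(recent),  # plain OR query, no weights
        "fl": "id,score",
        "rows": 20,
    }
    resp = requests.get("http://localhost:8983/solr/items/select", params=params)
    print(resp.json()["response"]["docs"])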


>
> ...
>
> Seems like you'd rule out browser based storage because you need the
> history to train your next model.


Nothing says that we can't store data in two places according to use.
 Browser history is good for the part of the history that becomes the
query.  Central storage is good for the mass of history that becomes input
for analytics.

> At least it would be in addition to a server-based storage of history.


Yes.  In addition to.


> Another reason you wouldn't rely only on a browser storage is that it will
> be occasionally destroyed. Users span multiple devices these days too.
>

This can be dealt with using cookie resurrection techniques, or by letting
the user destroy their copy of the history if they like.

> The user history matrix will be quite a bit larger than the user
> recommendation matrix, maybe an order or two larger.


I don't think so.  And it doesn't matter, since the history is reduced to
significant cooccurrences, which are typically quite small compared to a
list of recommendations for all users.

> I have 20 recs for me stored but I've purchased 100's of items, and have
> viewed 1000's.
>

20 recs is not sufficient.  Typically you need 300 for any given context
and you need to recompute those very frequently.  If you use geo-specific
recommendations, you may need thousands of recommendations to have enough
geo-dispersion.  The search engine approach can handle all of that on the
fly.

Also, the cached recs are user x (20-300) non-zeros.  The sparsified
item-item cooccurrence matrix is item x 50.  Moreover, search engines are
very good at compression.  If users >> items, then item x 50 is much
smaller, especially after high quality compression (6:1 is a common
compression ratio).
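
As a back-of-envelope comparison (the user, item, and byte counts here are
illustrative assumptions, not measurements):

    # 10M users x 300 cached recs vs. 1M items x 50 indicator entries,
    # assuming roughly 8 bytes per entry before compression.
    users, items = 10_000_000, 1_000_000
    cached_recs = users * 300 * 8            # ~24 GB of precomputed recs
    search_index = items * 50 * 8 / 6        # ~67 MB at a 6:1 compression ratio
    print("%.1f GB vs %.0f MB" % (cached_recs / 1e9, search_index / 1e6))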


>
> Given that you have to have the entire user history vector to do the query
> and given that this is still a lookup from an even larger matrix than the
> recs/user matrix and given that you have to do the lookup before the Solr
> query, it can't be faster than just looking up pre-calculated recs.


None of this applies.  There is an item x 50 sized search index.  There is
a recent history that is available without a lookup.  All that is required
is a single Solr query, and that query can handle multiple kinds of history,
geo-location, and user search terms in a single step.
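
A sketch of that single query (the geo field "location", the point, and
the search terms are assumptions for illustration):

    import requests

    recent = ["item17_view", "item23_purchase"]   # recent history codes
    params = {
        # history terms and user search terms in one query...
        "q": "indicators:(%s) OR title:(garden hose)" % " ".join(recent),
        # ...plus a geo filter, all handled by Solr in a single request
        "fq": "{!geofilt sfield=location pt=45.5,-122.6 d=50}",
        "rows": 300,
    }
    resp = requests.get("http://localhost:8983/solr/items/select", params=params)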



> In other words the query to produce the query will be more problematic
> than the query to produce the result, right?
>

Nope.  No such thing, therefore cost = 0.


> Something here may be "orders of magnitude" faster, but it isn't the total
> elapsed time to return recs at runtime, right?


Actually, it is.  Round trip of less than 10ms is common.  Precalculation
goes away.  Export of recs nearly goes away.  Currency of recommendations
is much higher.


> Maybe what you are saying is the time to pre-calculate the recs is 0 since
> they are calculated at runtime but you still have to create the
> cooccurrence matrix so you still need something like mahout hadoop to
> produce a model and you still need to index the model with Solr and you
> still need to lookup user history at runtime. Indexing with Solr is faster
> than loading a db (8 hours? They are doing something wrong) but the query
> side will be slower unless I've missed something.
>

I am pretty sure you have.  The customers are definitely not dopes.  The
problem is that precalculated recs are much, much bigger due to geo
constraints.


>
> In any case you *have* introduced a realtime rec calculation. This is able
> to use user history that may be seconds old and not yet reflected in the
> training data (the cooccurrence matrix) and this is very interesting!
>
> >>
> >> This will scale to thousands or tens of thousands of recommendations per
> >> second against 10's of millions of items.  The number of users doesn't
> >> matter.
> >>
> >
>
> Yes, no doubt, but the history lookup is still an issue unless I've missed
> something. The NoSQL queries will scale to tens of thousands of recs
> against 10s of millions of items but perhaps with larger more complex
> infrastructure? Not sure how Solr scales.
>
> Being semi-ignorant of Solr, intuition says that it's doing something to
> speed things up like using only part of the data somewhere to do
> approximations. Have there been any performance comparisons of say
> precision of one approach vs the other or do they return identical results?