Johannes,

Your summary is good.
I would add that the precalculated recommendations can be large enough that the lookup itself becomes more expensive. Your point about staleness is very on-point.

On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte <johannes.schu...@gmail.com> wrote:

> I think Pat is just saying that
>
> time(history_lookup) (1) + time(recommendation_calculation) (2) > time(precalc_lookup) (3)
>
> since (1) and (3) are assumed to be served by the same system class (key-value store, db) with a single key, and (2) > 0.
>
> Ted is using a lot of information that is available at recommendation time and not fetched from somewhere ("context of delivery", geolocation). The question remaining is why the recent history is available without a lookup, which can only be the case if the recommendation calculation is embedded in a bigger request cycle, the history is loaded somewhere else, or it's just stored in the browser.
>
> If you would store the classical (Netflix/Mahout) user-item history in the browser and use an on-disk matrix thing like Lucene for the calculation, you would end up in the same range.
>
> I think the points are more:
>
> 1. Having more inputs than the classical item interactions (geolocation->item, search_term->item ...) can be carried out very easily with a search index storing these precalculated "association rules".
>
> 2. Precalculation per user is heavyweight, stale, and hard to do if the context also plays a role (e.g. the site the user is on, because you have to have the cartesian product of recommendations prepared for every user), while the "real time" approach can handle it.
>
> On Tue, May 21, 2013 at 2:00 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Inline answers.
> >
> > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >
> > > ...
> > > You use the user history vector as a query?
> >
> > The most recent suffix of the history vector. How much is used varies by the purpose.
> >
> > > This will be a list of item IDs and strength-of-preference values (maybe 1s for purchases).
> >
> > Just a list of item x action codes. No strength needed. If you have 5-point ratings, then you can have 5 actions for each item. The weighting for each action can be learned.
> >
> > > The cooccurrence matrix has columns treated like terms and rows treated like documents, though both are really items.
> >
> > Well, they are different. The rows are fields within documents associated with an item. Other fields include the ID and other things. The contents of the field are the codes associated with the item-action pairs for each non-null column. Usually there is only one action, so this reduces to a single column per item.
> >
> > > Does Solr support weighted term lists as queries or do you have to throw out strength-of-preference?
> >
> > I prefer to throw it out even though Solr would not require me to do so. The weights that I want can be encoded in the document index in any case.
> >
> > > I ask because there are cases where the query will have non-'1.0' values. When the strength values are just 1 the vector is really only a list of terms (item IDs).
> >
> > I really don't know of any cases where this is really true. There are actions that are categorical. I like to separate them out or reduce to a binary case.
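As a concrete illustration of the "item x action codes" idea above, here is a minimal sketch of flattening recent (item, action) history into query tokens. The token scheme ("item4321_buy"), the class name, and the truncation length are illustrative assumptions, not details of Ted's actual system; a 5-point rating would simply become one of five action codes (rate1 ... rate5).

    import java.util.ArrayList;
    import java.util.List;

    public class HistoryTokens {

        // Keep only the most recent suffix of the history; older events feed
        // offline analytics rather than the live query.
        static List<String> recentTokens(List<String[]> history, int maxEvents) {
            int from = Math.max(0, history.size() - maxEvents);
            List<String> tokens = new ArrayList<String>();
            for (String[] event : history.subList(from, history.size())) {
                String itemId = event[0];   // e.g. "item4321"
                String action = event[1];   // e.g. "view", "buy", "rate5"
                tokens.add(itemId + "_" + action);  // assumed token scheme
            }
            return tokens;
        }
    }

Note that no strength-of-preference survives this encoding; per Ted's answer above, any weighting lives in the document index, not in the query.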
> > > This technique seems like using a doc as a query, but you have reduced the doc to the form of a vector of weighted terms. I was unaware that Solr allowed weighted term queries. This is really identical to using Solr for fast doc-similarity queries.
> >
> > It is really more like an ordinary query. Typical recommendation queries are short since they are only recent history.
> >
> > > ...
> > > Seems like you'd rule out browser-based storage because you need the history to train your next model.
> >
> > Nothing says that we can't store data in two places according to use. Browser history is good for the part of the history that becomes the query. Central storage is good for the mass of history that becomes input for analytics.
> >
> > > At least it would be in addition to a server-based storage of history.
> >
> > Yes. In addition to.
> >
> > > Another reason you wouldn't rely only on browser storage is that it will occasionally be destroyed. Users span multiple devices these days too.
> >
> > This can be dealt with using cookie resurrection techniques. Or by letting the user destroy their copy of the history if they like.
> >
> > > The user history matrix will be quite a bit larger than the user recommendation matrix, maybe an order or two larger.
> >
> > I don't think so. And it doesn't matter, since this is reduced to significant cooccurrence, and that is typically quite small compared to a list of recommendations for all users.
> >
> > > I have 20 recs for me stored but I've purchased 100's of items and have viewed 1000's.
> >
> > 20 recs is not sufficient. Typically you need 300 for any given context and you need to recompute those very frequently. If you use geo-specific recommendations, you may need thousands of recommendations to have enough geo-dispersion. The search engine approach can handle all of that on the fly.
> >
> > Also, the cached recs are user x (20-300) non-zeros. The sparsified item-item cooccurrence matrix is item x 50. Moreover, search engines are very good at compression. If users >> items, then item x 50 is much smaller, especially after high-quality compression (6:1 is a common compression ratio).
> >
> > > Given that you have to have the entire user history vector to do the query, and given that this is still a lookup from an even larger matrix than the recs/user matrix, and given that you have to do the lookup before the Solr query, it can't be faster than just looking up pre-calculated recs.
> >
> > None of this applies. There is an item x 50 sized search index. There is a recent history that is available without a lookup. All that is required is a single Solr query, and that can handle multiple kinds of history and geo-location and user search terms all in a single step.
> >
> > > In other words the query to produce the query will be more problematic than the query to produce the result, right?
> >
> > Nope. No such thing, therefore cost = 0.
> >
> > > Something here may be "orders of magnitude" faster, but it isn't the total elapsed time to return recs at runtime, right?
> >
> > Actually, it is. Round trips of less than 10ms are common. Precalculation goes away. Export of recs nearly goes away. Currency of recommendations is much higher.
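To make the "single Solr query" step concrete, here is a minimal SolrJ-style sketch (Solr 4.x era) under stated assumptions: the core name, the "indicators" and "geo" field names, the token scheme, and the filter value are all hypothetical, not taken from the system discussed here.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class RecommendQuery {
        public static void main(String[] args) throws SolrServerException {
            // Hypothetical core holding one document per item, with an
            // "indicators" field of significant-cooccurrence codes (item x 50).
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

            // Recent-history tokens (see the earlier sketch); unweighted,
            // matching the preference above to keep the query binary.
            String history = "item4321_buy item1888_view item0099_view";

            SolrQuery q = new SolrQuery("indicators:(" + history + ")");
            q.addFilterQuery("geo:us-west"); // context folded into the same query
            q.setRows(300);                  // ~300 candidates per context, as above

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }

The point of the sketch is that recent history, geolocation, and search terms all arrive as ordinary query clauses, so one round trip (the <10ms figure above) replaces both the history lookup and the precalc lookup.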
> > > Maybe what you are saying is that the time to pre-calculate the recs is 0 since they are calculated at runtime, but you still have to create the cooccurrence matrix, so you still need something like Mahout on Hadoop to produce a model, you still need to index the model with Solr, and you still need to look up user history at runtime. Indexing with Solr is faster than loading a db (8 hours? They are doing something wrong), but the query side will be slower, unless I've missed something.
> >
> > I am pretty sure you have. The customers are definitely not dopes. The problem is that precalculated recs are much, much bigger due to geo constraints.
> >
> > > In any case you *have* introduced a realtime rec calculation. This is able to use user history that may be seconds old and not yet reflected in the training data (the cooccurrence matrix), and this is very interesting!
> > >
> > > >> This will scale to thousands or tens of thousands of recommendations per second against 10's of millions of items. The number of users doesn't matter.
> > >
> > > Yes, no doubt, but the history lookup is still an issue unless I've missed something. The NoSQL queries will scale to tens of thousands of recs against 10s of millions of items, but perhaps with larger, more complex infrastructure? Not sure how Solr scales.
> > >
> > > Being semi-ignorant of Solr, intuition says that it's doing something to speed things up, like using only part of the data somewhere to do approximations. Have there been any performance comparisons of, say, the precision of one approach vs. the other, or do they return identical results?
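A back-of-the-envelope comparison may help with the size question. The user and item counts here are purely illustrative (10M users, 1M items); only the 300-recs-per-context and item x 50 figures come from Ted's message above:

    precalc store:    10,000,000 users x 300 recs = 3.0e9 entries, per geo context
    indicator index:   1,000,000 items x 50 codes = 5.0e7 entries, total

That is roughly a 60x difference before geo-specific contexts multiply the precalculated side further, and before the ~6:1 index compression mentioned above.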