Hi Pat,

On May 20, 2013, at 9:46am, Pat Ferrel wrote:
> I certainly have questions about this architecture mentioned below but first
> let me make sure I understand.
>
> You use the user history vector as a query? This will be a list of item IDs
> and strength-of-preference values (maybe 1s for purchases). The cooccurrence
> matrix has columns treated like terms and rows treated like documents, though
> both are really items. Does Solr support weighted term lists as queries

Yes - you can "boost" individual terms in the query. And you can use payloads
on terms in the index to adjust their scores as well.

-- Ken

> or do you have to throw out strength-of-preference? I ask because there are
> cases where the query will have non '1.0' values. When the strength values
> are just 1 the vector is really only a list of terms (item IDs).
>
> This technique seems like using a doc as a query, but you have reduced the doc
> to the form of a vector of weighted terms. I was unaware that Solr allowed
> weighted term queries. This is really identical to using Solr for fast doc
> similarity queries.
>
>>> Using a cooccurrence matrix means you are doing item similarity since
>>> there is no user data in the matrix. Or are you talking about using the
>>> user history as the query? In which case you have to remember somewhere all
>>> users' history and look it up for the query, no?
>>>
>>
>> Yes. You do. And that is the key to making this orders of magnitude
>> faster.
>>
>> But that is generally fairly trivial to do. One option is to keep it in a
>> cookie. Another is to use browser persistent storage. Another is to use a
>> memory-based user profile database. Yet another is to use M7 tables on
>> MapR or HBase on other Hadoop distributions.
>>
>
> Seems like you'd rule out browser-based storage because you need the history
> to train your next model. At least it would be in addition to a server-side
> store of history. Another reason you wouldn't rely only on browser storage
> is that it will occasionally be destroyed. Users span multiple devices these
> days too.
>
> The user history matrix will be quite a bit larger than the user
> recommendation matrix, maybe an order or two larger. I have 20 recs for me
> stored, but I've purchased 100's of items and have viewed 1000's.
>
> Given that you have to have the entire user history vector to do the query,
> and given that this is still a lookup from an even larger matrix than the
> recs/user matrix, and given that you have to do the lookup before the Solr
> query, it can't be faster than just looking up pre-calculated recs. In other
> words the query to produce the query will be more problematic than the query
> to produce the result, right?
>
> Something here may be "orders of magnitude" faster, but it isn't the total
> elapsed time to return recs at runtime, right? Maybe what you are saying is
> that the time to pre-calculate the recs is 0 since they are calculated at
> runtime, but you still have to create the cooccurrence matrix, so you still
> need something like Mahout on Hadoop to produce a model, and you still need
> to index the model with Solr, and you still need to look up user history at
> runtime. Indexing with Solr is faster than loading a db (8 hours? They are
> doing something wrong), but the query side will be slower unless I've missed
> something.
>
> In any case you *have* introduced a realtime rec calculation. This is able to
> use user history that may be seconds old and not yet reflected in the
> training data (the cooccurrence matrix), and this is very interesting!
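To make the "boosted terms" bit above concrete, here's a rough SolrJ (4.x-style) sketch of turning a user history vector into a weighted-term query. Everything in it is illustrative - the core name ("items"), the field names ("id" and "indicators"), the URL, and the example item IDs and weights are placeholders, not anything from an actual setup:

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class BoostedHistoryQuery {
        public static void main(String[] args) throws Exception {
            // Each indexed doc is one row of the cooccurrence matrix: the doc id
            // is an item, and the "indicators" field holds the items that
            // co-occur with it.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

            // The user's history vector: item ID -> strength of preference.
            // With purchase-only data these would all be 1.0.
            Map<String, Double> history = new LinkedHashMap<String, Double>();
            history.put("item123", 1.0);
            history.put("item456", 2.5);

            // Build a boosted-term query, e.g. indicators:(item123^1.0 item456^2.5)
            StringBuilder q = new StringBuilder("indicators:(");
            for (Map.Entry<String, Double> e : history.entrySet()) {
                q.append(e.getKey()).append('^').append(e.getValue()).append(' ');
            }
            q.append(')');

            SolrQuery query = new SolrQuery(q.toString());
            query.setRows(20); // top 20 recommendations, already score-ordered

            QueryResponse rsp = solr.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }

Payloads are the other knob: you'd index each indicator term with a payload carrying its cooccurrence strength and plug in a payload-aware similarity, but the query-side boosts above are usually enough to get started.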
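And on the history lookup itself: if the profile store is HBase (or M7 via the HBase API), the runtime cost is one Get per request. Again just a sketch - the table name "user_history", the "h" column family, and the one-column-per-item layout are assumptions for illustration:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.NavigableMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HistoryLookup {
        // Returns the user's history vector (item ID -> weight), which then
        // becomes the boosted-term Solr query sketched above.
        public static Map<String, Double> fetchHistory(String userId) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "user_history");
            try {
                // Row key = user ID, one column per item, cell value = weight.
                Result row = table.get(new Get(Bytes.toBytes(userId)));
                NavigableMap<byte[], byte[]> cols = row.getFamilyMap(Bytes.toBytes("h"));
                Map<String, Double> history = new LinkedHashMap<String, Double>();
                if (cols != null) {
                    for (Map.Entry<byte[], byte[]> e : cols.entrySet()) {
                        history.put(Bytes.toString(e.getKey()), Bytes.toDouble(e.getValue()));
                    }
                }
                return history;
            } finally {
                table.close();
            }
        }
    }

The Get is a single random read of a small row, so the "query to produce the query" stays cheap even as the item catalog grows.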
>
>>>
>>> This will scale to thousands or tens of thousands of recommendations per
>>> second against 10's of millions of items. The number of users doesn't
>>> matter.
>>>
>>
>
> Yes, no doubt, but the history lookup is still an issue unless I've missed
> something. The NoSQL queries will scale to tens of thousands of recs against
> 10s of millions of items, but perhaps with larger, more complex
> infrastructure? Not sure how Solr scales.
>
> Being semi-ignorant of Solr, intuition says that it's doing something to
> speed things up, like using only part of the data somewhere to do
> approximations. Have there been any performance comparisons of, say,
> precision of one approach vs the other, or do they return identical results?

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr