I certainly have questions about the architecture mentioned below, but first let me make sure I understand.
You use the user history vector as a query? This will be a list of item IDs and strength-of-preference values (maybe 1s for purchases). The cooccurrence matrix has columns treated like terms and rows treated like documents, though both are really items. Does Solr support weighted term lists as queries, or do you have to throw out strength-of-preference? I ask because there are cases where the query will have non-1.0 values. When the strength values are all 1, the vector is really just a list of terms (item IDs). This technique seems like using a doc as a query, but you have reduced the doc to the form of a vector of weighted terms. I was unaware that Solr allowed weighted term queries. This is really identical to using Solr for fast doc-similarity queries.

>> Using a cooccurrence matrix means you are doing item similarity since
>> there is no user data in the matrix. Or are you talking about using the
>> user history as the query? In which case you have to remember somewhere all
>> users' history and look it up for the query, no?
>
> Yes. You do. And that is the key to making this orders of magnitude
> faster.
>
> But that is generally fairly trivial to do. One option is to keep it in a
> cookie. Another is to use browser persistent storage. Another is to use a
> memory-based user profile database. Yet another is to use M7 tables on
> MapR or HBase on other Hadoop distributions.

Seems like you'd rule out browser-based storage because you need the history to train your next model; at best it would be in addition to a server-based store of history. Another reason you wouldn't rely only on browser storage is that it will occasionally be destroyed, and users span multiple devices these days too. The user history matrix will be quite a bit larger than the user recommendation matrix, maybe an order of magnitude or two larger. I have 20 recs stored for me, but I've purchased hundreds of items and viewed thousands.
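On the weighted-term question: Lucene/Solr query syntax does support per-term boosts via `term^weight`, so strength-of-preference need not be thrown out. A minimal sketch of turning a history vector into such a query (the field name `indicators` and the item IDs are hypothetical, not from any real schema):

```python
# Sketch: a weighted user-history vector expressed as a boosted Solr query.
# Lucene/Solr query syntax allows "term^weight", so non-1.0 strengths survive.
# The field name "indicators" is a made-up example.

def history_to_solr_query(history, field="indicators"):
    """history: {item_id: strength}. Returns a boosted OR query string."""
    clauses = []
    for item_id, strength in sorted(history.items()):
        if strength == 1.0:
            clauses.append(f"{field}:{item_id}")             # plain term
        else:
            clauses.append(f"{field}:{item_id}^{strength}")  # boosted term
    return " OR ".join(clauses)

q = history_to_solr_query({"item17": 1.0, "item42": 2.5})
print(q)  # -> indicators:item17 OR indicators:item42^2.5
```

When all strengths are 1.0 this degenerates to the plain list-of-terms case, which matches the "doc as a query" reading above.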
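To make the storage point concrete, here's a toy stand-in for the role a server-side store (HBase, M7 tables, a memory-based profile db) would play. It illustrates why browser-only storage isn't enough: the same histories must serve both the runtime lookup and the export that trains the next model. Everything here is an illustrative sketch, not any particular store's API:

```python
# Toy server-side user-history store.  Real deployments would back this
# with HBase / M7 tables; the point is that one store serves two consumers:
# the runtime history lookup and the next training run.
from collections import defaultdict

class HistoryStore:
    def __init__(self):
        self._events = defaultdict(dict)  # user_id -> {item_id: strength}

    def record(self, user_id, item_id, strength=1.0):
        self._events[user_id][item_id] = strength

    def history(self, user_id):
        # Runtime lookup: the extra query that precedes the Solr query.
        return dict(self._events.get(user_id, {}))

    def all_interactions(self):
        # Training export: feeds the next cooccurrence-matrix build.
        for user_id, items in self._events.items():
            for item_id, strength in items.items():
                yield user_id, item_id, strength

store = HistoryStore()
store.record("u1", "item42")
store.record("u1", "item17", 2.5)
print(store.history("u1"))  # {'item42': 1.0, 'item17': 2.5}
```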
Given that you have to have the entire user history vector to do the query, given that this is still a lookup from an even larger matrix than the recs/user matrix, and given that you have to do that lookup before the Solr query, it can't be faster than just looking up pre-calculated recs. In other words, the query to produce the query will be more problematic than the query to produce the result, right? Something here may be "orders of magnitude" faster, but it isn't the total elapsed time to return recs at runtime, right?

Maybe what you are saying is that the time to pre-calculate the recs is zero, since they are calculated at runtime. But you still have to create the cooccurrence matrix, so you still need something like Mahout on Hadoop to produce a model, you still need to index the model with Solr, and you still need to look up user history at runtime. Indexing with Solr is faster than loading a db (8 hours? They are doing something wrong), but the query side will be slower unless I've missed something.

In any case you *have* introduced a realtime rec calculation. It is able to use user history that may be seconds old and not yet reflected in the training data (the cooccurrence matrix), and this is very interesting!

>> This will scale to thousands or tens of thousands of recommendations per
>> second against 10's of millions of items. The number of users doesn't
>> matter.

Yes, no doubt, but the history lookup is still an issue unless I've missed something. The NoSQL queries will scale to tens of thousands of recs against 10s of millions of items, but perhaps with larger, more complex infrastructure? Not sure how Solr scales. Being semi-ignorant of Solr, my intuition says it's doing something to speed things up, like using only part of the data somewhere to do approximations. Have there been any performance comparisons, say precision of one approach vs. the other, or do they return identical results?
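To pin down the realtime point, here's a toy version of the whole loop: an offline cooccurrence matrix (the artifact Mahout on Hadoop would produce, reduced here to raw pair counts with no LLR filtering) multiplied at runtime by a fresh history vector. The fresh vector is exactly what lets seconds-old behavior influence the recs even though the matrix itself is hours old. This is a sketch of the math, not what Solr actually executes — Solr approximates this score with term matching and its own ranking:

```python
# Offline step: cooccurrence counts from user baskets (stand-in for the
# Mahout job; real systems would apply LLR to keep only significant pairs).
# Runtime step: score = history-weighted sum of cooccurrence rows.
from collections import defaultdict
from itertools import combinations

def cooccurrence(user_baskets):
    """user_baskets: iterable of sets of item IDs -> {item: {other: count}}."""
    cooc = defaultdict(lambda: defaultdict(int))
    for basket in user_baskets:
        for a, b in combinations(sorted(basket), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def recommend(history, cooc, top_n=3):
    """history: {item: strength}, possibly seconds old and newer than cooc."""
    scores = defaultdict(float)
    for item, strength in history.items():
        for other, count in cooc.get(item, {}).items():
            if other not in history:  # don't re-recommend items already seen
                scores[other] += strength * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}]
cooc = cooccurrence(baskets)           # stale, built offline
print(recommend({"a": 1.0}, cooc))     # fresh history vector -> ['b', 'c']
```

The history lookup that precedes `recommend` is the step I'm questioning above: it has to happen before any scoring, whether the scoring is done by Solr or by a pre-calculated recs table.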