In the interest of getting some empirical data out about various architectures:
On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> ...
>> You use the user history vector as a query?
>
> The most recent suffix of the history vector. How much is used varies by
> the purpose.

We did some experiments with this using a year+ of e-com data. We measured precision using different amounts of the history vector, in 3-month increments. Precision increased throughout the year. At about 9 months the effects of what appears to be item/product/catalog/new-model churn began to become significant, and so precision started to level off. Note that we did *not* filter the recs, so items no longer in the current catalog were still counted when precision was measured. We'd expect filtering them out to improve results when using older data. In our case we never found a good truncation point, though it looked like we were reaching one when the data ran out. Even the last 3 months of data produced a 4.5% better precision score.

>> ...
>> Seems like you'd rule out browser based storage because you need the
>> history to train your next model. At least it would be in addition to a
>> server based storage of history.
>
> Yes. In addition to.
>
>> The user history matrix will be quite a bit larger than the user
>> recommendation matrix, maybe an order or two larger.
>
> I don't think so. And it doesn't matter since this is reduced to
> significant cooccurrence and that is typically quite small compared to a
> list of recommendations for all users.
>
>> I have 20 recs for me stored but I've purchased 100's of items, and have
>> viewed 1000's.
>
> 20 recs is not sufficient. Typically you need 300 for any given context
> and you need to recompute those very frequently. If you use geo-specific
> recommendations, you may need thousands of recommendations to have enough
> geo-dispersion. The search engine approach can handle all of that on the
> fly.
>
> Also, the cached recs are user x (20-300) non-zeros. The sparsified
> item-item cooccurrence matrix is item x 50. Moreover, search engines are
> very good at compression. If users >> items, then item x 50 is much
> smaller, especially after high quality compression (6:1 is a common
> compression ratio).

The end application designed by the e-com customer required fewer than 10 recs for any given context, so 20 gave us room for runtime context-type boosting. Given that precision kept increasing over a full year of user history, and that we only needed to return 20 recs per user and per item, the history matrix was nearly two orders of magnitude larger than the recs matrix (rough arithmetic below). This was with about 5M users and 500K items over a year.

The issue I was asking about was how to store and retrieve history vectors for queries. In our case it looks like some kind of scalable persistence store would be required, and since pre-calculated recs are indeed much smaller...

I fully believe your description of how well search engines store their index. The cooccurrence matrix is already sparsified by a similarity metric, and any compression that Solr does will help keep the index small. In any case Solr supports sharding, so it can scale past one machine anyway.
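To put rough numbers on that size comparison (my back-of-the-envelope; reading "purchased 100's, viewed 1000's" as ~1,000 interactions per user is an assumption):

    recs matrix:     5M users x 20 recs              = 100M non-zeros
    history matrix:  5M users x ~1,000 interactions  = ~5B non-zeros
    ratio:           roughly 50:1, approaching two orders of magnitude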
>> Given that you have to have the entire user history vector to do the query,
>> and given that this is still a lookup from an even larger matrix than the
>> recs/user matrix, and given that you have to do the lookup before the Solr
>> query, it can't be faster than just looking up pre-calculated recs.
>
> None of this applies. There is an item x 50 sized search index. There is
> a recent history that is available without a lookup. All that is required
> is a single Solr query, and that can handle multiple kinds of history and
> geo-location and user search terms all in a single step.

Yes, with a search engine the index is very small, but the history vectors are not. Actually, I wonder how well Solr would handle a very large query. Is truncation of the history vector perhaps required for that reason? (There is a sketch of what such a query might look like at the end of this mail.)

>> Something here may be "orders of magnitude" faster, but it isn't the total
>> elapsed time to return recs at runtime, right?
>
> Actually, it is. Round trip of less than 10ms is common. Precalculation
> goes away. Export of recs nearly goes away. Currency of recommendations
> is much higher.

This is certainly great performance, no doubt. Using a 12-node Cassandra ring (each machine had 16G of memory) spread across two geo-locations, we got 24,000 tps, degrading to a worst case of 5,000 tps. The average response time for the entire system (which included two internal service layers and one query to Cassandra) was 5-10ms per request.

>> Maybe what you are saying is the time to pre-calculate the recs is 0 since
>> they are calculated at runtime, but you still have to create the
>> cooccurrence matrix, so you still need something like Mahout on Hadoop to
>> produce a model, and you still need to index the model with Solr, and you
>> still need to look up user history at runtime. Indexing with Solr is faster
>> than loading a db (8 hours? They are doing something wrong) but the query
>> side will be slower unless I've missed something.
>
> I am pretty sure you have. The customers are definitely not dopes. The
> problem is that precalculated recs are much, much bigger due to geo
> constraints.

Not sure about geo-constraints; we did not consider these.

There were cases where Cassandra would inexplicably bog down and produce bizarrely slow tps, but they were rare and the subject of eradication efforts. The people responsible thought they were related to some internal journal cleanup or compaction and were trying to fix the problem. These are the bane of any large complex system, to be sure. Still, the system's performance was very high.

I personally wrote the pre-calculated recs to a dev Cassandra instance: 5M users and 500K items, so 5.5M rows including all created recs (20+) from a Mahout item-based recommender, in 3 hours. Dismally slow, I know, but the writing was done from a single process reading from HDFS and writing to a single-machine Cassandra 'cluster'. The obvious speedups would be to load the db using Hadoop and to use a high-performance multi-machine ring.

Since recs are pre-calculated and we know how large they are (5.5M x 20 = 110M values), it would be simple enough to put them into memory if a further speedup were required. But in our case we had to get some user-specific things from Cassandra anyway (user login name, profile info, etc.), so the query by user had to be made regardless. Whatever was stored under the user's key was virtually free (see the lookup sketch at the end of this mail).

Anyway, that's just one case study; hopefully it will help someone decide on an architecture based on their own resources and requirements.
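P.S. To make the search-engine approach concrete, here is a minimal SolrJ sketch of the kind of single query described above: the most recent suffix of the user's history becomes the query against an index with one document per item, whose "indicators" field holds that item's ~50 significantly cooccurring item ids. The core name, field names, and geo filter are my assumptions for illustration, not anything from an actual deployment.

    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class HistoryQuerySketch {
      // One Solr doc per item; "indicators" holds cooccurring item ids.
      private final SolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/items");

      public void printRecs(List<String> recentHistory, String geoFq)
          throws SolrServerException {
        // The recent history suffix is the query; items whose indicator
        // lists overlap it most score highest.
        StringBuilder terms = new StringBuilder();
        for (String itemId : recentHistory) {
          if (terms.length() > 0) terms.append(' ');
          terms.append(itemId);
        }
        SolrQuery q = new SolrQuery();
        q.setQuery("indicators:(" + terms + ")");
        if (geoFq != null) {
          q.addFilterQuery(geoFq); // e.g. "region:pnw" (hypothetical field)
        }
        q.setFields("id", "score");
        q.setRows(20); // enough recs to allow context-type boosting later
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("id")
              + " score=" + doc.getFieldValue("score"));
        }
      }
    }

The point being that a user's search terms, geo filter, and different kinds of history are all just additional clauses on the same query, which is why no separate lookup step is needed beyond having the recent history at hand.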
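P.P.S. And for the pre-calculated-recs side, a sketch of the "virtually free" read: since we had to fetch the user's profile row anyway, the rec list can simply live in the same row. This uses the DataStax Java driver and an invented table layout (a user_recs table with a rec_ids list column); our actual system went through internal service layers, so treat this purely as illustration.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class RecsLookupSketch {
      public static void main(String[] args) {
        Cluster cluster =
            Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ecom"); // hypothetical keyspace

        // One row per user: profile fields plus the pre-calculated rec
        // list, so recs ride along with the profile read we make anyway.
        Row row = session.execute(
            "SELECT login_name, rec_ids FROM user_recs"
                + " WHERE user_id = 'u12345'").one();
        if (row != null) {
          System.out.println("user: " + row.getString("login_name"));
          System.out.println("recs: " + row.getList("rec_ids", String.class));
        }
        cluster.shutdown();
      }
    }

And as noted above, at 5.5M rows x 20 ids the whole recs table is small enough to cache in memory if the db read ever became the bottleneck.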