Re: Which database should I use with Mahout
I think the simplest implementation is to just get extra results from the recommender and rescore after the rough retrieval. Integrating this into the actual scoring engine is very hard since it depends on global characteristics of the final result set. The same applies to result-set clustering.

On Wed, May 22, 2013 at 9:51 PM, Johannes Schulte <johannes.schu...@gmail.com> wrote:

> Yeah, that is what I had in mind as a simple solution. For examining bigger
> result sets I always fear the cost of loading a lot of stored fields;
> that's why I thought including it in the scoring might be cool. It's not
> possible with a plain Collector that maintains a priority queue of docid
> and score, but there should be some smart way to maintain a coordinate of
> the top-ranking items so far. Anyway, that's a different story.
>
> Thanks for the help so far!
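The "over-fetch and rescore" approach here, combined with the anti-flood penalty Ted describes later in the thread, might look roughly like this. It is a sketch with placeholder names; the similarity function and penalty weight are assumptions, not anything from Mahout or Solr:

```python
# Sketch: over-fetch candidates from the recommender, then greedily re-rank,
# penalizing each candidate by its similarity to items already chosen above
# it ("anti-flood"). `similarity` and `penalty` are illustrative placeholders.

def rerank(candidates, similarity, penalty=0.5, k=10):
    """candidates: list of (item, score) pairs, highest score first."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def adjusted(entry):
            item, score = entry
            # subtract a penalty for every already-ranked item it is close to
            return score - penalty * sum(similarity(item, s) for s, _ in selected)
        best = max(pool, key=adjusted)
        pool.remove(best)
        selected.append((best[0], adjusted(best)))
    return [item for item, _ in selected]
```

Because the rescoring only sees the over-fetched candidate list, the global result-set characteristics Ted mentions stay outside the core scoring engine, which is the point of doing it after retrieval.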
Re: Which database should I use with Mahout
Yeah, that is what I had in mind as a simple solution. For examining bigger result sets I always fear the cost of loading a lot of stored fields; that's why I thought including it in the scoring might be cool. It's not possible with a plain Collector that maintains a priority queue of docid and score, but there should be some smart way to maintain a coordinate of the top-ranking items so far. Anyway, that's a different story.

Thanks for the help so far!

On Thu, May 23, 2013 at 2:14 AM, Ted Dunning wrote:

> Yes, what you are describing with diversification is something that I have
> called anti-flood. It comes from the fact that we really are optimizing a
> portfolio of recommendations rather than a batch of independent
> recommendations. Doing this from first principles is very hard, but there
> are very simple heuristics that do well in many practical situations. For
> instance, simply penalizing the score of items based on how many items
> ranked above them that they are excessively close to does wonders.
Re: Which database should I use with Mahout
Yes, what you are describing with diversification is something that I have called anti-flood. It comes from the fact that we really are optimizing a portfolio of recommendations rather than a batch of independent recommendations. Doing this from first principles is very hard, but there are very simple heuristics that do well in many practical situations. For instance, simply penalizing the score of items based on how many items ranked above them that they are excessively close to does wonders.

Sent from my iPhone

On May 22, 2013, at 13:04, Johannes Schulte wrote:

> Okay, I got it! I have also always used a basic form of dithering, but we
> always called it shuffle, since it basically was/is Collections.shuffle on
> a bigger list of results and therefore doesn't take the rank or score into
> account. Will try that.
>
> With diversification I really meant something that guarantees that you
> maximize the intra-item distance (dithering often does, but not on
> purpose).
Re: Which database should I use with Mahout
This data was for a mobile shopping app. Other answers below.

> On May 21, 2013, at 5:42 PM, Ted Dunning wrote:
>
> This is a time filter. How many transactions did this turn out to be? I
> typically recommend truncating based on transactions rather than time.
>
> My own experience was music and video recommendations. Long history
> definitely did not help much there.

This is what I've heard before; recommending music and media has its own set of characteristics. We were at the point of looking at history on segments of the catalog (music vs. food, etc.) to do the same analysis. I suspect we would have found what you are saying. It would be a bit of processing to save only so many transactions for each user, certainly not impossible. Logs come in by time increments, so we got new ones periodically but didn't count each user's transactions. In any case, the item-similarity type recs were quite a bit more predictive of purchases than those based on user-specific history, which changes the requirements a bit. We always measured precision on both history- and similarity-based recs, though, and a blend of both got the best score. I don't have access to the data now--I sure wish we had some to share so these issues could be investigated and compared.

> And how do you generate the next page of results?

It was a mobile app, so it did not have a next page of results; that was an app design issue we had no control over. Actually, Amazon on their web site uses only 100 in a horizontally scrolling strip, and we had much less space to fill. But regarding saving more recs--most of my experience was in storing the entire recommendation matrix for a slightly different purpose. I was working on building the cross-recommender, which (as you know) is an ensemble of models where you have to learn weights for each part of the model. To do a linear combination, without knowing the weights ahead of time, you need virtually all recs for each query. I never got to finish that, so I've reproduced the code but now, as I said, lack the data.

From all the research I did into the predictive power of various actions, the cross-recommender seemed to hold the most promise for cleaning one action using another, more predictive action. If I can prove this out, then a whole range of multi-action chains present themselves. At the very least we have the framework for creating and learning the weights for small ensembles.
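The linear combination Pat describes for the cross-recommender ensemble can be sketched as a weighted merge of per-model score maps. The function and data shapes here are illustrative assumptions, not code from that project:

```python
# Sketch of a linear blend across an ensemble of recommenders: each model
# contributes a {item: score} map, and scores are combined with per-model
# weights before the final ranking. Weights here are assumed to be learned
# elsewhere (e.g. against held-out purchases).

def blend(score_lists, weights, k=10):
    """score_lists: list of {item: score} dicts, one per model."""
    combined = {}
    for scores, w in zip(score_lists, weights):
        for item, s in scores.items():
            combined[item] = combined.get(item, 0.0) + w * s
    return sorted(combined, key=combined.get, reverse=True)[:k]
```

This is why Pat needs "virtually all recs for each query": the blend can only promote an item if every model's score for it is still available at combination time.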
Re: Which database should I use with Mahout
Okay, I got it! I have also always used a basic form of dithering, but we always called it shuffle, since it basically was/is Collections.shuffle on a bigger list of results and therefore doesn't take the rank or score into account. Will try that.

With diversification I really meant something that guarantees that you maximize the intra-item distance (dithering often does, but not on purpose). The theory is, if I remember correctly, that if a user goes over the list from top to bottom and he didn't like the first item (which is assumed to be true if he looks at the second item; no idea if we really work that way), it makes sense to take a new shot with something completely different, and so on for the next items. I think it's also a topic for search engine results, and I remember something like "Lipschitz bandits" from my googling, but that is way above what I am capable of doing. I just recognized that both Amazon and Netflix present multiple recommendation lists grouped by category, so in a way it's similar to search engine result clustering.

On Wed, May 22, 2013 at 8:30 AM, Ted Dunning wrote:

> Dithering is constructive mixing of the recommendation results. The idea
> is to reorder the top results only slightly and the deeper results more
> so. There are several good effects and one (slightly) bad one.
Re: Which database should I use with Mahout
On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <johannes.schu...@gmail.com> wrote:

> Thanks for the list... as a non-native speaker I have problems
> understanding the meaning of dithering here.

Sorry about that. Your English is good enough that I hadn't noticed any deficit.

Dithering is constructive mixing of the recommendation results. The idea is to reorder the top results only slightly and the deeper results more so. There are several good effects and one (slightly) bad one.

The good effects are:

a) there is much less of a sharp boundary at the end of the first page of results. This makes the statistics of the recommender work better and also helps the recommender not get stuck recommending just the things which already appear on the first page.

b) results that are very deep in the results can still be shown occasionally. This means that if the rec engine has even a hint that something is good, it has a chance of increasing the ranking by gathering more data. This is a bit different from (a).

c) (bonus benefit) users like seeing novel things. Even if they have done nothing to change their recommendations, they like seeing that you have changed something, so they keep coming back to the recommendation page.

The major bad effect is that you are purposely decreasing relevance in the short term in order to get more information that will improve relevance in the long term. The improvements dramatically outweigh this small problem.

> I got the feeling that somewhere between a) and d) there is also
> diversification of items in the recommendation list, so increasing the
> distance between the list items according to some metric like tf/idf on
> item information. Never tried that, but with Lucene/Solr it should be
> possible to use this information during scoring.

Yes. But no.

This can be done at the presentation tier entirely. I often do it by defining a score based solely on rank, typically something like log(r). I add small amounts of noise to this synthetic score, often distributed exponentially with a small mean. Then I sort the results according to this sum.

Here are some simulated results computed using R:

> (order((log(r) - runif(500, max=2)))[1:20])
[1] 1 2 3 6 5 4 14 9 8 10 7 17 11 15 13 22 28 12 20 39
[1] 1 2 5 3 4 8 6 10 9 16 24 31 20 30 13 18 7 14 36 38
[1] 3 1 5 2 10 4 8 7 14 21 19 26 29 13 27 15 6 12 33 9
[1] 1 2 5 3 6 17 4 20 18 7 19 9 25 8 29 21 15 27 28 12
[1] 1 2 5 3 7 4 8 11 9 15 10 6 33 37 17 27 36 16 34 38
[1] 1 4 2 5 9 3 14 13 12 17 22 25 7 15 18 36 16 6 20 29
[1] 1 3 4 7 2 6 5 12 18 17 13 24 27 10 8 20 14 34 9 46
[1] 3 1 2 6 12 8 7 5 4 19 11 26 10 15 28 35 9 20 42 25

As you can see, the first four results are commonly single digits. This comes about because the uniform noise that I have subtracted from the log can only make a difference of 2 in the log, which is equivalent to changing the rank by a factor of about 7 (e^2 ≈ 7.4). If we were to use different noise distributions we would get somewhat different kinds of perturbation. For instance, using exponentially distributed noise gives mostly tame results with some real surprises:

> (order((log(r) - 0.3*rexp(500)))[1:20])
[1] 1 2 3 8 4 5 9 6 7 25 14 11 13 24 10 31 34 12 22 21
[1] 1 2 5 4 3 6 7 12 8 10 9 17 13 11 14 25 64 15 47 19
[1] 1 2 3 4 5 6 7 10 8 9 11 21 13 12 15 16 14 25 18 33
[1] 1 2 3 10 4 5 7 14 6 8 13 9 15 25 16 11 20 12 17 54
[1] 1 3 2 4 7 5 6 8 11 23 9 32 18 10 13 15 12 48 14 19
[1] 1 3 2 4 5 10 12 6 9 7 8 18 16 17 11 13 25 14 15 19
[1] 6 1 2 4 3 5 9 11 7 15 8 10 14 12 19 16 13 25 39 18
[1] 1 2 3 4 30 5 7 6 9 8 16 11 10 15 12 13 37 14 31 23
[1] 1 2 3 4 9 16 5 6 8 7 10 13 11 17 15 19 12 20 14 26
[1] 1 2 3 13 5 4 7 6 8 15 12 11 9 10 36 14 24 70 19 16
[1] 1 2 6 3 5 4 11 22 7 9 250 8 10 15 12 17 13 40 16 14
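Ted's rank-based dithering translates directly into a few lines of Python. This is a hedged sketch: it assumes, as the R snippet appears to, that `r` is the vector of ranks 1..n, and the uniform noise bound of 2 is just his example value, not a recommendation:

```python
import math
import random

def dither(ranked_items, scale=2.0, seed=None):
    """Reorder results by log(rank) minus uniform noise, as in the R example.

    Top items move only slightly; deeper items move more, because noise of
    at most `scale` can offset a log-rank difference of at most e**scale
    (about a factor of 7 in rank when scale = 2).
    """
    rng = random.Random(seed)
    keyed = [(math.log(r) - rng.uniform(0, scale), item)
             for r, item in enumerate(ranked_items, start=1)]
    return [item for _, item in sorted(keyed)]
```

Each call (or each new seed) produces a different order over the same items, which is what gives every result some chance of exposure over time.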
Re: Which database should I use with Mahout
Thanks for the list... as a non-native speaker I have problems understanding the meaning of dithering here.

I got the feeling that somewhere between a) and d) there is also diversification of items in the recommendation list, so increasing the distance between the list items according to some metric like tf/idf on item information. Never tried that, but with Lucene/Solr it should be possible to use this information during scoring.

Have a nice day

On Wed, May 22, 2013 at 2:30 AM, Ted Dunning wrote:

> I have so far just used the weights that Solr applies natively.
>
> In my experience, what makes a recommendation engine work better is, in
> order of importance:
>
> a) dithering so that you gather wider data
>
> b) using multiple sources of input
>
> c) returning results quickly and reliably
>
> d) the actual algorithm or weighting scheme
>
> If you can cover items a-c in a real business, you are very lucky. The
> search engine approach handles (b) and (c) by nature, which massively
> improves the likelihood of ever getting to examine (d).
>
> On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte
> <johannes.schu...@gmail.com> wrote:
>
> > Thanks! Could you also add how to learn the weights you talked about,
> > or at least a hint? Learning weights for search engine query terms
> > always sounds like "learning to rank" to me, but this always seemed
> > pretty complicated and I never managed to try it out.
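Johannes's quoted question about how to learn the weights never gets a detailed answer in the thread. One simple option, sketched here with made-up stand-in models and data (this is not Solr's or Mahout's mechanism), is a grid search over blend weights scored by precision@k on held-out interactions:

```python
# Hypothetical sketch: learn a single blend weight between two recommenders
# by grid search, keeping the weight with the best precision@k on held-out
# data. Models and data shapes are illustrative assumptions.

def precision_at_k(recommended, relevant, k=10):
    return sum(1 for item in recommended[:k] if item in relevant) / k

def learn_weight(model_a, model_b, heldout, k=10, steps=11):
    """model_*: {user: {item: score}}; heldout: {user: set of relevant items}."""
    best_w, best_p = 0.0, -1.0
    for i in range(steps):
        w = i / (steps - 1)  # candidate weight for model_a; model_b gets 1 - w
        total = 0.0
        for user, relevant in heldout.items():
            combined = {}
            for item, s in model_a.get(user, {}).items():
                combined[item] = combined.get(item, 0.0) + w * s
            for item, s in model_b.get(user, {}).items():
                combined[item] = combined.get(item, 0.0) + (1 - w) * s
            ranked = sorted(combined, key=combined.get, reverse=True)
            total += precision_at_k(ranked, relevant, k)
        p = total / len(heldout)
        if p > best_p:
            best_w, best_p = w, p
    return best_w
```

Full learning-to-rank machinery, as Johannes notes, is considerably more involved; a coarse search like this is the low-effort end of the same idea.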
Re: Which database should I use with Mahout
Inline On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel wrote: > In the interest of getting some empirical data out about various > architectures: > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel wrote: > > >> ... > >> You use the user history vector as a query? > > > > The most recent suffix of the history vector. How much is used varies by > > the purpose. > > We did some experiments with this using a year+ of e-com data. We measured > the precision using different amounts of the history vector in 3 month > increments. The precision increased throughout the year. At about 9 months > the affects of what appears to be item/product/catalog/new model churn > begin to become significant and so precision started to level off. We did > *not* apply a filter to recs so that items not in the current catalog were > not filtered before precision was measured. We'd expect this to improve > results using older data. > This is a time filter. How many transactions did this turn out to be. I typically recommend truncating based on transactions rather than time. My own experience was music and video recommendations. Long history definitely did not help much there. > > > > 20 recs is not sufficient. Typically you need 300 for any given context > > and you need to recompute those very frequently. If you use geo-specific > > recommendations, you may need thousands of recommendations to have enough > > geo-dispersion. The search engine approach can handle all of that on the > > fly. > > > > Also, the cached recs are user x (20-300) non-zeros. The sparsified > > item-item cooccurrence matrix is item x 50. Moreover, search engines are > > very good at compression. If users >> items, then item x 50 is much > > smaller, especially after high quality compression (6:1 is a common > > compression ratio). > > > > The end application designed by the ecom customer required less than 10 > recs for any given context so 20 gave us of room for runtime context type > boosting. 
> And how do you generate the next page of results? > Given that precision increased for a year of user history and that we > needed to return 20 recs per user and per item, the history matrix was > nearly 2 orders of magnitude larger than the recs matrix. This was with > about 5M users and 500K items over a year. The history matrix should be at most 2.5 T bits = 300GB. Remember, this is a binary matrix that is relatively sparse so I would expect that a typical size would be more like a gigabyte. > The issue I was asking about was how to store and retrieve history vectors > for queries. In our case it looks like some kind of scalable persistence > store would be required and since pre-calculated recs are indeed much > smaller... > I am still confused about this assertion. I think that you need <500 history items per person which is about 500 * 19 bits < 1.3KB / user. I also think that you need 100 or more recs if you prestore them which is also in the kilobyte range. This doesn't sound all that different. And then the search index needs to store 500K x 50 nonzeros = 100 MB. This is much smaller than either the history or the prestored recommendations even before any compression. > Yes using a search engine the index is very small but the history vectors > are not. Actually I wonder how well Solr would handle a large query? Is the > truncation of the history vector required perhaps? > The history vector is rarely more than a hundred terms which is not that large a query. > > Actually, it is. Round trip of less than 10ms is common. Precalculation > > goes away. Export of recs nearly goes away. Currency of recommendations > > is much higher. > > This is certainly great performance, no doubt. Using a 12 node Cassandra > ring (each machine had 16G of memory) spread across two geolocations we got > 24,000 tps down to a worst case of 5000 tps. 
The average response for the entire > system (which included two internal service layers and one query to > cassandra) was 5-10ms per response. > Uh... my numbers are for a single node. Query rates are typically about 1-2K queries/second so the speed is comparable.
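The sizes traded back and forth above are easy to check with simple arithmetic. A back-of-envelope sketch (constants taken from the thread; the 4 bytes per postings entry is an assumption, not a measured Solr figure):

```python
import math

USERS = 5_000_000
ITEMS = 500_000
HISTORY_PER_USER = 500    # truncated transaction history per user
NONZEROS_PER_ITEM = 50    # sparsified cooccurrence row width

# Bits needed to name one of 500K items.
item_bits = math.ceil(math.log2(ITEMS))                # 19 bits

# Dense binary history matrix -- the worst-case bound quoted above.
dense_gb = USERS * ITEMS / 8 / 1e9                     # ~312 GB

# Sparse per-user history: just item IDs.
per_user_kb = HISTORY_PER_USER * item_bits / 8 / 1024  # ~1.2 KB

# Search index: 500K items x 50 indicators, assuming 4 bytes per entry.
index_mb = ITEMS * NONZEROS_PER_ITEM * 4 / 1e6         # 100 MB

print(f"dense history: {dense_gb:.0f} GB, "
      f"per-user history: {per_user_kb:.2f} KB, "
      f"index: {index_mb:.0f} MB")
```

The point of the exercise is the same as Ted's: the index (item x 50) is far smaller than either per-user histories or prestored recs once users greatly outnumber items, even before compression.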
Re: Which database should I use with Mahout
I have so far just used the weights that Solr applies natively. In my experience, what makes a recommendation engine work better is, in order of importance, a) dithering so that you gather wider data, b) using multiple sources of input, c) returning results quickly and reliably, and d) the actual algorithm or weighting scheme. If you can cover items a-c in a real business, you are very lucky. The search engine approach handles (b) and (c) by nature which massively improves the likelihood of ever getting to examine (d). On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte < johannes.schu...@gmail.com> wrote: > Thanks! Could you also add how to learn the weights you talked about, or at > least a hint? Learning weights for search engine query terms always sounds > like "learning to rank" to me but this always seemed pretty complicated > and I never managed to try it out.. > >
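Point (a) above can be made concrete. One common form of dithering (a hypothetical sketch, not anything shipped in Mahout or Solr) perturbs the log of each result's rank with Gaussian noise and re-sorts, so the top of the list mostly survives while the tail gets mixed up into view:

```python
import math
import random

def dither(results, epsilon=0.3, seed=None):
    """Reorder a ranked result list by noisy log-rank.

    epsilon is a hypothetical tuning knob: 0 leaves the order alone;
    larger values mix deeper results up toward the top, so the engine
    gathers feedback on a wider slice of the catalog over time.
    """
    rng = random.Random(seed)
    noisy = sorted(
        enumerate(results),
        key=lambda ir: math.log(ir[0] + 1) + rng.gauss(0, epsilon),
    )
    return [item for _, item in noisy]

ranked = [f"item{i}" for i in range(20)]
print(dither(ranked, seed=1)[:5])  # a slightly shuffled head of the list
```

Because the noise is applied to log-rank, position 1 vs 2 is perturbed much more readily than position 100 vs 101 in relative terms, which is what distinguishes this from the plain `Collections.shuffle` approach mentioned later in the thread.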
Re: Which database should I use with Mahout
In the interest of getting some empirical data out about various architectures: On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel wrote: >> ... >> You use the user history vector as a query? > > The most recent suffix of the history vector. How much is used varies by > the purpose. We did some experiments with this using a year+ of e-com data. We measured the precision using different amounts of the history vector in 3 month increments. The precision increased throughout the year. At about 9 months the effects of what appears to be item/product/catalog/new model churn begin to become significant and so precision started to level off. We did *not* apply a filter to recs so that items not in the current catalog were not filtered before precision was measured. We'd expect this to improve results using older data. In our case we never found a good truncation point though it looked like we were reaching one when the data ran out. Even the last 3 months produced a 4.5% better precision score. > >> ... >> Seems like you'd rule out browser based storage because you need the >> history to train your next model. At least it would be in addition to a >> server based storage of history. > > Yes. In addition to. > >> The user history matrix will be quite a bit larger than the user >> recommendation matrix, maybe an order or two larger. > > > I don't think so. And it doesn't matter since this is reduced to > significant cooccurrence and that is typically quite small compared to a > list of recommendations for all users. > >> I have 20 recs for me stored but I've purchased 100's of items, and have >> viewed 1000's. >> > > 20 recs is not sufficient. Typically you need 300 for any given context > and you need to recompute those very frequently. If you use geo-specific > recommendations, you may need thousands of recommendations to have enough > geo-dispersion. The search engine approach can handle all of that on the > fly. > > Also, the cached recs are user x (20-300) non-zeros. 
The sparsified > item-item cooccurrence matrix is item x 50. Moreover, search engines are > very good at compression. If users >> items, then item x 50 is much > smaller, especially after high quality compression (6:1 is a common > compression ratio). > The end application designed by the ecom customer required less than 10 recs for any given context so 20 gave us room for runtime context type boosting. Given that precision increased for a year of user history and that we needed to return 20 recs per user and per item, the history matrix was nearly 2 orders of magnitude larger than the recs matrix. This was with about 5M users and 500K items over a year. The issue I was asking about was how to store and retrieve history vectors for queries. In our case it looks like some kind of scalable persistence store would be required and since pre-calculated recs are indeed much smaller... I fully believe your description of how well search engines store their index. The cooccurrence matrix is already sparsified by a similarity metric and any compression that Solr does will help keep the index small. In any case Solr does sharding so it can scale past one machine anyway. >> >> Given that you have to have the entire user history vector to do the query >> and given that this is still a lookup from an even larger matrix than the >> recs/user matrix and given that you have to do the lookup before the Solr >> query, it can't be faster than just looking up pre-calculated recs. > > > None of this applies. There is an item x 50 sized search index. There is > a recent history that is available without a lookup. All that is required > is a single Solr query and that can handle multiple kinds of history and > geo-location and user search terms all in a single step. > Yes using a search engine the index is very small but the history vectors are not. Actually I wonder how well Solr would handle a large query? Is the truncation of the history vector required perhaps? 
>> Something here may be "orders of magnitude" faster, but it isn't the total >> elapsed time to return recs at runtime, right? > > > Actually, it is. Round trip of less than 10ms is common. Precalculation > goes away. Export of recs nearly goes away. Currency of recommendations > is much higher. This is certainly great performance, no doubt. Using a 12 node Cassandra ring (each machine had 16G of memory) spread across two geolocations we got 24,000 tps down to a worst case of 5000 tps. The average response for the entire system (which included two internal service layers and one query to cassandra) was 5-10ms per response. >> >> Maybe what you are saying is the time to pre-calculate the recs is 0 since >> they are calculated at runtime but you still have to create the >> cooccurrence matrix so you still need something like mahout hadoop to >> produce a model and you still need to index the model with Solr and you >> still need to lookup user his
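Ted's recommendation earlier in the thread, truncating history by transaction count rather than by time window, is a one-liner in practice. A sketch (the cap of 500 matches the per-user history size he estimates; the event representation is an assumption):

```python
def truncate_history(events, max_transactions=500):
    """Keep only the most recent N transactions of a user history.

    events is a list of (timestamp, item_id) pairs; the newest
    max_transactions item IDs become the query suffix sent to the
    search engine, regardless of how old they are in wall-clock time.
    """
    events = sorted(events, key=lambda e: e[0])
    return [item for _, item in events[-max_transactions:]]

history = [(3, "item_c"), (1, "item_a"), (2, "item_b")]
print(truncate_history(history, max_transactions=2))  # ['item_b', 'item_c']
```

Unlike a 3-month or 9-month time filter, this bounds both storage and query size per user even for very active users, and never empties the history for inactive ones.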
Re: Which database should I use with Mahout
Thanks! Could you also add how to learn the weights you talked about, or at least a hint? Learning weights for search engine query terms always sounds like "learning to rank" to me but this always seemed pretty complicated and I never managed to try it out.. On Tue, May 21, 2013 at 8:01 AM, Ted Dunning wrote: > Johannes, > > Your summary is good. > > I would add that the precalculated recommendations can be large enough that > the lookup becomes more expensive. Your point about staleness is very > on-point. > > > On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte < > johannes.schu...@gmail.com> wrote: > > > I think Pat is just saying that > > time(history_lookup) (1) + time (recommendation_calculation) (2) > > > time(precalc_lookup) (3) > > > > since 1 and 3 are assumed to be served by the same system class (key > value > > store, db) with a single key and 2 > 0. > > > > Ted is using a lot of information that is available at recommendation time > > and not fetched from somewhere ("context of delivery", geolocation). > The > > question remaining is why the recent history is available without a > lookup, > > which can only be the case if the recommendation calculation is embedded > in > > a bigger request cycle where the history is loaded somewhere else, or it's just > > stored in the browser. > > > > if you would store the classical (netflix/mahout) user-item history in > the > > browser and use a disk matrix thing like lucene for calculation you would > > end up in the same range. > > > > I think the points are more: > > > > > > 1. Having more inputs than the classical item-interactions > > (geolocation->item,search_term->item ..) can be very easily carried out > > with a search index storing these precalculated "association rules" > > > > 2. 
Precalculation per user is heavyweight, stale and hard to do if the > > context also plays a role (e.g. the site the user is on, because you have to > have > > the cartesian product of recommendations prepared for every user), while > > the "real time" approach can handle it > > > > > > > > > > > > On Tue, May 21, 2013 at 2:00 AM, Ted Dunning > > wrote: > > > Inline answers. > > > > > > > > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel > > wrote: > > > > > > > ... > > > > You use the user history vector as a query? > > > > > > > > > The most recent suffix of the history vector. How much is used varies > by > > > the purpose. > > > > > > > > > > This will be a list of item IDs and strength-of-preference values > > (maybe > > > > 1s for purchases). > > > > > > > > > Just a list of item x action codes. No strength needed. If you have 5 > > > point ratings, then you can have 5 actions for each item. The > weighting > > > for each action can be learned. > > > > > > > > > > The cooccurrence matrix has columns treated like terms and rows > treated > > > > like documents though both are really items. > > > > > > > > > Well, they are different. The rows are fields within documents > > associated > > > with an item. Other fields include ID and other things. The contents > of > > > the field are the codes associated with the item-action pairs for each > > > non-null column. Usually there is only one action so this reduces to a > > > single column per item. > > > > > > > > > > > > > > > > Does Solr support weighted term lists as queries or do you have to > > throw > > > > out strength-of-preference? > > > > > > > > > I prefer to throw it out even though Solr would not require me to do > so. > > > The weights that I want can be encoded in the document index in any > > case. > > > > > > > > > > I ask because there are cases where the query will have non '1.0' > > values. > > > > When the strength values are just 1 the vector is really only a list > of > > > > terms (item IDs). 
> > > > > > > > > > I really don't know of any cases where this is really true. There are > > > actions that are categorical. I like to separate them out or to reduce > > to > > > a binary case. > > > > > > > > > > > > > > This technique seems like using a doc as a query but you have reduced > > the > > > > doc to the form of a vector of weighted terms. I was unaware that > Solr > > > > allowed weighted term queries. This is really identical to using Solr > > for > > > > fast doc similarity queries. > > > > > > > > > > It is really more like an ordinary query. Typical recommendation > queries > > > are short since they are only recent history. > > > > > > > > > > > > > > ... > > > > > > > > Seems like you'd rule out browser based storage because you need the > > > > history to train your next model. > > > > > > > > > Nothing says that we can't store data in two places according to use. > > > Browser history is good for the part of the history that becomes the > > > query. Central storage is good for the mass of history that becomes > > input > > > for analytics. > > > > > > At least it would be in addition to a server based storage of history. > > > > > > > > > Yes. In addition to. > > > > > > > >
Re: Which database should I use with Mahout
Johannes, Your summary is good. I would add that the precalculated recommendations can be large enough that the lookup becomes more expensive. Your point about staleness is very on-point. On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte < johannes.schu...@gmail.com> wrote: > I think Pat is just saying that > time(history_lookup) (1) + time (recommendation_calculation) (2) > > time(precalc_lookup) (3) > > since 1 and 3 are assumed to be served by the same system class (key value > store, db) with a single key and 2 > 0. > > Ted is using a lot of information that is available at recommendation time > and not fetched from somewhere ("context of delivery", geolocation). The > question remaining is why the recent history is available without a lookup, > which can only be the case if the recommendation calculation is embedded in > a bigger request cycle where the history is loaded somewhere else, or it's just > stored in the browser. > > if you would store the classical (netflix/mahout) user-item history in the > browser and use a disk matrix thing like lucene for calculation you would > end up in the same range. > > I think the points are more: > > > 1. Having more inputs than the classical item-interactions > (geolocation->item,search_term->item ..) can be very easily carried out > with a search index storing these precalculated "association rules" > > 2. Precalculation per user is heavyweight, stale and hard to do if the > context also plays a role (e.g. the site the user is on, because you have to have > the cartesian product of recommendations prepared for every user), while > the "real time" approach can handle it > > > > > > On Tue, May 21, 2013 at 2:00 AM, Ted Dunning > > wrote: > > > Inline answers. > > > > > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel > > wrote: > > > > > ... > > > You use the user history vector as a query? > > > > > > The most recent suffix of the history vector. How much is used varies by > > the purpose. 
> > > > > > > This will be a list of item IDs and strength-of-preference values > (maybe > > > 1s for purchases). > > > > > > Just a list of item x action codes. No strength needed. If you have 5 > > point ratings, then you can have 5 actions for each item. The weighting > > for each action can be learned. > > > > > > > The cooccurrence matrix has columns treated like terms and rows treated > > > like documents though both are really items. > > > > > > Well, they are different. The rows are fields within documents > associated > > with an item. Other fields include ID and other things. The contents of > > the field are the codes associated with the item-action pairs for each > > non-null column. Usually there is only one action so this reduces to a > > single column per item. > > > > > > > > > > > Does Solr support weighted term lists as queries or do you have to > throw > > > out strength-of-preference? > > > > > > I prefer to throw it out even though Solr would not require me to do so. > > The weights that I want can be encoded in the document index in any > case. > > > > > > > I ask because there are cases where the query will have non '1.0' > values. > > > When the strength values are just 1 the vector is really only a list of > > > terms (item IDs). > > > > > > > I really don't know of any cases where this is really true. There are > > actions that are categorical. I like to separate them out or to reduce > to > > a binary case. > > > > > > > > > > This technique seems like using a doc as a query but you have reduced > the > > > doc to the form of a vector of weighted terms. I was unaware that Solr > > > allowed weighted term queries. This is really identical to using Solr > for > > > fast doc similarity queries. > > > > > > > It is really more like an ordinary query. Typical recommendation queries > > are short since they are only recent history. > > > > > > > > > > ... 
> > > > > > Seems like you'd rule out browser based storage because you need the > > > history to train your next model. > > > > > > Nothing says that we can't store data in two places according to use. > > Browser history is good for the part of the history that becomes the > > query. Central storage is good for the mass of history that becomes > input > > for analytics. > > > > At least it would be in addition to a server based storage of history. > > > > > > Yes. In addition to. > > > > > > > Another reason you wouldn't rely only on a browser storage is that it > > will > > > be occasionally destroyed. Users span multiple devices these days too. > > > > > > > This can be dealt with using cookie resurrection techniques. Or by > letting > > the user destroy their copy of the history if they like. > > > > The user history matrix will be quite a bit larger than the user > > > recommendation matrix, maybe an order or two larger. > > > > > > I don't think so. And it doesn't matter since this is reduced to > > significant cooccurrence and that is typically quite small compared to a > > list of recommenda
Re: Which database should I use with Mahout
I think Pat is just saying that time(history_lookup) (1) + time (recommendation_calculation) (2) > time(precalc_lookup) (3) since 1 and 3 are assumed to be served by the same system class (key value store, db) with a single key and 2 > 0. Ted is using a lot of information that is available at recommendation time and not fetched from somewhere ("context of delivery", geolocation). The question remaining is why the recent history is available without a lookup, which can only be the case if the recommendation calculation is embedded in a bigger request cycle where the history is loaded somewhere else, or it's just stored in the browser. if you would store the classical (netflix/mahout) user-item history in the browser and use a disk matrix thing like lucene for calculation you would end up in the same range. I think the points are more: 1. Having more inputs than the classical item-interactions (geolocation->item,search_term->item ..) can be very easily carried out with a search index storing these precalculated "association rules" 2. Precalculation per user is heavyweight, stale and hard to do if the context also plays a role (e.g. the site the user is on, because you have to have the cartesian product of recommendations prepared for every user), while the "real time" approach can handle it On Tue, May 21, 2013 at 2:00 AM, Ted Dunning wrote: > Inline answers. > > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel wrote: > > > ... > > You use the user history vector as a query? > > > The most recent suffix of the history vector. How much is used varies by > the purpose. > > > > This will be a list of item IDs and strength-of-preference values (maybe > > 1s for purchases). > > > Just a list of item x action codes. No strength needed. If you have 5 > point ratings, then you can have 5 actions for each item. The weighting > for each action can be learned. > > > > The cooccurrence matrix has columns treated like terms and rows treated > > like documents though both are really items. 
> > > Well, they are different. The rows are fields within documents associated > with an item. Other fields include ID and other things. The contents of > the field are the codes associated with the item-action pairs for each > non-null column. Usually there is only one action so this reduces to a > single column per item. > > > > > > Does Solr support weighted term lists as queries or do you have to throw > > out strength-of-preference? > > > I prefer to throw it out even though Solr would not require me to do so. > The weights that I want can be encoded in the document index in any case. > > > > I ask because there are cases where the query will have non '1.0' values. > > When the strength values are just 1 the vector is really only a list of > > terms (item IDs). > > > > I really don't know of any cases where this is really true. There are > actions that are categorical. I like to separate them out or to reduce to > a binary case. > > > > > > This technique seems like using a doc as a query but you have reduced the > > doc to the form of a vector of weighted terms. I was unaware that Solr > > allowed weighted term queries. This is really identical to using Solr for > > fast doc similarity queries. > > > > It is really more like an ordinary query. Typical recommendation queries > are short since they are only recent history. > > > > > > ... > > > > Seems like you'd rule out browser based storage because you need the > > history to train your next model. > > > Nothing says that we can't store data in two places according to use. > Browser history is good for the part of the history that becomes the > query. Central storage is good for the mass of history that becomes input > for analytics. > > At least it would be in addition to a server based storage of history. > > > Yes. In addition to. > > > > Another reason you wouldn't rely only on a browser storage is that it > will > > be occasionally destroyed. Users span multiple devices these days too. 
> > > > This can be dealt with using cookie resurrection techniques. Or by letting > the user destroy their copy of the history if they like. > > The user history matrix will be quite a bit larger than the user > > recommendation matrix, maybe an order or two larger. > > > I don't think so. And it doesn't matter since this is reduced to > significant cooccurrence and that is typically quite small compared to a > list of recommendations for all users. > > I have 20 recs for me stored but I've purchased 100's of items, and have > > viewed 1000's. > > > > 20 recs is not sufficient. Typically you need 300 for any given context > and you need to recompute those very frequently. If you use geo-specific > recommendations, you may need thousands of recommendations to have enough > geo-dispersion. The search engine approach can handle all of that on the > fly. > > Also, the cached recs are user x (20-300) non-zeros. The sparsified > item-item cooccurrence matrix is item x 50. Moreov
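The index layout Ted describes, one document per item with a field holding item-action codes for each non-null cooccurrence column, can be sketched like this (the field names and the code format are assumptions for illustration, not the actual Mahout or Solr schema):

```python
def cooccurrence_doc(item_id, indicator_items, action="purchase"):
    """Render one row of the sparsified cooccurrence matrix as an
    indexable document.

    Each indicator item becomes an item-action code; with a single
    action this collapses to one code per indicator item, as described
    above. With 5-point ratings you would emit 5 distinct action codes.
    """
    codes = " ".join(f"{i}_{action}" for i in indicator_items)
    return {"id": item_id, "indicators": codes}

doc = cooccurrence_doc("item7", ["item42", "item99"])
print(doc["indicators"])  # item42_purchase item99_purchase
```

At query time the user's recent item-action codes are matched against the `indicators` field, so items that cooccur with the user's history score highest.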
Re: Which database should I use with Mahout
Inline answers. On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel wrote: > ... > You use the user history vector as a query? The most recent suffix of the history vector. How much is used varies by the purpose. > This will be a list of item IDs and strength-of-preference values (maybe > 1s for purchases). Just a list of item x action codes. No strength needed. If you have 5 point ratings, then you can have 5 actions for each item. The weighting for each action can be learned. > The cooccurrence matrix has columns treated like terms and rows treated > like documents though both are really items. Well, they are different. The rows are fields within documents associated with an item. Other fields include ID and other things. The contents of the field are the codes associated with the item-action pairs for each non-null column. Usually there is only one action so this reduces to a single column per item. > Does Solr support weighted term lists as queries or do you have to throw > out strength-of-preference? I prefer to throw it out even though Solr would not require me to do so. The weights that I want can be encoded in the document index in any case. > I ask because there are cases where the query will have non '1.0' values. > When the strength values are just 1 the vector is really only a list of > terms (item IDs). > I really don't know of any cases where this is really true. There are actions that are categorical. I like to separate them out or to reduce to a binary case. > > This technique seems like using a doc as a query but you have reduced the > doc to the form of a vector of weighted terms. I was unaware that Solr > allowed weighted term queries. This is really identical to using Solr for > fast doc similarity queries. > It is really more like an ordinary query. Typical recommendation queries are short since they are only recent history. > > ... > > Seems like you'd rule out browser based storage because you need the > history to train your next model. 
Nothing says that we can't store data in two places according to use. Browser history is good for the part of the history that becomes the query. Central storage is good for the mass of history that becomes input for analytics. At least it would be in addition to a server based storage of history. Yes. In addition to. > Another reason you wouldn't rely only on a browser storage is that it will > be occasionally destroyed. Users span multiple devices these days too. > This can be dealt with using cookie resurrection techniques. Or by letting the user destroy their copy of the history if they like. The user history matrix will be quite a bit larger than the user > recommendation matrix, maybe an order or two larger. I don't think so. And it doesn't matter since this is reduced to significant cooccurrence and that is typically quite small compared to a list of recommendations for all users. I have 20 recs for me stored but I've purchased 100's of items, and have > viewed 1000's. > 20 recs is not sufficient. Typically you need 300 for any given context and you need to recompute those very frequently. If you use geo-specific recommendations, you may need thousands of recommendations to have enough geo-dispersion. The search engine approach can handle all of that on the fly. Also, the cached recs are user x (20-300) non-zeros. The sparsified item-item cooccurrence matrix is item x 50. Moreover, search engines are very good at compression. If users >> items, then item x 50 is much smaller, especially after high quality compression (6:1 is a common compression ratio). > > Given that you have to have the entire user history vector to do the query > and given that this is still a lookup from an even larger matrix than the > recs/user matrix and given that you have to do the lookup before the Solr > query, it can't be faster than just looking up pre-calculated recs. None of this applies. There is an item x 50 sized search index. 
There is a recent history that is available without a lookup. All that is required is a single Solr query and that can handle multiple kinds of history and geo-location and user search terms all in a single step. > In other words the query to produce the query will be more problematic > than the query to produce the result, right? > Nope. No such thing, therefore cost = 0. > Something here may be "orders of magnitude" faster, but it isn't the total > elapsed time to return recs at runtime, right? Actually, it is. Round trip of less than 10ms is common. Precalculation goes away. Export of recs nearly goes away. Currency of recommendations is much higher. > Maybe what you are saying is the time to pre-calculate the recs is 0 since > they are calculated at runtime but you still have to create the > cooccurrence matrix so you still need something like mahout hadoop to > produce a model and you still need to index the model with Solr and you > still need to look
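The "single Solr query" described above, combining recent history, geo-location, and user search terms in one step, might look roughly like this. A sketch only: the field names (`indicators`, `text`, `geo`) are hypothetical stand-ins for however the cooccurrence fields were actually indexed:

```python
def build_rec_query(recent_items, search_terms=None, geo=None):
    """Combine recent history, search terms, and geo into one query.

    Returns a Lucene/Solr-style query string. Any matching clause
    contributes to the score, so the result blends every signal that
    happens to be available for this request -- no per-signal lookups.
    """
    clauses = ["indicators:(%s)" % " ".join(recent_items)]
    if search_terms:
        clauses.append("text:(%s)" % " ".join(search_terms))
    if geo:
        clauses.append("geo:%s" % geo)
    return " OR ".join(clauses)

print(build_rec_query(["item12", "item98"], geo="seattle"))
# indicators:(item12 item98) OR geo:seattle
```

Since the recent history is the query itself, there is no separate history-matrix lookup before the search engine round trip, which is the point being argued in this exchange.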
Re: Which database should I use with Mahout
Hi Pat, On May 20, 2013, at 9:46am, Pat Ferrel wrote: > I certainly have questions about this architecture mentioned below but first > let me make sure I understand. > > You use the user history vector as a query? This will be a list of item IDs > and strength-of-preference values (maybe 1s for purchases). The cooccurrence > matrix has columns treated like terms and rows treated like documents though > both are really items. Does Solr support weighted term lists as queries Yes - you can "boost" individual terms in the query. And you can use payloads on terms in the index to adjust their scores as well. -- Ken > or do you have to throw out strength-of-preference? I ask because there are > cases where the query will have non '1.0' values. When the strength values > are just 1 the vector is really only a list of terms (item IDs). > > This technique seems like using a doc as a query but you have reduced the doc > to the form of a vector of weighted terms. I was unaware that Solr allowed > weighted term queries. This is really identical to using Solr for fast doc > similarity queries. > >>> Using a cooccurrence matrix means you are doing item similarity since >>> there is no user data in the matrix. Or are you talking about using the >>> user history as the query? in which case you have to remember somewhere all >>> users' history and look it up for the query, no? >>> >> >> Yes. You do. And that is the key to making this orders of magnitude >> faster. >> >> But that is generally fairly trivial to do. One option is to keep it in a >> cookie. Another is to use browser persistent storage. Another is to use a >> memory based user profile database. Yet another is to use M7 tables on >> MapR or HBase on other Hadoop distributions. >> > > Seems like you'd rule out browser based storage because you need the history > to train your next model. At least it would be in addition to a server based > storage of history. 
Another reason you wouldn't rely only on a browser > storage is that it will be occasionally destroyed. Users span multiple > devices these days too. > > The user history matrix will be quite a bit larger than the user > recommendation matrix, maybe an order or two larger. I have 20 recs for me > stored but I've purchased 100's of items, and have viewed 1000's. > > Given that you have to have the entire user history vector to do the query > and given that this is still a lookup from an even larger matrix than the > recs/user matrix and given that you have to do the lookup before the Solr > query, it can't be faster than just looking up pre-calculated recs. In other > words the query to produce the query will be more problematic than the query > to produce the result, right? > > Something here may be "orders of magnitude" faster, but it isn't the total > elapsed time to return recs at runtime, right? Maybe what you are saying is > the time to pre-calculate the recs is 0 since they are calculated at runtime > but you still have to create the cooccurrence matrix so you still need > something like mahout hadoop to produce a model and you still need to index > the model with Solr and you still need to lookup user history at runtime. > Indexing with Solr is faster than loading a db (8 hours? They are doing > something wrong) but the query side will be slower unless I've missed > something. > > In any case you *have* introduced a realtime rec calculation. This is able to > use user history that may be seconds old and not yet reflected in the > training data (the cooccurrence matrix) and this is very interesting! > >>> >>> This will scale to thousands or tens of thousands of recommendations per >>> second against 10's of millions of items. The number of users doesn't >>> matter. >>> >> > > Yes, no doubt, but the history lookup is still an issue unless I've missed > something. 
The NoSQL queries will scale to tens of thousands of recs against > 10s of millions of items but perhaps with larger more complex infrastructure? > Not sure how Solr scales. > > Being semi-ignorant of Solr, intuition says that it's doing something to speed > things up like using only part of the data somewhere to do approximations. > Have there been any performance comparisons of say precision of one approach > vs the other or do they return identical results? > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
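Ken's note that individual query terms can be boosted means strength-of-preference does not have to be thrown away at query time. A sketch of rendering per-term boosts in standard Lucene/Solr query syntax (the field name is an assumption):

```python
def boosted_query(field, weighted_items):
    """Render strength-of-preference weights as per-term boosts.

    weighted_items maps item IDs to weights; each term gets a ^boost
    suffix, so a 5-star purchase can count for more than a casual view
    in a single query against the cooccurrence index.
    """
    terms = " ".join(f"{item}^{w:g}" for item, w in weighted_items.items())
    return f"{field}:({terms})"

print(boosted_query("indicators", {"item12": 2.0, "item98": 0.5}))
# indicators:(item12^2 item98^0.5)
```

Ted prefers to drop these weights and encode them in the index instead, but as Ken says, the query-side mechanism is there if the non-1.0 case matters.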
Re: Which database should I use with Mahout
I certainly have questions about this architecture mentioned below but first let me make sure I understand. You use the user history vector as a query? This will be a list of item IDs and strength-of-preference values (maybe 1s for purchases). The cooccurrence matrix has columns treated like terms and rows treated like documents though both are really items. Does Solr support weighted term lists as queries or do you have to throw out strength-of-preference? I ask because there are cases where the query will have non-'1.0' values. When the strength values are just 1 the vector is really only a list of terms (item IDs). This technique seems like using a doc as a query, but you have reduced the doc to the form of a vector of weighted terms. I was unaware that Solr allowed weighted term queries. This is really identical to using Solr for fast doc similarity queries. >> Using a cooccurrence matrix means you are doing item similarity since >> there is no user data in the matrix. Or are you talking about using the >> user history as the query? In which case you have to remember somewhere all >> users' history and look it up for the query, no? >> > > Yes. You do. And that is the key to making this orders of magnitude > faster. > > But that is generally fairly trivial to do. One option is to keep it in a > cookie. Another is to use browser persistent storage. Another is to use a > memory-based user profile database. Yet another is to use M7 tables on > MapR or HBase on other Hadoop distributions. > Seems like you'd rule out browser-based storage because you need the history to train your next model. At least it would be in addition to a server-based storage of history. Another reason you wouldn't rely only on browser storage is that it will occasionally be destroyed. Users span multiple devices these days too. The user history matrix will be quite a bit larger than the user recommendation matrix, maybe an order of magnitude or two larger. 
I have 20 recs stored for me but I've purchased 100's of items, and have viewed 1000's. Given that you have to have the entire user history vector to do the query, and given that this is still a lookup from an even larger matrix than the recs/user matrix, and given that you have to do the lookup before the Solr query, it can't be faster than just looking up pre-calculated recs. In other words the query to produce the query will be more problematic than the query to produce the result, right? Something here may be "orders of magnitude" faster, but it isn't the total elapsed time to return recs at runtime, right? Maybe what you are saying is the time to pre-calculate the recs is 0 since they are calculated at runtime, but you still have to create the cooccurrence matrix, so you still need something like Mahout on Hadoop to produce a model, you still need to index the model with Solr, and you still need to look up user history at runtime. Indexing with Solr is faster than loading a db (8 hours? They are doing something wrong) but the query side will be slower unless I've missed something. In any case you *have* introduced a realtime rec calculation. This is able to use user history that may be seconds old and not yet reflected in the training data (the cooccurrence matrix) and this is very interesting! >> >> This will scale to thousands or tens of thousands of recommendations per >> second against 10's of millions of items. The number of users doesn't >> matter. >> > Yes, no doubt, but the history lookup is still an issue unless I've missed something. The NoSQL queries will scale to tens of thousands of recs against 10s of millions of items, but perhaps with larger, more complex infrastructure? Not sure how Solr scales. Being semi-ignorant of Solr, intuition says that it's doing something to speed things up, like using only part of the data somewhere to do approximations. 
Have there been any performance comparisons of say precision of one approach vs the other or do they return identical results?
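The weighted-query question above can be sketched outside of Solr: treat each item's cooccurrence row as a "document" and the user history as a weighted query; an item's score is then the weighted overlap between the two. A toy illustration (item IDs and weights are invented; real Solr/Lucene scoring adds TF-IDF weighting and norms on top of this):

```python
# Score items by weighted overlap between a user-history "query"
# and cooccurrence rows treated as documents. Toy data only.
cooccurrence = {              # item -> set of items it co-occurs with
    "itemA": {"itemB", "itemC"},
    "itemB": {"itemA", "itemD"},
    "itemC": {"itemA"},
    "itemD": {"itemB"},
}

def recommend(history, k=2):
    """history: {item_id: strength-of-preference}. Top-k unseen items."""
    scores = {}
    for item, row in cooccurrence.items():
        if item in history:
            continue                       # don't recommend what was already seen
        s = sum(w for h, w in history.items() if h in row)
        if s > 0:
            scores[item] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(recommend({"itemA": 1.0, "itemB": 0.5}))
```

With all strengths set to 1.0 this degenerates to exactly the "list of terms" query Pat describes; the non-1.0 weights are what a weighted term query would carry.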
Re: Which database should I use with Mahout
On Sun, May 19, 2013 at 8:34 PM, Pat Ferrel wrote: > Won't argue with how fast Solr is, It's another fast and scalable lookup > engine and another option. Especially if you don't need to lookup anything > else by user, in which case you are back to a db... > But remember, it is also doing more than lookup. It is computing scores on items and retaining the highest scoring items. > Using a cooccurrence matrix means you are doing item similairty since > there is no user data in the matrix. Or are you talking about using the > user history as the query? in which case you have to remember somewhere all > users' history and look it up for the query, no? > Yes. You do. And that is the key to making this orders of magnitude faster. But that is generally fairly trivial to do. One option is to keep it in a cookie. Another is to use browser persistent storage. Another is to use a memory based user profile database. Yet another is to use M7 tables on MapR or HBase on other Hadoop distributions. > On May 19, 2013, at 8:09 PM, Ted Dunning wrote: > > On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: > > > Two basic solutions to this are: factorize (reduces 100s of thousands of > > items to hundreds of 'features') and continue to calculate recs at > runtime, > > which you have to do with Myrrix since mahout does not have an in-memory > > ALS impl, or move to the mahout hadoop recommenders and pre-calculate > recs. > > > > Or sparsify the cooccurrence matrix and run recommendations out of a search > engine. > > This will scale to thousands or tens of thousands of recommendations per > second against 10's of millions of items. The number of users doesn't > matter. > >
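Pat's "factorize and calculate recs at runtime" option quoted above is cheap per request because, after factorization, each user and item is just a short dense vector and a rec score is a dot product. A minimal sketch (the factor values are made up; real ones come from ALS or similar):

```python
# After factorization each item is a short vector of "features";
# scoring is a dot product per item, done at request time.
item_factors = {
    "itemA": [0.9, 0.1],      # hypothetical 2-feature factors
    "itemB": [0.2, 0.8],
    "itemC": [0.5, 0.5],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recs_from_factors(user_vec, k=2):
    scores = {i: dot(user_vec, f) for i, f in item_factors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recs_from_factors([1.0, 0.0]))  # a user aligned with feature 0
```

This also shows why the approach eventually maxes CPU rather than memory: every request scores every item.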
Re: Which database should I use with Mahout
Won't argue with how fast Solr is; it's another fast and scalable lookup engine and another option. Especially if you don't need to look up anything else by user, in which case you are back to a db... Using a cooccurrence matrix means you are doing item similarity since there is no user data in the matrix. Or are you talking about using the user history as the query? In which case you have to remember somewhere all users' history and look it up for the query, no? On May 19, 2013, at 8:09 PM, Ted Dunning wrote: On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: > Two basic solutions to this are: factorize (reduces 100s of thousands of > items to hundreds of 'features') and continue to calculate recs at runtime, > which you have to do with Myrrix since Mahout does not have an in-memory > ALS impl, or move to the Mahout hadoop recommenders and pre-calculate recs. > Or sparsify the cooccurrence matrix and run recommendations out of a search engine. This will scale to thousands or tens of thousands of recommendations per second against 10's of millions of items. The number of users doesn't matter.
Re: Which database should I use with Mahout
On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: > Two basic solutions to this are: factorize (reduces 100s of thousands of > items to hundreds of 'features') and continue to calculate recs at runtime, > which you have to do with Myrrix since mahout does not have an in-memory > ALS impl, or move to the mahout hadoop recommenders and pre-calculate recs. > Or sparsify the cooccurrence matrix and run recommendations out of a search engine. This will scale to thousands or tens of thousands of recommendations per second against 10's of millions of items. The number of users doesn't matter.
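Ted's "sparsify the cooccurrence matrix" step can be sketched as keeping only the strongest N entries of each item's row before indexing it. (Mahout actually picks entries with a log-likelihood ratio test; plain counts stand in for that here.)

```python
def sparsify(row_counts, n=2):
    """Keep only the n largest cooccurrence counts for one item's row."""
    top = sorted(row_counts.items(), key=lambda kv: -kv[1])[:n]
    return dict(top)

row = {"itemB": 40, "itemC": 3, "itemD": 25, "itemE": 1}
print(sparsify(row))  # -> {'itemB': 40, 'itemD': 25}
```

The sparsified rows are what gets indexed as "documents", which is what keeps the index, and each query, small enough to serve at thousands of recs per second.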
Re: Which database should I use with Mahout
Ah, which for completeness, brings up another scaling issue with Mahout. The in-memory mahout recommenders do not pre-calculate all users recs. They keep the preference matrix in-memory and calculate the recommendations at runtime. At some point the size of your data will max a single machine. In my experience this happens by maxing CPU usage before the memory is maxed. I began to hit performance limits with 200,000 items and around 1M users. Two basic solutions to this are: factorize (reduces 100s of thousands of items to hundreds of 'features') and continue to calculate recs at runtime, which you have to do with Myrrix since mahout does not have an in-memory ALS impl, or move to the mahout hadoop recommenders and pre-calculate recs. On May 19, 2013, at 6:34 PM, Sean Owen wrote: (I had in mind non distributed parts of Mahout but the principle is similar, yes.) On May 19, 2013 6:27 PM, "Pat Ferrel" wrote: > Using a Hadoop version of a Mahout recommender will create some number of > recs for all users as its output. Sean is talking about Myrrix I think > which uses factorization to get much smaller models and so can calculate > the recs at runtime for fairly large user sets. > > However if you are using Mahout and Hadoop the question is how to store > and lookup recommendations in the quickest scalable way. You will have a > user ID and perhaps an item ID as a key to the list of recommendations. The > fastest thing to do is have a hashmap in memory, perhaps read in from HDFS. > Remember that Mahout will output the recommendations with internal Mahout > IDs so you will have to replace these in the data with your actual user and > item ids. > > I use a NoSQL DB, either MongoDB or Cassandra but others are fine too, > even MySQL if you can scale it to meet your needs. I end up with two > tables, one has my user ID as a key and recommendations with my item IDs > either ordered or with strengths. 
The second table has my item ID as the > key with a list of similar items (again sorted or with strengths). At > runtime I may have both a user ID and an item ID context so I get a list > from both tables and combine them at runtime. > > I use a DB for many reasons and let it handle the caching. I never need to > worry about memory management. If you have scaled your DB properly the > lookups will actually be executed like an in-memory hashmap with indexed > keys for ids. Scaling the DB can be done as your user base grows when > needed without affecting the rest of the calculation pipeline. Yes there > will be overhead due to network traffic in a cluster but the flexibility is > worth it for me. If high availability is important you can spread out your > db cluster over multiple data centers without affecting the API for serving > recommendations. I set up the recommendation calculation to run > continuously in the background, replacing values in the two tables as fast > as I can. This allows you to scale update speed (how many machines in the > mahout/hadoop cluster) independently from lookup performance scaling (how > many machines in your db cluster, how much memory do the db machine have). > > On May 19, 2013, at 11:45 AM, Manuel Blechschmidt < > manuel.blechschm...@gmx.de> wrote: > > Hi Tevfik, > I am working with mysql but I would guess that HDFS like Sean suggested > would be a good idea as well. > > There is also a project called sqoop which can be used to transfer data > from relation databases to Hadoop. > > http://sqoop.apache.org/ > > Scribe might be also an option for transferring a lot of data: > https://github.com/facebook/scribe#readme > > I would suggest that you just start with the technology that you know best > and then if you solve the problem as soon as you get them. 
> > /Manuel > > Am 19.05.2013 um 20:26 schrieb Sean Owen: > >> I think everyone is agreeing that it is essential to only access >> information in memory at run-time, yes, whatever that info may be. >> I don't think the original question was about Hadoop, but, the answer >> is the same: Hadoop mappers are just reading the input serially. There >> is no advantage to a relational database or NoSQL database; they're >> just overkill. HDFS is sufficient, and probably even best of these at >> allowing fast serial access to the data. >> >> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin >> wrote: >>> Hi Manuel, >>> But if one uses matrix factorization and stores the user and item >>> factors in memory then there will be no database access during >>> recommendation. >>> I thought that the original question was where to store the data and >>> how to give it to hadoop. >>> >>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt >>> wrote: Hi Tevfik, one request to the recommender could become more than 1000 queries to > the database depending on which recommender you use and the amount of > preferences for the given user. The problem is not if you are using SQL, NoSQL, or any other query > language. The problem is the latency of the answers.
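Pat's two-table scheme above (one table keyed by user ID holding recs, a second keyed by item ID holding similar items, merged at runtime) might look like this in miniature; the dicts stand in for the DB tables and the 0.5 blend weight is a made-up choice:

```python
# Stand-ins for the two pre-calculated tables (NoSQL/MySQL in practice).
user_recs = {"u1": [("itemA", 0.9), ("itemB", 0.4)]}
similar_items = {"itemC": [("itemB", 0.8), ("itemD", 0.6)]}

def combined_recs(user_id, context_item, k=3):
    """Merge the user's stored recs with items similar to the context item."""
    scores = {}
    for item, s in user_recs.get(user_id, []):
        scores[item] = scores.get(item, 0.0) + s
    for item, s in similar_items.get(context_item, []):
        scores[item] = scores.get(item, 0.0) + 0.5 * s  # arbitrary blend weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(combined_recs("u1", "itemC"))
```

Because both lookups are single-key reads, a properly scaled DB serves them at close to hashmap speed, which is the point Pat makes about letting the DB handle caching.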
Re: Which database should I use with Mahout
On Sun, May 19, 2013 at 6:26 PM, Pat Ferrel wrote: > Using a Hadoop version of a Mahout recommender will create some number of > recs for all users as its output. Sean is talking about Myrrix I think > which uses factorization to get much smaller models and so can calculate > the recs at runtime for fairly large user sets. > The Mahout recommender can also produce a model in the form of item-item matrices that can be used to produce recommendations on the fly from a memory-based model. However if you are using Mahout and Hadoop the question is how to store and > lookup recommendations in the quickest scalable way. You will have a user > ID and perhaps an item ID as a key to the list of recommendations. The > fastest thing to do is have a hashmap in memory, perhaps read in from HDFS. Or just use Solr and create the recommendations on the fly. > Remember that Mahout will output the recommendations with internal Mahout > IDs so you will have to replace these in the data with your actual user and item ids. > This can be repaired at index time using a search engine as well. > I use a NoSQL DB, either MongoDB or Cassandra but others are fine too, > even MySQL if you can scale it to meet your needs. I end up with two > tables, one has my user ID as a key and recommendations with my item IDs > either ordered or with strengths. The second table has my item ID as the > key with a list of similar items (again sorted or with strengths). At > runtime I may have both a user ID and an item ID context so I get a list > from both tables and combine them at runtime. > MapR has a large bank as a client who used this approach. Exporting recs took 8 hours. Switching to Solr to compute the recommendations decreased export time to under 3 minutes. >
Re: Which database should I use with Mahout
(I had in mind non distributed parts of Mahout but the principle is similar, yes.) On May 19, 2013 6:27 PM, "Pat Ferrel" wrote: > Using a Hadoop version of a Mahout recommender will create some number of > recs for all users as its output. Sean is talking about Myrrix I think > which uses factorization to get much smaller models and so can calculate > the recs at runtime for fairly large user sets. > > However if you are using Mahout and Hadoop the question is how to store > and lookup recommendations in the quickest scalable way. You will have a > user ID and perhaps an item ID as a key to the list of recommendations. The > fastest thing to do is have a hashmap in memory, perhaps read in from HDFS. > Remember that Mahout will output the recommendations with internal Mahout > IDs so you will have to replace these in the data with your actual user and > item ids. > > I use a NoSQL DB, either MongoDB or Cassandra but others are fine too, > even MySQL if you can scale it to meet your needs. I end up with two > tables, one has my user ID as a key and recommendations with my item IDs > either ordered or with strengths. The second table has my item ID as the > key with a list of similar items (again sorted or with strengths). At > runtime I may have both a user ID and an item ID context so I get a list > from both tables and combine them at runtime. > > I use a DB for many reasons and let it handle the caching. I never need to > worry about memory management. If you have scaled your DB properly the > lookups will actually be executed like an in-memory hashmap with indexed > keys for ids. Scaling the DB can be done as your user base grows when > needed without affecting the rest of the calculation pipeline. Yes there > will be overhead due to network traffic in a cluster but the flexibility is > worth it for me. If high availability is important you can spread out your > db cluster over multiple data centers without affecting the API for serving > recommendations. 
I set up the recommendation calculation to run > continuously in the background, replacing values in the two tables as fast > as I can. This allows you to scale update speed (how many machines in the > mahout/hadoop cluster) independently from lookup performance scaling (how > many machines in your db cluster, how much memory do the db machine have). > > On May 19, 2013, at 11:45 AM, Manuel Blechschmidt < > manuel.blechschm...@gmx.de> wrote: > > Hi Tevfik, > I am working with mysql but I would guess that HDFS like Sean suggested > would be a good idea as well. > > There is also a project called sqoop which can be used to transfer data > from relation databases to Hadoop. > > http://sqoop.apache.org/ > > Scribe might be also an option for transferring a lot of data: > https://github.com/facebook/scribe#readme > > I would suggest that you just start with the technology that you know best > and then if you solve the problem as soon as you get them. > > /Manuel > > Am 19.05.2013 um 20:26 schrieb Sean Owen: > > > I think everyone is agreeing that it is essential to only access > > information in memory at run-time, yes, whatever that info may be. > > I don't think the original question was about Hadoop, but, the answer > > is the same: Hadoop mappers are just reading the input serially. There > > is no advantage to a relational database or NoSQL database; they're > > just overkill. HDFS is sufficient, and probably even best of these at > > allowing fast serial access to the data. > > > > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin > > wrote: > >> Hi Manuel, > >> But if one uses matrix factorization and stores the user and item > >> factors in memory then there will be no database access during > >> recommendation. > >> I thought that the original question was where to store the data and > >> how to give it to hadoop. 
> >> > >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt > >> wrote: > >>> Hi Tevfik, > >>> one request to the recommender could become more than 1000 queries to > the database depending on which recommender you use and the amount of > preferences for the given user. > >>> > >>> The problem is not if you are using SQL, NoSQL, or any other query > language. The problem is the latency of the answers. > >>> > >>> An average TCP packet in the same data center takes 500 µs. A main > memory reference 0.1 µs. This means that the main memory of your Java > process can be accessed 5000 times faster than any other process like a > database connected via TCP/IP. > >>> > >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html > >>> > >>> Here you can see a screenshot that shows that database communication > is by far (99%) the slowest component of a recommender request: > >>> > >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png > >>> > >>> If you do not want to cache your data in your Java process you can use > a complete in-memory database technology like SAP HANA > http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
Re: Which database should I use with Mahout
Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think which uses factorization to get much smaller models and so can calculate the recs at runtime for fairly large user sets. However if you are using Mahout and Hadoop the question is how to store and lookup recommendations in the quickest scalable way. You will have a user ID and perhaps an item ID as a key to the list of recommendations. The fastest thing to do is have a hashmap in memory, perhaps read in from HDFS. Remember that Mahout will output the recommendations with internal Mahout IDs so you will have to replace these in the data with your actual user and item ids. I use a NoSQL DB, either MongoDB or Cassandra but others are fine too, even MySQL if you can scale it to meet your needs. I end up with two tables, one has my user ID as a key and recommendations with my item IDs either ordered or with strengths. The second table has my item ID as the key with a list of similar items (again sorted or with strengths). At runtime I may have both a user ID and an item ID context so I get a list from both tables and combine them at runtime. I use a DB for many reasons and let it handle the caching. I never need to worry about memory management. If you have scaled your DB properly the lookups will actually be executed like an in-memory hashmap with indexed keys for ids. Scaling the DB can be done as your user base grows when needed without affecting the rest of the calculation pipeline. Yes there will be overhead due to network traffic in a cluster but the flexibility is worth it for me. If high availability is important you can spread out your db cluster over multiple data centers without affecting the API for serving recommendations. I set up the recommendation calculation to run continuously in the background, replacing values in the two tables as fast as I can. 
This allows you to scale update speed (how many machines in the mahout/hadoop cluster) independently from lookup performance scaling (how many machines in your db cluster, how much memory do the db machine have). On May 19, 2013, at 11:45 AM, Manuel Blechschmidt wrote: Hi Tevfik, I am working with mysql but I would guess that HDFS like Sean suggested would be a good idea as well. There is also a project called sqoop which can be used to transfer data from relation databases to Hadoop. http://sqoop.apache.org/ Scribe might be also an option for transferring a lot of data: https://github.com/facebook/scribe#readme I would suggest that you just start with the technology that you know best and then if you solve the problem as soon as you get them. /Manuel Am 19.05.2013 um 20:26 schrieb Sean Owen: > I think everyone is agreeing that it is essential to only access > information in memory at run-time, yes, whatever that info may be. > I don't think the original question was about Hadoop, but, the answer > is the same: Hadoop mappers are just reading the input serially. There > is no advantage to a relational database or NoSQL database; they're > just overkill. HDFS is sufficient, and probably even best of these at > allowing fast serial access to the data. > > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin > wrote: >> Hi Manuel, >> But if one uses matrix factorization and stores the user and item >> factors in memory then there will be no database access during >> recommendation. >> I thought that the original question was where to store the data and >> how to give it to hadoop. >> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt >> wrote: >>> Hi Tevfik, >>> one request to the recommender could become more then 1000 queries to the >>> database depending on which recommender you use and the amount of >>> preferences for the given user. >>> >>> The problem is not if you are using SQL, NoSQL, or any other query >>> language. 
The problem is the latency of the answers. >>> >>> An average tcp package in the same data center takes 500 µs. A main memory >>> reference 0,1 µs. This means that your main memory of your java process can >>> be accessed 5000 times faster then any other process like a database >>> connected via TCP/IP. >>> >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html >>> >>> Here you can see a screenshot that shows that database communication is by >>> far (99%) the slowest component of a recommender request: >>> >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png >>> >>> If you do not want to cache your data in your Java process you can use a >>> complete in memory database technology like SAP HANA >>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ >>> >>> Nevertheless if you are using these you do not need Mahout anymore. >>> >>> An architecture of a Mahout system can be seen here: >>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png >>>
Re: Which database should I use with Mahout
Hi Tevfik, I am working with mysql but I would guess that HDFS like Sean suggested would be a good idea as well. There is also a project called sqoop which can be used to transfer data from relational databases to Hadoop. http://sqoop.apache.org/ Scribe might be also an option for transferring a lot of data: https://github.com/facebook/scribe#readme I would suggest that you just start with the technology that you know best and then solve the problems as they come up. /Manuel Am 19.05.2013 um 20:26 schrieb Sean Owen: > I think everyone is agreeing that it is essential to only access > information in memory at run-time, yes, whatever that info may be. > I don't think the original question was about Hadoop, but, the answer > is the same: Hadoop mappers are just reading the input serially. There > is no advantage to a relational database or NoSQL database; they're > just overkill. HDFS is sufficient, and probably even best of these at > allowing fast serial access to the data. > > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin > wrote: >> Hi Manuel, >> But if one uses matrix factorization and stores the user and item >> factors in memory then there will be no database access during >> recommendation. >> I thought that the original question was where to store the data and >> how to give it to hadoop. >> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt >> wrote: >>> Hi Tevfik, >>> one request to the recommender could become more than 1000 queries to the >>> database depending on which recommender you use and the amount of >>> preferences for the given user. >>> >>> The problem is not if you are using SQL, NoSQL, or any other query >>> language. The problem is the latency of the answers. >>> >>> An average TCP packet in the same data center takes 500 µs. A main memory >>> reference 0.1 µs. This means that the main memory of your Java process can >>> be accessed 5000 times faster than any other process like a database >>> connected via TCP/IP. 
>>> >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html >>> >>> Here you can see a screenshot that shows that database communication is by >>> far (99%) the slowest component of a recommender request: >>> >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png >>> >>> If you do not want to cache your data in your Java process you can use a >>> complete in memory database technology like SAP HANA >>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ >>> >>> Nevertheless if you are using these you do not need Mahout anymore. >>> >>> An architecture of a Mahout system can be seen here: >>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png >>> >>> Hope that helps >>>Manuel >>> >>> Am 19.05.2013 um 19:20 schrieb Sean Owen: >>> I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from, because it is read just once, serially. A file is just as fine as a fancy database. In fact it's probably easier and faster. On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin wrote: > Thanks Sean, but I could not get your answer. Can you please explain it > again? > > > On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: >> It doesn't matter, in the sense that it is never going to be fast >> enough for real-time at any reasonable scale if actually run off a >> database directly. One operation results in thousands of queries. It's >> going to read data into memory anyway and cache it there. So, whatever >> is easiest for you. The simplest solution is a file. >> >> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz >> wrote: >>> Hi, >>> I would like to use Mahout to make recommendations on my web site. 
>>> Since the data is going to be big, hopefully, I plan to use hadoop >>> implementations of the recommender algorithms. >>> >>> I'm currently storing the data in mysql. Should I continue with it or >>> should I switch to a nosql database such as mongodb or something else? >>> >>> Thanks >>> Ahmet >>> >>> -- >>> Manuel Blechschmidt >>> M.Sc. IT Systems Engineering >>> Dortustr. 57 >>> 14467 Potsdam >>> Mobil: 0173/6322621 >>> Twitter: http://twitter.com/Manuel_B >>> -- Manuel Blechschmidt M.Sc. IT Systems Engineering Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
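Manuel's figures make the latency argument easy to check: at roughly 500 µs per in-datacenter TCP round trip versus ~0.1 µs per main-memory reference, a recommender request that fans out into 1000 database queries spends about half a second on the network alone. A back-of-envelope sketch using his numbers:

```python
# Manuel's latency figures, worked through for one recommender request.
TCP_ROUND_TRIP_US = 500      # in-datacenter round trip, microseconds
MEM_REF_US = 0.1             # main-memory reference, microseconds
QUERIES_PER_REQUEST = 1000   # "one request ... more than 1000 queries"

db_cost_ms = QUERIES_PER_REQUEST * TCP_ROUND_TRIP_US / 1000
mem_cost_ms = QUERIES_PER_REQUEST * MEM_REF_US / 1000
# ~500 ms over the wire vs ~0.1 ms from memory: the 5000x gap
print(db_cost_ms, mem_cost_ms)
```

This is why everyone in the thread converges on reading the data once, serially, into memory rather than querying the DB per request.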
Re: Which database should I use with Mahout
(Oh, by the way, I realize the original question was about Hadoop. I can't read carefully.) No, HDFS is not good for anything like random access. For input, that's OK, because you don't need random access. So HDFS is just fine. For output, if you are going to then serve these precomputed results at run-time, they need to be in a container appropriate for quick random access. There, a NoSQL store like HBase or something does sound appropriate. You can create an output format that writes directly into it, with a little work. The drawbacks to this approach -- computing results in Hadoop -- are that they are inevitably a bit stale, not real-time, and you have to compute results for everyone, even though very few of those results will be used. Of course, serving is easy and fast. There are hybrid solutions that I can talk to you about offline that get a bit of the best of both worlds. On Sun, May 19, 2013 at 11:37 AM, Ahmet Ylmaz wrote: > Hi Sean, > If I understood you correctly you are saying that I will not need mysql. But > if I store my data on HDFS will I be able to make fast queries such as > "Return all the ratings of a specific user" > which will be needed for showing the past ratings of a user. > > Ahmet > > > > From: Sean Owen > To: Mahout User List > Sent: Sunday, May 19, 2013 9:26 PM > Subject: Re: Which database should I use with Mahout > > > I think everyone is agreeing that it is essential to only access > information in memory at run-time, yes, whatever that info may be. > I don't think the original question was about Hadoop, but, the answer > is the same: Hadoop mappers are just reading the input serially. There > is no advantage to a relational database or NoSQL database; they're > just overkill. HDFS is sufficient, and probably even best of these at > allowing fast serial access to the data. 
> > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin > wrote: >> Hi Manuel, >> But if one uses matrix factorization and stores the user and item >> factors in memory then there will be no database access during >> recommendation. >> I thought that the original question was where to store the data and >> how to give it to hadoop. >> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt >> wrote: >>> Hi Tevfik, >>> one request to the recommender could become more then 1000 queries to the >>> database depending on which recommender you use and the amount of >>> preferences for the given user. >>> >>> The problem is not if you are using SQL, NoSQL, or any other query >>> language. The problem is the latency of the answers. >>> >>> An average tcp package in the same data center takes 500 µs. A main memory >>> reference 0,1 µs. This means that your main memory of your java process can >>> be accessed 5000 times faster then any other process like a database >>> connected via TCP/IP. >>> >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html >>> >>> Here you can see a screenshot that shows that database communication is by >>> far (99%) the slowest component of a recommender request: >>> >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png >>> >>> If you do not want to cache your data in your Java process you can use a >>> complete in memory database technology like SAP HANA >>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ >>> >>> Nevertheless if you are using these you do not need Mahout anymore. >>> >>> An architecture of a Mahout system can be seen here: >>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png >>> >>> Hope that helps >>> Manuel >>> >>> Am 19.05.2013 um 19:20 schrieb Sean Owen: >>> >>>> I'm first saying that you really don't want to use the database as a >>>> data model directly. It is far too slow. 
>>>> Instead you want to use a data model implementation that reads all of >>>> the data, once, serially, into memory. And in that case, it makes no >>>> difference where the data is being read from, because it is read just >>>> once, serially. A file is just as fine as a fancy database. In fact >>>> it's probably easier and faster. >>>> >>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin >>>> wrote: >>>>> Thanks Sean, but I could not get your answer. Can you please explain it >>>>> again?
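Sean's point above -- read the whole dataset once, serially, into memory -- also answers Ahmet's worry about "return all the ratings of a specific user": after the one serial pass, that query is a plain map lookup. A minimal plain-Java sketch of the idea (no Mahout dependency; the userID,itemID,rating CSV layout mirrors what Mahout's FileDataModel consumes, but the class and method names here are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One serial pass over a ratings file (userID,itemID,rating), loaded into memory.
// Afterwards "all ratings of a specific user" is a map lookup, not a DB query.
public class InMemoryRatings {
    public static final class Rating {
        public final long itemId;
        public final double value;
        Rating(long itemId, double value) { this.itemId = itemId; this.value = value; }
    }

    private final Map<Long, List<Rating>> byUser = new HashMap<>();

    // Read every line once, serially -- the access pattern a flat file or HDFS is good at.
    public void load(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split(",");
            byUser.computeIfAbsent(Long.parseLong(f[0]), k -> new ArrayList<>())
                  .add(new Rating(Long.parseLong(f[1]), Double.parseDouble(f[2])));
        }
    }

    // The run-time query Ahmet asked about: past ratings of one user, at memory speed.
    public List<Rating> ratingsOf(long userId) {
        return byUser.getOrDefault(userId, Collections.emptyList());
    }

    public static void main(String[] args) throws IOException {
        InMemoryRatings model = new InMemoryRatings();
        model.load(new BufferedReader(new StringReader("1,10,4.0\n1,11,3.5\n2,10,5.0")));
        System.out.println(model.ratingsOf(1).size()); // prints 2
    }
}
```

In practice Mahout's FileDataModel performs this load for you; the sketch only shows why per-user access is memory-speed once the serial read is done.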
Re: Which database should I use with Mahout
Hi Sean, If I understood you correctly you are saying that I will not need mysql. But if I store my data on HDFS will I be able to make fast queries such as "Return all the ratings of a specific user" which will be needed for showing the past ratings of a user. Ahmet From: Sean Owen To: Mahout User List Sent: Sunday, May 19, 2013 9:26 PM Subject: Re: Which database should I use with Mahout I think everyone is agreeing that it is essential to only access information in memory at run-time, yes, whatever that info may be. I don't think the original question was about Hadoop, but, the answer is the same: Hadoop mappers are just reading the input serially. There is no advantage to a relational database or NoSQL database; they're just overkill. HDFS is sufficient, and probably even best of these at allowing fast serial access to the data. On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin wrote: > Hi Manuel, > But if one uses matrix factorization and stores the user and item > factors in memory then there will be no database access during > recommendation. > I thought that the original question was where to store the data and > how to give it to hadoop. > > On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt > wrote: >> Hi Tevfik, >> one request to the recommender could become more than 1000 queries to the >> database depending on which recommender you use and the amount of >> preferences for the given user. >> >> The problem is not whether you are using SQL, NoSQL, or any other query language. >> The problem is the latency of the answers. >> >> An average TCP packet in the same data center takes 500 µs. A main memory >> reference 0.1 µs. This means that the main memory of your Java process can >> be accessed 5000 times faster than any other process like a database >> connected via TCP/IP. 
>> >> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html >> >> Here you can see a screenshot that shows that database communication is by >> far (99%) the slowest component of a recommender request: >> >> https://source.apaxo.de/MahoutDatabaseLowPerformance.png >> >> If you do not want to cache your data in your Java process you can use a >> complete in memory database technology like SAP HANA >> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ >> >> Nevertheless if you are using these you do not need Mahout anymore. >> >> An architecture of a Mahout system can be seen here: >> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png >> >> Hope that helps >> Manuel >> >> Am 19.05.2013 um 19:20 schrieb Sean Owen: >> >>> I'm first saying that you really don't want to use the database as a >>> data model directly. It is far too slow. >>> Instead you want to use a data model implementation that reads all of >>> the data, once, serially, into memory. And in that case, it makes no >>> difference where the data is being read from, because it is read just >>> once, serially. A file is just as fine as a fancy database. In fact >>> it's probably easier and faster. >>> >>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin >>> wrote: >>>> Thanks Sean, but I could not get your answer. Can you please explain it >>>> again? >>>> >>>> >>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: >>>>> It doesn't matter, in the sense that it is never going to be fast >>>>> enough for real-time at any reasonable scale if actually run off a >>>>> database directly. One operation results in thousands of queries. It's >>>>> going to read data into memory anyway and cache it there. So, whatever >>>>> is easiest for you. The simplest solution is a file. >>>>> >>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz >>>>> wrote: >>>>>> Hi, >>>>>> I would like to use Mahout to make recommendations on my web site. 
Since >>>>>> the data is going to be big, hopefully, I plan to use hadoop >>>>>> implementations of the recommender algorithms. >>>>>> >>>>>> I'm currently storing the data in mysql. Should I continue with it or >>>>>> should I switch to a nosql database such as mongodb or something else? >>>>>> >>>>>> Thanks >>>>>> Ahmet >> >> -- >> Manuel Blechschmidt >> M.Sc. IT Systems Engineering >> Dortustr. 57 >> 14467 Potsdam >> Mobil: 0173/6322621 >> Twitter: http://twitter.com/Manuel_B >>
Re: Which database should I use with Mahout
I think everyone is agreeing that it is essential to only access information in memory at run-time, yes, whatever that info may be. I don't think the original question was about Hadoop, but, the answer is the same: Hadoop mappers are just reading the input serially. There is no advantage to a relational database or NoSQL database; they're just overkill. HDFS is sufficient, and probably even best of these at allowing fast serial access to the data. On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin wrote: > Hi Manuel, > But if one uses matrix factorization and stores the user and item > factors in memory then there will be no database access during > recommendation. > I thought that the original question was where to store the data and > how to give it to hadoop. > > On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt > wrote: >> Hi Tevfik, >> one request to the recommender could become more than 1000 queries to the >> database depending on which recommender you use and the amount of >> preferences for the given user. >> >> The problem is not whether you are using SQL, NoSQL, or any other query language. >> The problem is the latency of the answers. >> >> An average TCP packet in the same data center takes 500 µs. A main memory >> reference 0.1 µs. This means that the main memory of your Java process can >> be accessed 5000 times faster than any other process like a database >> connected via TCP/IP. >> >> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html >> >> Here you can see a screenshot that shows that database communication is by >> far (99%) the slowest component of a recommender request: >> >> https://source.apaxo.de/MahoutDatabaseLowPerformance.png >> >> If you do not want to cache your data in your Java process you can use a >> completely in-memory database technology like SAP HANA >> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ >> >> Nevertheless if you are using these you do not need Mahout anymore. 
>> >> An architecture of a Mahout system can be seen here: >> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png >> >> Hope that helps >> Manuel >> >> Am 19.05.2013 um 19:20 schrieb Sean Owen: >> >>> I'm first saying that you really don't want to use the database as a >>> data model directly. It is far too slow. >>> Instead you want to use a data model implementation that reads all of >>> the data, once, serially, into memory. And in that case, it makes no >>> difference where the data is being read from, because it is read just >>> once, serially. A file is just as fine as a fancy database. In fact >>> it's probably easier and faster. >>> >>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin >>> wrote: Thanks Sean, but I could not get your answer. Can you please explain it again? On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: > It doesn't matter, in the sense that it is never going to be fast > enough for real-time at any reasonable scale if actually run off a > database directly. One operation results in thousands of queries. It's > going to read data into memory anyway and cache it there. So, whatever > is easiest for you. The simplest solution is a file. > > On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz > wrote: >> Hi, >> I would like to use Mahout to make recommendations on my web site. Since >> the data is going to be big, hopefully, I plan to use hadoop >> implementations of the recommender algorithms. >> >> I'm currently storing the data in mysql. Should I continue with it or >> should I switch to a nosql database such as mongodb or something else? >> >> Thanks >> Ahmet >> >> -- >> Manuel Blechschmidt >> M.Sc. IT Systems Engineering >> Dortustr. 57 >> 14467 Potsdam >> Mobil: 0173/6322621 >> Twitter: http://twitter.com/Manuel_B >>
Re: Which database should I use with Mahout
Hi Manuel, But if one uses matrix factorization and stores the user and item factors in memory then there will be no database access during recommendation. I thought that the original question was where to store the data and how to give it to hadoop. On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt wrote: > Hi Tevfik, > one request to the recommender could become more than 1000 queries to the > database depending on which recommender you use and the amount of preferences > for the given user. > > The problem is not whether you are using SQL, NoSQL, or any other query language. > The problem is the latency of the answers. > > An average TCP packet in the same data center takes 500 µs. A main memory > reference 0.1 µs. This means that the main memory of your Java process can > be accessed 5000 times faster than any other process like a database > connected via TCP/IP. > > http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html > > Here you can see a screenshot that shows that database communication is by > far (99%) the slowest component of a recommender request: > > https://source.apaxo.de/MahoutDatabaseLowPerformance.png > > If you do not want to cache your data in your Java process you can use a > completely in-memory database technology like SAP HANA > http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ > > Nevertheless if you are using these you do not need Mahout anymore. > > An architecture of a Mahout system can be seen here: > https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png > > Hope that helps > Manuel > > Am 19.05.2013 um 19:20 schrieb Sean Owen: > >> I'm first saying that you really don't want to use the database as a >> data model directly. It is far too slow. >> Instead you want to use a data model implementation that reads all of >> the data, once, serially, into memory. 
And in that case, it makes no >> difference where the data is being read from, because it is read just >> once, serially. A file is just as fine as a fancy database. In fact >> it's probably easier and faster. >> >> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin >> wrote: >>> Thanks Sean, but I could not get your answer. Can you please explain it >>> again? >>> >>> >>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The simplest solution is a file. On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz wrote: > Hi, > I would like to use Mahout to make recommendations on my web site. Since > the data is going to be big, hopefully, I plan to use hadoop > implementations of the recommender algorithms. > > I'm currently storing the data in mysql. Should I continue with it or > should I switch to a nosql database such as mongodb or something else? > > Thanks > Ahmet > > -- > Manuel Blechschmidt > M.Sc. IT Systems Engineering > Dortustr. 57 > 14467 Potsdam > Mobil: 0173/6322621 > Twitter: http://twitter.com/Manuel_B >
Re: Which database should I use with Mahout
Hi Tevfik, one request to the recommender could become more than 1000 queries to the database depending on which recommender you use and the amount of preferences for the given user. The problem is not whether you are using SQL, NoSQL, or any other query language. The problem is the latency of the answers. An average TCP packet in the same data center takes 500 µs. A main memory reference 0.1 µs. This means that the main memory of your Java process can be accessed 5000 times faster than any other process like a database connected via TCP/IP. http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html Here you can see a screenshot that shows that database communication is by far (99%) the slowest component of a recommender request: https://source.apaxo.de/MahoutDatabaseLowPerformance.png If you do not want to cache your data in your Java process you can use a completely in-memory database technology like SAP HANA http://www.saphana.com/welcome or EXASOL http://www.exasol.com/ Nevertheless if you are using these you do not need Mahout anymore. An architecture of a Mahout system can be seen here: https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png Hope that helps Manuel Am 19.05.2013 um 19:20 schrieb Sean Owen: > I'm first saying that you really don't want to use the database as a > data model directly. It is far too slow. > Instead you want to use a data model implementation that reads all of > the data, once, serially, into memory. And in that case, it makes no > difference where the data is being read from, because it is read just > once, serially. A file is just as fine as a fancy database. In fact > it's probably easier and faster. > > On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin > wrote: >> Thanks Sean, but I could not get your answer. Can you please explain it >> again? 
>> >> >> On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: >>> It doesn't matter, in the sense that it is never going to be fast >>> enough for real-time at any reasonable scale if actually run off a >>> database directly. One operation results in thousands of queries. It's >>> going to read data into memory anyway and cache it there. So, whatever >>> is easiest for you. The simplest solution is a file. >>> >>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz >>> wrote: Hi, I would like to use Mahout to make recommendations on my web site. Since the data is going to be big, hopefully, I plan to use hadoop implementations of the recommender algorithms. I'm currently storing the data in mysql. Should I continue with it or should I switch to a nosql database such as mongodb or something else? Thanks Ahmet -- Manuel Blechschmidt M.Sc. IT Systems Engineering Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
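Manuel's figures are easy to sanity-check: 500 µs per same-data-center TCP exchange against 0.1 µs per main-memory reference gives the quoted 5000x factor, and a recommender request that fans out into ~1000 database queries pays roughly half a second in network latency alone, before any actual computation. A back-of-envelope sketch (the constants are the thread's order-of-magnitude figures, not measurements, and the helper names are mine):

```java
// Back-of-envelope arithmetic behind the "5000 times faster" claim in the thread.
public class LatencyMath {
    // Figures quoted in the thread (order-of-magnitude numbers, not benchmarks).
    static final double TCP_SAME_DC_US = 500.0; // one TCP exchange within a data center
    static final double MEM_REF_US = 0.1;       // one main-memory reference

    // How much faster an in-process memory access is than a networked DB access.
    static double speedup() {
        return TCP_SAME_DC_US / MEM_REF_US;
    }

    // Network latency alone for a recommender request that fans out into N queries.
    static double requestLatencyMs(int queries) {
        return queries * TCP_SAME_DC_US / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(speedup());              // 5000.0
        System.out.println(requestLatencyMs(1000)); // 500.0 -- half a second per request
    }
}
```

This is why the advice throughout the thread is the same regardless of which database holds the data: keep the data model in the recommender's own process memory.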
Re: Which database should I use with Mahout
ok, got it, thanks. On Sun, May 19, 2013 at 8:20 PM, Sean Owen wrote: > I'm first saying that you really don't want to use the database as a > data model directly. It is far too slow. > Instead you want to use a data model implementation that reads all of > the data, once, serially, into memory. And in that case, it makes no > difference where the data is being read from, because it is read just > once, serially. A file is just as fine as a fancy database. In fact > it's probably easier and faster. > > On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin > wrote: >> Thanks Sean, but I could not get your answer. Can you please explain it >> again? >> >> >> On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: >>> It doesn't matter, in the sense that it is never going to be fast >>> enough for real-time at any reasonable scale if actually run off a >>> database directly. One operation results in thousands of queries. It's >>> going to read data into memory anyway and cache it there. So, whatever >>> is easiest for you. The simplest solution is a file. >>> >>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz >>> wrote: Hi, I would like to use Mahout to make recommendations on my web site. Since the data is going to be big, hopefully, I plan to use hadoop implementations of the recommender algorithms. I'm currently storing the data in mysql. Should I continue with it or should I switch to a nosql database such as mongodb or something else? Thanks Ahmet
Re: Which database should I use with Mahout
I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from, because it is read just once, serially. A file is just as fine as a fancy database. In fact it's probably easier and faster. On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin wrote: > Thanks Sean, but I could not get your answer. Can you please explain it again? > > > On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: >> It doesn't matter, in the sense that it is never going to be fast >> enough for real-time at any reasonable scale if actually run off a >> database directly. One operation results in thousands of queries. It's >> going to read data into memory anyway and cache it there. So, whatever >> is easiest for you. The simplest solution is a file. >> >> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz >> wrote: >>> Hi, >>> I would like to use Mahout to make recommendations on my web site. Since >>> the data is going to be big, hopefully, I plan to use hadoop >>> implementations of the recommender algorithms. >>> >>> I'm currently storing the data in mysql. Should I continue with it or >>> should I switch to a nosql database such as mongodb or something else? >>> >>> Thanks >>> Ahmet
Re: Which database should I use with Mahout
Thanks Sean, but I could not get your answer. Can you please explain it again? On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote: > It doesn't matter, in the sense that it is never going to be fast > enough for real-time at any reasonable scale if actually run off a > database directly. One operation results in thousands of queries. It's > going to read data into memory anyway and cache it there. So, whatever > is easiest for you. The simplest solution is a file. > > On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz > wrote: >> Hi, >> I would like to use Mahout to make recommendations on my web site. Since the >> data is going to be big, hopefully, I plan to use hadoop implementations of >> the recommender algorithms. >> >> I'm currently storing the data in mysql. Should I continue with it or should >> I switch to a nosql database such as mongodb or something else? >> >> Thanks >> Ahmet
Re: Which database should I use with Mahout
It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The simplest solution is a file. On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz wrote: > Hi, > I would like to use Mahout to make recommendations on my web site. Since the > data is going to be big, hopefully, I plan to use hadoop implementations of > the recommender algorithms. > > I'm currently storing the data in mysql. Should I continue with it or should > I switch to a nosql database such as mongodb or something else? > > Thanks > Ahmet