Re: Large scale ranked recommendation
(Following up a rather old thread:) Hi Christopher, I understand how you might use nearest neighbors for item-item recommendations, but how do you use it for top-N items per user? Thanks! Apu

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p25913.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Large scale ranked recommendation
Christopher, that's really a great idea: searching in the latent factor space rather than computing each entry of the matrix reduces the complexity of the problem drastically from the naive O(n*m). Since our data is not that huge, I will try exact neighborhood search first and fall back to approximate search if that doesn't work. I will look into Annoy. Thanks.
Re: Large scale ranked recommendation
If you are performing recommendations via a latent factor model, then I highly recommend you look into methods of "approximate nearest neighbors". At Spotify we batch-process top-N recommendations for 40M users with a catalog of > 40M items, but we avoid the naive O(n*m) process you are describing by performing an approximate nearest-neighbor search. There are a bunch of open-source packages you can use, including our own https://github.com/spotify/annoy, which uses random projections in your latent factor space to build a forest of trees that supports fast nearest-neighbor lookups.
Re: Large scale ranked recommendation
Agree GPUs may be interesting for this kind of massively parallel linear algebra on reasonably sized vectors. These projects might be of interest in this regard: https://github.com/BIDData/BIDMach https://github.com/BIDData/BIDMat https://github.com/dlwh/gust Nick
Re: Large scale ranked recommendation
Thanks Nick, the real-time suggestion is good; we will see if we can add that to our deployment strategy, and you are correct that we may not need recommendations for every user. We will try adding more resources and broadcasting the item features as you suggest, since currently they don't seem to be huge. As both users and items will continue to grow in the future, I think a few GPU nodes will suffice to serve recommendations faster after learning the model with Spark. It would be great to have built-in GPU support in Spark, to leverage the GPU capability of nodes for performing these flops faster.
Re: Large scale ranked recommendation
Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui
Re: Large scale ranked recommendation
And you might want to apply clustering beforehand. It is likely that not every user and every item is unique. Bertrand Dechoux
Re: Large scale ranked recommendation
It is very true that making predictions in batch for all 1 million users against the 10k items will be quite onerous in terms of computation. I have run into this issue too in making batch predictions.

Some ideas:

1. Do you really need to generate recommendations for each user in batch? How are you serving these recommendations? In most cases, you only need to make recs when a user is actively interacting with your service or product etc. Doing it all in batch tends to be a big waste of computation resources. In our system, for example, we are serving them in real time (as a user arrives at a web page, say, our customer hits our API for recs), so we only generate the rec at that time. You can take a look at Oryx for this (https://github.com/cloudera/oryx); though it does not yet support Spark, you may be able to save the model into the correct format in HDFS and have Oryx read the data.

2. If you do need to make the recs in batch, then I would suggest:
(a) because you have few items, collect the item vectors and form a matrix.
(b) broadcast that matrix
(c) do a mapPartitions on the user vectors. Form a user matrix from the vectors in each partition (maybe create quite a few partitions to keep each user matrix not too big)
(d) do a value call on the broadcasted item matrix
(e) now for each partition you have the (small) item matrix and a (larger) user matrix. Do a matrix multiply and you end up with a (U x I) matrix with the scores for each user in the partition. Because you are using BLAS here, it will be significantly faster than individually computed dot products
(f) sort the scores for each user and take the top K
(g) save or collect and do whatever with the scores

3. In conjunction with (2), you can try throwing more resources at the problem too.

If you access the underlying Breeze vectors (I think the toBreeze method is private, so you may have to re-implement it), you can do all this using Breeze (e.g. concatenating vectors to make matrices, iterating and whatnot).

Hope that helps

Nick
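To make the per-partition math of step 2 concrete, here is a minimal numpy sketch with toy sizes. The names and dimensions are made up; in Spark, item_matrix would be the broadcast value and user_matrix the stacked vectors of one mapPartitions partition:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users_in_partition, n_items, n_factors, k = 500, 10_000, 50, 10

item_matrix = rng.normal(size=(n_items, n_factors))               # (a)-(b): collected and broadcast
user_matrix = rng.normal(size=(n_users_in_partition, n_factors))  # (c): one partition's users

# (e): one BLAS matrix multiply yields all scores for the partition, shape (U x I)
scores = user_matrix @ item_matrix.T

# (f): top-K per user; argpartition avoids fully sorting all 10k items per row
top_k_unsorted = np.argpartition(-scores, k, axis=1)[:, :k]
rows = np.arange(n_users_in_partition)[:, None]
order = np.argsort(-scores[rows, top_k_unsorted], axis=1)
top_k = top_k_unsorted[rows, order]  # item ids, best first, one row per user
```

The single `@` call is where the BLAS win comes from: one cache-friendly GEMM instead of U*I separately dispatched dot products.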
Re: Large scale ranked recommendation
Yes, that's what prediction is doing: taking dot products or applying a sigmoid function for each (user, item) pair. For 1 million users and 10K items, that is 10 billion pairs.
Re: Large scale ranked recommendation
Hi, are you suggesting that taking simple vector dot products or applying a sigmoid function on 10K x 1M data takes 5 hrs?
Re: Large scale ranked recommendation
We are using the regression models that come with the *mllib* package in Spark.
Large scale ranked recommendation
Hi,

I am trying to develop a recommender system for about 1 million users and 10 thousand items. Currently it is a simple regression-based model where, for every (user, item) pair in the dataset, we generate some features and learn a model from them. Up through training and evaluation everything is fine; the bottleneck is prediction and ranking for deployment, since at the end of the day we need to recommend each user their top 10 personalized items. To do this, for every user I need to use the model to predict his rating/preference on all items and take the top 10 items from the list. Hence, after learning the model, I need to do 10K x 1 million predictions (model.predict(featureVector)).

Currently I have the following process; the feature vectors are sparse and of length ~300 each:

*1. userFeatures:RDD[(Int, Vector)] , itemFeatures:RDD[(Int, Vector)]*

I do a cartesian product of the above to generate every (user, item) combination and the corresponding feature vector:

*2. val allUIFeat:RDD[(Int, Int, Vector)] = userFeatures.cartesian(itemFeatures).map(...)*

Then I use the model to do prediction as follows:

*3. val allUIPred:RDD[(Int, Int, Double)] = allUIFeat.map{x => (x._1, x._2, model.predict(x._3))}*

*4. Then we group by user and sort to get the top 10 items.*

We are not able to complete step 3 above; it is taking a really long time (~5 hrs) to get all the predictions, which is really long considering we already have the model and it just needs to do some computation for prediction. I have tried partitioning userFeatures across 800 partitions before doing the above steps, but it was of no help. I am using about 100 executors, 2 cores each, with 2 GB RAM per executor. Are there any suggestions to make these predictions fast?
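At toy scale, the pipeline above boils down to the following pure-Python sketch. The `predict` function here is a hypothetical stand-in for model.predict (a linear model over an assumed elementwise feature combination); the point is that steps 2-3 touch every one of the n_users * n_items pairs, which at real scale is 1M x 10K = 10 billion predict calls:

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_features = 100, 50, 20  # toy sizes; real case is 1M x 10K

user_features = {u: rng.normal(size=n_features) for u in range(n_users)}
item_features = {i: rng.normal(size=n_features) for i in range(n_items)}
weights = rng.normal(size=n_features)  # stand-in for the learned regression model

def predict(user_vec, item_vec):
    # hypothetical pair feature: elementwise product fed to a linear model
    return float(weights @ (user_vec * item_vec))

# Steps 2-3: score every (user, item) pair -- the O(n*m) bottleneck.
# Step 4: keep only the top 10 items per user.
top10 = {
    u: heapq.nlargest(10, item_features, key=lambda i: predict(uv, item_features[i]))
    for u, uv in user_features.items()
}
```

Each per-pair predict call is cheap; it is the sheer number of pairs, plus the shuffle implied by the cartesian product, that dominates the runtime.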