Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui
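A quick back-of-envelope check supports the "few MBs" estimate. The factor rank below is an illustrative assumption (the thread does not state one); only the 10K item count comes from the thread:

```python
# Rough size of the item-factor matrix to broadcast.
# 10,000 items from the thread; rank 50 and 8-byte doubles are
# assumed values for illustration only.
num_items = 10_000
rank = 50
bytes_per_double = 8

size_bytes = num_items * rank * bytes_per_double
size_mb = size_bytes / 1e6
print(f"{size_mb:.0f} MB")  # 4 MB
```

Even doubling the rank keeps the broadcast under 10 MB, which is cheap to ship to every executor.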
> On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
> And you might want to apply clustering first. It is likely that not every
> user and every item is unique.
>
> Bertrand Dechoux
>
>> On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>
>> It is very true that making predictions in batch for all 1 million users
>> against the 10K items will be quite onerous in terms of computation. I have
>> run into this issue too when making batch predictions.
>>
>> Some ideas:
>>
>> 1. Do you really need to generate recommendations for each user in batch?
>> How are you serving these recommendations? In most cases, you only need to
>> make recs when a user is actively interacting with your service or product.
>> Doing it all in batch tends to be a big waste of computation resources.
>>
>> In our system, for example, we serve recs in real time (as a user arrives
>> at a web page, say, our customer hits our API for recs), so we only
>> generate the rec at that time. You can take a look at Oryx for this
>> (https://github.com/cloudera/oryx). Though it does not yet support Spark,
>> you may be able to save the model into the correct format in HDFS and have
>> Oryx read the data.
>>
>> 2. If you do need to make the recs in batch, then I would suggest:
>> (a) Because you have few items, collect the item vectors and form a matrix.
>> (b) Broadcast that matrix.
>> (c) Do a mapPartitions on the user vectors. Form a user matrix from the
>> vectors in each partition (perhaps create quite a few partitions to keep
>> each user matrix reasonably small).
>> (d) Call value on the broadcast item matrix.
>> (e) Now, within each partition, you have the (small) item matrix and a
>> (larger) user matrix. Do a matrix multiply and you end up with a (U x I)
>> matrix holding the scores for each user in the partition.
>> Because you are using BLAS here, it will be significantly faster than
>> individually computed dot products.
>> (f) Sort the scores for each user and take the top K.
>> (g) Save or collect the scores and do whatever you need with them.
>>
>> 3. In conjunction with (2), you can try throwing more resources at the
>> problem too.
>>
>> If you access the underlying Breeze vectors (I think the toBreeze method
>> is private, so you may have to re-implement it), you can do all of this
>> using Breeze (e.g. concatenating vectors to make matrices, iterating, and
>> so on).
>>
>> Hope that helps
>>
>> Nick
>>
>>> On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma <sharm...@umn.edu> wrote:
>>>
>>> Yes, that's what prediction should be doing: taking dot products or
>>> applying the sigmoid function for each user-item pair. For 1 million
>>> users and 10K items there are 10 billion pairs.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
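For reference, the per-partition computation in Nick's steps (a)-(g) can be sketched outside Spark with plain NumPy. This is only an illustration of the matrix-multiply-then-top-K idea; all dimensions and variable names are made up, random data stands in for the learned factors, and in a real job `item_matrix` would come from the broadcast variable and `user_matrix` from the vectors in one mapPartitions partition:

```python
import numpy as np

rng = np.random.default_rng(0)
rank, num_items, users_in_partition, top_k = 10, 100, 50, 5

# (a)-(b): item vectors collected into one (I x rank) matrix;
# in Spark this small matrix would be broadcast to every executor.
item_matrix = rng.standard_normal((num_items, rank))

# (c): inside mapPartitions, stack this partition's user vectors
# into a (U x rank) matrix.
user_matrix = rng.standard_normal((users_in_partition, rank))

# (e): one BLAS-backed multiply yields all (U x I) scores at once,
# rather than U * I individually computed dot products.
scores = user_matrix @ item_matrix.T

# (f): top-K item indices per user; argpartition avoids fully sorting
# all I scores, then only the K candidates are ordered by score.
candidates = np.argpartition(-scores, top_k, axis=1)[:, :top_k]
rows = np.arange(users_in_partition)[:, None]
top_items = candidates[rows, np.argsort(-scores[rows, candidates], axis=1)]

print(top_items.shape)  # (50, 5)
```

Keeping each partition's user matrix modest (step (c)) bounds the size of the intermediate `scores` matrix, which is the memory hot spot here: it holds U x I floats per partition.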