And you might want to apply clustering beforehand. It is likely that not every user and every item is unique.
Bertrand Dechoux

On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> It is very true that making predictions in batch for all 1 million users
> against the 10k items will be quite onerous in terms of computation. I have
> run into this issue too in making batch predictions.
>
> Some ideas:
>
> 1. Do you really need to generate recommendations for each user in batch?
> How are you serving these recommendations? In most cases, you only need to
> make recs when a user is actively interacting with your service or product
> etc. Doing it all in batch tends to be a big waste of computation resources.
>
> In our system, for example, we are serving them in real time (as a user
> arrives at a web page, say, our customer hits our API for recs), so we only
> generate the rec at that time. You can take a look at Oryx for this
> (https://github.com/cloudera/oryx). Though it does not yet support Spark,
> you may be able to save the model into the correct format in HDFS and have
> Oryx read the data.
>
> 2. If you do need to make the recs in batch, then I would suggest:
> (a) Because you have few items, collect the item vectors and form a matrix.
> (b) Broadcast that matrix.
> (c) Do a mapPartitions on the user vectors. Form a user matrix from the
> vectors in each partition (maybe create quite a few partitions to keep each
> user matrix from getting too big).
> (d) Do a value call on the broadcasted item matrix.
> (e) Now, for each partition, you have the (small) item matrix and a
> (larger) user matrix. Do a matrix multiply and you end up with a (U x I)
> matrix with the scores for each user in the partition. Because you are
> using BLAS here, it will be significantly faster than individually
> computed dot products.
> (f) Sort the scores for each user and take the top K.
> (g) Save, or collect and do whatever with, the scores.
>
> 3.
> In conjunction with (2), you can try throwing more resources at the
> problem too.
>
> If you access the underlying Breeze vectors (I think the toBreeze method
> is private, so you may have to re-implement it), you can do all of this
> using Breeze (e.g. concatenating vectors to make matrices, iterating, and
> whatnot).
>
> Hope that helps
>
> Nick
>
>
> On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma <sharm...@umn.edu> wrote:
>
>> Yes, that's what prediction should be doing: taking dot products or the
>> sigmoid function for each (user, item) pair. For 1 million users and 10K
>> items there are 10 billion pairs.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
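For step 2(c) above, the key idea is that inside mapPartitions you receive an iterator of user vectors and stack them into blocks so each in-memory user matrix stays bounded. A minimal Spark-free sketch of that blocking step, using plain Scala collections in place of Spark/MLlib types (the `UserBlocks` object and its names are illustrative, not any Spark API):

```scala
// Group an iterator of (userId, userVector) pairs into fixed-size blocks.
// In Spark this logic would live inside a mapPartitions closure; here it is
// shown standalone. Each block carries the user IDs alongside the stacked
// vectors so top-K results can be mapped back to users afterwards.
object UserBlocks {
  def blocks(users: Iterator[(Long, Array[Double])],
             blockSize: Int): Iterator[(Array[Long], Array[Array[Double]])] =
    users.grouped(blockSize).map { group =>
      (group.map(_._1).toArray, group.map(_._2).toArray)
    }
}
```

Bounding the block size caps the size of the (U x I) score matrix produced per block, which is what keeps step 2(e) from blowing up memory on partitions with many users.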
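Steps 2(e)-(f) can likewise be sketched in plain Scala: multiply a block's user matrix against the (broadcast) item matrix to get all scores at once, then take the top K per user. This uses hand-rolled loops over arrays rather than actual BLAS or Breeze, so it only illustrates the shape of the computation; `BlockScore` and its method names are made up for this example:

```scala
// Blocked scoring: scores(u)(i) = dot(userMat(u), itemMat(i)), i.e. the
// (U x F) user matrix times the transpose of the (I x F) item matrix.
// In the real pipeline this multiply would go through BLAS for speed.
object BlockScore {
  def score(userMat: Array[Array[Double]],
            itemMat: Array[Array[Double]]): Array[Array[Double]] =
    userMat.map { uVec =>
      itemMat.map { iVec =>
        var s = 0.0
        var k = 0
        while (k < uVec.length) { s += uVec(k) * iVec(k); k += 1 }
        s
      }
    }

  // Top-K item indices per user, highest score first (step 2(f)).
  def topK(scores: Array[Array[Double]], k: Int): Array[Array[Int]] =
    scores.map { row =>
      row.zipWithIndex.sortBy(-_._1).take(k).map(_._2)
    }
}
```

With 10k items and modest factor dimension, the per-block score matrix is small enough to sort in memory, which is why the top-K step can happen inside the same partition before anything is collected.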