It is very true that making predictions in batch for all 1 million users against the 10k items is computationally onerous: that is 10 billion user-item scores. I have run into this issue too when making batch predictions.
Some ideas:

1. Do you really need to generate recommendations for each user in batch? How are you serving these recommendations? In most cases, you only need to make recs when a user is actively interacting with your service or product. Doing it all in batch tends to be a big waste of computational resources. In our system, for example, we serve them in real time (as a user arrives at a web page, say, our customer hits our API for recs), so we only generate the rec at that time. You could take a look at Oryx for this (https://github.com/cloudera/oryx). Though it does not yet support Spark, you may be able to save the model into the correct format in HDFS and have Oryx read the data.

2. If you do need to make the recs in batch, then I would suggest:
(a) Because you have few items, collect the item vectors and form a matrix.
(b) Broadcast that matrix.
(c) Do a mapPartitions on the user vectors, and form a user matrix from the vectors in each partition (you may want quite a few partitions to keep each user matrix from getting too big).
(d) Do a value call on the broadcast item matrix.
(e) Now, within each partition, you have the (small) item matrix and a (larger) user matrix. Do a matrix multiply and you end up with a (U x I) matrix holding the scores for every user in the partition. Because this goes through BLAS, it will be significantly faster than individually computed dot products.
(f) Sort the scores for each user and take the top K.
(g) Save or collect the scores and do whatever you need with them.

3. In conjunction with (2), you can also try throwing more resources at the problem.

If you access the underlying Breeze vectors (I think the toBreeze method is private, so you may have to re-implement it), you can do all of this using Breeze (e.g. concatenating vectors to make matrices, iterating, and so on).
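To make the per-partition step in (2) concrete, here is a minimal sketch of (e)-(f): one matrix multiply for all scores, then a top-K per user. In the actual Spark job this would be Scala/Breeze code running inside mapPartitions against the broadcast item matrix; the NumPy version below (the function name `score_partition` and all variable names are my own, for illustration) just shows the linear-algebra core.

```python
import numpy as np

def score_partition(item_matrix, user_ids, user_matrix, k):
    """Score every user in one partition against all items; keep the top k.

    item_matrix: (num_items, rank) array -- the broadcast item factors
    user_matrix: (num_users_in_partition, rank) array -- stacked user factors
    Returns a list of (user_id, [(item_index, score), ...]) pairs.
    """
    # One BLAS-backed matrix multiply produces the full (U x I) score
    # matrix -- far faster than U * I individually computed dot products.
    scores = user_matrix @ item_matrix.T  # shape (U, I)
    results = []
    for uid, row in zip(user_ids, scores):
        # argpartition finds the k largest in O(I); then sort just those k.
        top = np.argpartition(-row, k - 1)[:k]
        top = top[np.argsort(-row[top])]
        results.append((uid, [(int(i), float(row[i])) for i in top]))
    return results
```

The same shape-of-computation carries over to Breeze: build the user matrix column-by-column from the partition's vectors, multiply against the broadcast item matrix, then take the top K per column.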
Hope that helps

Nick

On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma <sharm...@umn.edu> wrote:
> Yes, thats what prediction should be doing, taking dot products or sigmoid
> function for each user,item pair. For 1 million users and 10 K items data
> there are 10 billion pairs.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.