Nick's suggestion is a good approach for your data. The item factors to 
broadcast should only be a few MB.
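(For example, assuming a rank of around 50, that is 10,000 items x 50 factors x 
8 bytes per double, i.e. roughly 4 MB.)

-Xiangrui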

> On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
> 
> You might also want to apply clustering beforehand. It is likely that not every 
> user and every item is truly unique.
> 
> Bertrand Dechoux
> 
> 
>> On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath <nick.pentre...@gmail.com> 
>> wrote:
>> It is very true that making predictions in batch for all 1 million users 
>> against the 10k items is computationally quite onerous. I have run into this 
>> issue too when making batch predictions.
>> 
>> Some ideas:
>> 
>> 1. Do you really need to generate recommendations for each user in batch? 
>> How are you serving these recommendations? In most cases, you only need to 
>> make recs when a user is actively interacting with your service or product. 
>> Doing it all in batch tends to be a big waste of compute resources.
>> 
>> In our system, for example, we serve recommendations in real time (as a user 
>> arrives at a web page, say, our customer hits our API for recs), so we only 
>> generate the recs at that point. You can take a look at Oryx for this 
>> (https://github.com/cloudera/oryx); although it does not yet support Spark, 
>> you may be able to save the model into the correct format in HDFS and have 
>> Oryx read it from there.
>> 
>> 2. If you do need to make the recs in batch, then I would suggest the 
>> following (there is a rough Scala sketch of this at the end of this message):
>> (a) because you have few items, collect the item vectors and form a matrix.
>> (b) broadcast that matrix.
>> (c) do a mapPartitions on the user vectors, and form a user matrix from the 
>> vectors in each partition (you may want quite a few partitions so that each 
>> user matrix stays reasonably small).
>> (d) call value on the broadcast item matrix to get it locally in each task.
>> (e) now, for each partition, you have the (small) item matrix and a (larger) 
>> user matrix. Do a matrix multiply and you end up with a (U x I) matrix of 
>> scores for each user in the partition. Because this goes through BLAS, it 
>> will be significantly faster than computing each dot product individually.
>> (f) sort the scores for each user and take the top K.
>> (g) save or collect the results and do whatever you need with the scores.
>> 
>> 3. In conjunction with (2), you can also try throwing more resources at the 
>> problem.
>> 
>> If you access the underlying Breeze vectors (I think the toBreeze method is 
>> private, so you may have to re-implement it), you can do all of this with 
>> Breeze (e.g. concatenating vectors to make matrices, iterating, and so on).
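>> 
>> Roughly, an (untested) sketch of (2) might look something like the following. 
>> It assumes a trained MLlib MatrixFactorizationModel called model, a 
>> SparkContext sc, Breeze on the classpath, and the top 10 recs per user -- 
>> treat it as a starting point rather than working code:
>> 
>> import breeze.linalg.DenseMatrix
>> 
>> val k = 10  // number of recommendations per user
>> 
>> // (a) collect the (small) item factors on the driver, one column per item
>> val itemFactors = model.productFeatures.collect()  // Array[(Int, Array[Double])]
>> val itemIds = itemFactors.map(_._1)
>> val rank = itemFactors.head._2.length
>> // Breeze matrices are column-major, so this is a (rank x numItems) matrix
>> val itemMatrix = new DenseMatrix(rank, itemIds.length, itemFactors.flatMap(_._2))
>> 
>> // (b) broadcast the item matrix together with the item id lookup
>> val bcItems = sc.broadcast((itemIds, itemMatrix))
>> 
>> // (c)-(g) score the users partition by partition, in small blocks
>> val topK = model.userFeatures.mapPartitions { users =>
>>   val (ids, items) = bcItems.value              // (d) read the broadcast value
>>   users.grouped(1000).flatMap { block =>        // keep each user matrix small
>>     val userIds = block.map(_._1).toArray
>>     // (blockSize x rank) user matrix, built column-major and then transposed
>>     val userMatrix =
>>       new DenseMatrix(rank, userIds.length, block.flatMap(_._2).toArray).t
>>     val scores = userMatrix * items             // (e) BLAS: (blockSize x numItems)
>>     userIds.zipWithIndex.map { case (user, i) =>
>>       val ranked = ids.indices.map(j => (ids(j), scores(i, j))).sortBy(-_._2)
>>       (user, ranked.take(k))                    // (f) top K (itemId, score) per user
>>     }
>>   }
>> }
>> // (g) then save or collect, e.g. topK.saveAsTextFile(...) or topK.collect()
>> 
>> The grouped(1000) block size is just a guess -- tune it so that each 
>> (blockSize x numItems) score matrix fits comfortably in memory.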
>> 
>> Hope that helps
>> 
>> Nick
>> 
>> 
>>> On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma <sharm...@umn.edu> wrote:
>>> Yes, that's what prediction should be doing: taking a dot product or applying
>>> a sigmoid function for each user-item pair. For 1 million users and 10K items
>>> there are 10 billion such pairs.
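>>> 
>>> For reference, the brute-force version of this in MLlib would be something
>>> along the following lines (assuming a trained MatrixFactorizationModel called
>>> model and RDDs userIds and itemIds holding the 1M user ids and 10K item ids);
>>> this is exactly what becomes too expensive at this scale:
>>> 
>>> // score every (user, item) pair directly: ~10^10 pairs are materialised
>>> val allPairs = userIds.cartesian(itemIds)   // RDD[(Int, Int)]
>>> val predictions = model.predict(allPairs)   // RDD[Rating], one score per pair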
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
