You might also want to apply clustering beforehand; it is likely that not
every user and every item is truly unique.

Bertrand Dechoux


On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> It is very true that making predictions in batch for all 1 million users
> against the 10k items will be computationally onerous. I have run into this
> issue too when making batch predictions.
>
> Some ideas:
>
> 1. Do you really need to generate recommendations for every user in batch?
> How are you serving these recommendations? In most cases, you only need to
> make recs when a user is actively interacting with your service or product.
> Doing it all in batch tends to be a big waste of compute resources.
>
> In our system, for example, we serve recommendations in real time (as a
> user arrives at a web page, say, our customer hits our API for recs), so we
> only generate the rec at that moment. You could take a look at Oryx for
> this (https://github.com/cloudera/oryx). Though it does not yet support
> Spark, you may be able to save the model in the correct format in HDFS and
> have Oryx read the data.
>
> 2. If you do need to make the recs in batch, then I would suggest:
> (a) because you have few items, collect the item vectors and form a matrix
> (b) broadcast that matrix
> (c) do a mapPartitions on the user vectors, forming a user matrix from the
> vectors in each partition (use enough partitions to keep each per-partition
> user matrix reasonably small)
> (d) inside each partition, call value on the broadcast variable to get the
> item matrix
> (e) now each partition has the (small) item matrix and a (larger) user
> matrix. A single matrix multiply gives you a (U x I) matrix of scores for
> every user in the partition. Because this goes through BLAS, it will be
> significantly faster than individually computed dot products
> (f) sort the scores for each user and take the top K
> (g) save or collect the results and do whatever you need with the scores
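The linear algebra in steps (a)-(f) above can be sketched with NumPy (this is
only a sketch of the per-partition math, not of the Spark API; the array names,
sizes, and the use of argpartition for the top-K step are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, part_users, rank, k = 10_000, 500, 50, 10

# Item factor matrix (I x r): collected on the driver and broadcast once.
item_matrix = rng.standard_normal((n_items, rank))

# User factor matrix for one partition (U x r), built inside mapPartitions.
user_matrix = rng.standard_normal((part_users, rank))

# One BLAS-backed multiply scores the whole partition at once: (U x I).
scores = user_matrix @ item_matrix.T

# Top-K items per user: argpartition avoids a full sort over all 10k items,
# then only the K candidates per user are sorted.
cand = np.argpartition(-scores, k, axis=1)[:, :k]
rows = np.arange(part_users)[:, None]
order = np.argsort(-scores[rows, cand], axis=1)
top_k = cand[rows, order]  # (U x K) item indices, best first
```

The same shape of computation applies whether the factors come from ALS or any
other model that scores by dot product.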
>
> 3. In conjunction with (2), you can also try throwing more resources at the
> problem.
>
> If you access the underlying Breeze vectors (I think the toBreeze method is
> private, so you may have to re-implement it), you can do all of this with
> Breeze (e.g. concatenating vectors into matrices, iterating, and so on).
>
> Hope that helps
>
> Nick
>
>
> On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma <sharm...@umn.edu> wrote:
>
>> Yes, that's what prediction should be doing: taking the dot product or
>> applying a sigmoid function for each (user, item) pair. For 1 million
>> users and 10K items, that is 10 billion pairs.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>