It's very true that making batch predictions for all 1 million users
against the 10k items will be quite onerous computationally. I have run
into this issue too when making batch predictions.

Some ideas:

1. Do you really need to generate recommendations for every user in batch?
How are you serving these recommendations? In most cases, you only need to
make recs when a user is actively interacting with your service or
product. Doing it all in batch tends to be a big waste of computation
resources.

In our system, for example, we serve recs in real time (as a user arrives
at a web page, say, our customer hits our API for recs), so we only
generate the recs at that time. You could take a look at Oryx for this
(https://github.com/cloudera/oryx). Though it does not yet support Spark,
you may be able to save the model in the correct format in HDFS and have
Oryx read it.

2. If you do need to make the recs in batch, then I would suggest the
following (there's a rough Scala sketch after the list):
(a) because you have few items, collect the item vectors and form a
matrix
(b) broadcast that item matrix
(c) do a mapPartitions on the user vectors, and form a user matrix from
the vectors in each partition (you may want quite a few partitions so
that each user matrix stays reasonably small)
(d) inside each partition, call .value on the broadcast variable to get
the item matrix
(e) now, for each partition, you have the (small) item matrix and a
(larger) user matrix. Do a matrix multiply and you end up with a (U x I)
matrix holding the scores for each user in the partition. Because this
goes through BLAS, it will be significantly faster than individually
computed dot products
(f) sort the scores for each user and take the top K
(g) save or collect the results and do whatever you need with the scores
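
Here's a rough sketch of (a) through (g), assuming a trained
MatrixFactorizationModel from MLlib's ALS (so userFeatures and
itemFeatures are RDD[(Int, Array[Double])]). The rank and k values and
the output path are just placeholders:

import breeze.linalg.DenseMatrix

val rank = 10 // ALS factor dimension (placeholder)
val k = 10    // number of recs per user (placeholder)

// (a) collect the item factors into a rank x I matrix
// (Breeze matrices are column-major, so each column is one item vector)
val items = model.itemFeatures.collect()
val itemIds = items.map(_._1)
val itemMatrix = new DenseMatrix(rank, items.length, items.flatMap(_._2))

// (b) broadcast the item ids and item matrix together
val bcItems = sc.broadcast((itemIds, itemMatrix))

// (c)-(f) score a whole partition of users with one BLAS multiply
val topK = model.userFeatures.mapPartitions { iter =>
  val (ids, im) = bcItems.value // (d)
  val users = iter.toArray
  if (users.isEmpty) Iterator.empty
  else {
    // (c) stack this partition's user vectors into a U x rank matrix
    val um = new DenseMatrix(rank, users.length, users.flatMap(_._2)).t
    // (e) one gemm call gives the full U x I score matrix
    val scores = um * im
    users.indices.iterator.map { u =>
      val row = scores(u, ::).t
      // (f) top K scoring items for this user
      val top = (0 until row.length)
        .map(i => (ids(i), row(i)))
        .sortBy(-_._2)
        .take(k)
      (users(u)._1, top)
    }
  }
}

// (g) save the results (path is a placeholder)
topK.saveAsTextFile("hdfs://.../recs")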

3. In conjunction with (2), you can also try throwing more resources at
the problem.

If you access the underlying Breeze vectors (I think the toBreeze method
is private, so you may have to re-implement it), you can do all of this
in Breeze (e.g. concatenating vectors to make matrices, iterating, and
so on).
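
For example, a minimal re-implementation of that private conversion
might look like this (a sketch covering MLlib's two public Vector types):

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// a public stand-in for the private toBreeze method
def toBreeze(v: Vector): BV[Double] = v match {
  case dv: DenseVector  => new BDV(dv.values)
  case sv: SparseVector => new BSV(sv.indices, sv.values, sv.size)
}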

Hope that helps

Nick


On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma <sharm...@umn.edu> wrote:

> Yes, that's what prediction should be doing: taking dot products or the
> sigmoid function for each user-item pair. For 1 million users and 10k
> items there are 10 billion pairs.
