There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098
Idea here is what Sean said...don't try to reconstruct the whole matrix which will be dense but pick a set of users and calculate topk recommendations for them using dense level 3 blas.....we are going to merge this for 1.4...this is useful in general for cross validating on prec@k measure to tune the model... Right now it uses level 1 blas but the next extension is to use level 3 blas to further make the compute faster... On Mar 18, 2015 6:48 AM, "Sean Owen" <so...@cloudera.com> wrote: > I don't think that you need memory to put the whole joined data set in > memory. However memory is unlikely to be the limiting factor, it's the > massive shuffle. > > OK, you really do have a large recommendation problem if you're > recommending for at least 7M users per day! > > My hunch is that it won't be fast enough to use the simple predict() > or recommendProducts() method repeatedly. There was a proposal to make > a recommendAll() method which you could crib > (https://issues.apache.org/jira/browse/SPARK-3066) but that looks like > still a work in progress since the point there was to do more work to > make it possibly scale. > > You may consider writing a bit of custom code to do the scoring. For > example cache parts of the item-factor matrix in memory on the workers > and score user feature vectors in bulk against them. > > There's a different school of though which is to try to compute only > what you need, on the fly, and cache it if you like. That is good in > that it doesn't waste effort and makes the result fresh, but, of > course, means creating or consuming some other system to do the > scoring and getting *that* to run fast. > > You can also look into techniques like LSH for probabilistically > guessing which tiny subset of all items are worth considering, but > that's also something that needs building more code. > > I'm sure a couple people could chime in on that here but it's kind of > a separate topic. > > On Wed, Mar 18, 2015 at 8:04 AM, Aram Mkrtchyan > <aram.mkrtchyan...@gmail.com> wrote: > > Thanks much for your reply. > > > > By saying on the fly, you mean caching the trained model, and querying it > > for each user joined with 30M products when needed? > > > > Our question is more about the general approach, what if we have 7M DAU? > > How the companies deal with that using Spark? > > > > > > On Wed, Mar 18, 2015 at 3:39 PM, Sean Owen <so...@cloudera.com> wrote: > >> > >> Not just the join, but this means you're trying to compute 600 > >> trillion dot products. It will never finish fast. Basically: don't do > >> this :) You don't in general compute all recommendations for all > >> users, but recompute for a small subset of users that were or are > >> likely to be active soon. (Or compute on the fly.) Is anything like > >> that an option? > >> > >> On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan > >> <aram.mkrtchyan...@gmail.com> wrote: > >> > Trying to build recommendation system using Spark MLLib's ALS. > >> > > >> > Currently, we're trying to pre-build recommendations for all users on > >> > daily > >> > basis. We're using simple implicit feedbacks and ALS. > >> > > >> > The problem is, we have 20M users and 30M products, and to call the > main > >> > predict() method, we need to have the cartesian join for users and > >> > products, > >> > which is too huge, and it may take days to generate only the join. Is > >> > there > >> > a way to avoid cartesian join to make the process faster? > >> > > >> > Currently we have 8 nodes with 64Gb of RAM, I think it should be > enough > >> > for > >> > the data. > >> > > >> > val users: RDD[Int] = ??? // RDD with 20M userIds > >> > val products: RDD[Int] = ??? // RDD with 30M productIds > >> > val ratings : RDD[Rating] = ??? // RDD with all user->product > >> > feedbacks > >> > > >> > val model = new ALS().setRank(10).setIterations(10) > >> > .setLambda(0.0001).setImplicitPrefs(true) > >> > .setAlpha(40).run(ratings) > >> > > >> > val usersProducts = users.cartesian(products) > >> > val recommendations = model.predict(usersProducts) > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >