Thanks, gen, for the helpful post.

Thank you, Sean; we're currently exploring this world of recommendations
with Spark, and your posts are very helpful to us.
We noticed that you're a co-author of "Advanced Analytics with Spark";
not to go too far off topic, but will it be finished soon?

On Wed, Mar 18, 2015 at 5:47 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think that you need to put the whole joined data set in
> memory. However, memory is unlikely to be the limiting factor; it's
> the massive shuffle.
>
> OK, you really do have a large recommendation problem if you're
> recommending for at least 7M users per day!
>
> My hunch is that it won't be fast enough to use the simple predict()
> or recommendProducts() method repeatedly. There was a proposal to add
> a recommendAll() method which you could crib from
> (https://issues.apache.org/jira/browse/SPARK-3066), but that looks
> like it's still a work in progress, since the point there was to do
> more work to make it scale.
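>
> For reference, the repeated per-user call would look something like
> this (a sketch only; activeUserIds is a hypothetical small set of
> users, collected to the driver since the model holds RDDs and can't
> be used inside another RDD's transformations):
>
> val activeUserIds: Array[Int] = ???   // small subset, on the driver
> val recs = activeUserIds.map { userId =>
>   userId -> model.recommendProducts(userId, 10)   // top 10 per user
> }.toMap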
>
> You may consider writing a bit of custom code to do the scoring. For
> example, cache parts of the item-factor matrix in memory on the
> workers and score user feature vectors in bulk against them.
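>
> A rough sketch of that idea (my names; assumes the full item-factor
> matrix fits in executor memory, which at 30M items x rank 10 is a
> few GB):
>
> val itemFactors = model.productFeatures.collect()  // (itemId, factors)
> val bcItems = sc.broadcast(itemFactors)
> val topK = 10
>
> val recs = model.userFeatures.mapPartitions { users =>
>   val items = bcItems.value
>   users.map { case (userId, uf) =>
>     val scored = items.map { case (itemId, pf) =>
>       var dot = 0.0
>       var i = 0
>       while (i < uf.length) { dot += uf(i) * pf(i); i += 1 }
>       (itemId, dot)
>     }
>     (userId, scored.sortBy(-_._2).take(topK))
>   }
> }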
>
> There's a different school of thought, which is to compute only
> what you need, on the fly, and cache it if you like. That is good in
> that it doesn't waste effort and keeps the results fresh, but, of
> course, it means building or adopting some other system to do the
> scoring, and getting *that* to run fast.
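>
> What that external scorer boils down to is a function like this
> (plain Scala, no Spark; illustrative only, with the factor vectors
> held in memory or a fast key-value store by the serving system):
>
> def scoreOnTheFly(userVec: Array[Double],
>                   itemVecs: Map[Int, Array[Double]],
>                   n: Int): Seq[(Int, Double)] =
>   itemVecs.toSeq
>     .map { case (id, iv) =>
>       (id, iv.zip(userVec).map { case (a, b) => a * b }.sum)
>     }
>     .sortBy(-_._2)
>     .take(n)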
>
> You can also look into techniques like LSH for probabilistically
> guessing which tiny subset of all the items is worth considering, but
> that also requires building more code.
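>
> A very rough illustration of the LSH idea (random hyperplanes for
> cosine similarity; all names and parameters here are made up, not a
> ready-made solution):
>
> import scala.util.Random
>
> val rank = 10                       // must match the ALS rank
> val numPlanes = 16
> val rng = new Random(42)
> val planes = Array.fill(numPlanes, rank)(rng.nextGaussian())
>
> def signature(v: Array[Double]): Int =
>   (0 until numPlanes).foldLeft(0) { (sig, i) =>
>     val dot = (0 until rank).map(j => planes(i)(j) * v(j)).sum
>     if (dot > 0) sig | (1 << i) else sig
>   }
>
> // Only items sharing a user's bucket become candidates for scoring.
> val itemBuckets = model.productFeatures
>   .map { case (itemId, f) => (signature(f), itemId) }
>   .groupByKey()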
>
> I'm sure a couple of people could chime in on that here, but it's
> kind of a separate topic.
>
> On Wed, Mar 18, 2015 at 8:04 AM, Aram Mkrtchyan
> <aram.mkrtchyan...@gmail.com> wrote:
> > Thanks much for your reply.
> >
> > By "on the fly", do you mean caching the trained model and querying
> > it for each user, joined with the 30M products, when needed?
> >
> > Our question is more about the general approach: what if we have 7M
> > DAU? How do companies deal with that using Spark?
> >
> >
> > On Wed, Mar 18, 2015 at 3:39 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Not just the join, but this means you're trying to compute 600
> >> trillion dot products. It will never finish fast. Basically: don't do
> >> this :) You don't in general compute all recommendations for all
> >> users, but recompute for a small subset of users that were or are
> >> likely to be active soon. (Or compute on the fly.) Is anything like
> >> that an option?
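> >>
> >> For example (a sketch; activeUsers is whatever small subset you can
> >> identify from your own activity logs):
> >>
> >> val activeUsers: RDD[Int] = ???   // e.g. users active in the last day
> >> val pairs = activeUsers.cartesian(products)
> >> val recs = model.predict(pairs)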
> >>
> >> On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan
> >> <aram.mkrtchyan...@gmail.com> wrote:
> >> > We're trying to build a recommendation system using Spark MLlib's
> >> > ALS.
> >> >
> >> > Currently, we're trying to pre-build recommendations for all users
> >> > on a daily basis. We're using simple implicit feedback and ALS.
> >> >
> >> > The problem is, we have 20M users and 30M products, and to call the
> >> > main predict() method we need the cartesian join of users and
> >> > products, which is huge, and it may take days just to generate the
> >> > join. Is there a way to avoid the cartesian join and make the
> >> > process faster?
> >> >
> >> > Currently we have 8 nodes with 64Gb of RAM; I think it should be
> >> > enough for the data.
> >> >
> >> > val users: RDD[Int] = ???        // RDD with 20M userIds
> >> > val products: RDD[Int] = ???     // RDD with 30M productIds
> >> > val ratings: RDD[Rating] = ???   // RDD with all user->product feedbacks
> >> >
> >> > val model = new ALS().setRank(10).setIterations(10)
> >> >   .setLambda(0.0001).setImplicitPrefs(true)
> >> >   .setAlpha(40).run(ratings)
> >> >
> >> > val usersProducts = users.cartesian(products)
> >> > val recommendations = model.predict(usersProducts)
> >
> >
>
