I don't think you need enough memory to hold the whole joined data set in
memory. In any case, memory is unlikely to be the limiting factor; the
massive shuffle is.

OK, you really do have a large recommendation problem if you're
recommending for at least 7M users per day!

My hunch is that it won't be fast enough to call the simple predict()
or recommendProducts() method repeatedly. There was a proposal to add a
recommendAll() method which you could crib from
(https://issues.apache.org/jira/browse/SPARK-3066), but that looks like
it's still a work in progress, since the point there was to do the extra
work needed to make it actually scale.

You might consider writing a bit of custom code to do the scoring: for
example, cache parts of the item-factor matrix in memory on the workers
and score user feature vectors in bulk against them, as in the sketch
below.
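
A rough sketch of what I mean (untested, and it assumes the item-factor
matrix, 30M vectors of rank 10, fits in one broadcast; if it doesn't, do
the same thing per block of items and merge the per-block top N):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

def recommendTopN(model: MatrixFactorizationModel, n: Int)
  : RDD[(Int, Array[(Int, Double)])] = {
  val sc = model.userFeatures.sparkContext
  // Ship (a block of) the item factors to every worker once.
  val itemFactors = sc.broadcast(model.productFeatures.collect())

  model.userFeatures.mapPartitions { users =>
    val items = itemFactors.value
    users.map { case (userId, uf) =>
      // Score this user's factor vector against every broadcast item.
      val scored = items.map { case (itemId, pf) =>
        var dot = 0.0
        var i = 0
        while (i < uf.length) { dot += uf(i) * pf(i); i += 1 }
        (itemId, dot)
      }
      (userId, scored.sortBy(-_._2).take(n))
    }
  }
}

In practice you'd keep a bounded priority queue per user rather than sort
all 30M scores, and multiply blocks of users against blocks of items with
BLAS, but that's the shape of the computation.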

There's a different school of thought, which is to compute only what
you need, on the fly, and cache it if you like. That's good in that it
doesn't waste effort and keeps the results fresh, but of course it means
building or adopting some other system to do the scoring, and getting
*that* to run fast.

You can also look into techniques like LSH (locality-sensitive hashing)
for probabilistically guessing which tiny subset of all items is worth
considering, but that too means building more code.
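
To give a flavor of the LSH idea (a toy sketch of random-hyperplane
hashing with arbitrary parameters, not something tuned for
maximum-inner-product search):

import scala.util.Random

val rank = 10        // must match the ALS rank
val numPlanes = 16   // signature length in bits
val rng = new Random(42)
// Random hyperplanes; vectors with the same sign pattern against them
// tend to point in similar directions.
val planes = Array.fill(numPlanes, rank)(rng.nextGaussian())

def signature(v: Array[Double]): Int = {
  var sig = 0
  var b = 0
  while (b < numPlanes) {
    var dot = 0.0
    var i = 0
    while (i < rank) { dot += planes(b)(i) * v(i); i += 1 }
    if (dot > 0) sig |= (1 << b)
    b += 1
  }
  sig
}

// Bucket items by signature, then score each user only against the items
// whose signature matches (or nearly matches) the user's:
// val itemBuckets = model.productFeatures.map { case (id, f) => (signature(f), (id, f)) }
// val userSigs    = model.userFeatures.map { case (id, f) => (signature(f), (id, f)) }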

I'm sure a couple of people here could chime in on that, but it's kind of
a separate topic.

On Wed, Mar 18, 2015 at 8:04 AM, Aram Mkrtchyan
<aram.mkrtchyan...@gmail.com> wrote:
> Thanks much for your reply.
>
> By saying "on the fly", do you mean caching the trained model and
> querying it for each user, joined with the 30M products, when needed?
>
> Our question is more about the general approach: what if we have 7M DAU?
> How do companies deal with that using Spark?
>
>
> On Wed, Mar 18, 2015 at 3:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Not just the join, but this means you're trying to compute 600
>> trillion dot products. It will never finish fast. Basically: don't do
>> this :) You don't in general compute all recommendations for all
>> users, but recompute for a small subset of users that were or are
>> likely to be active soon. (Or compute on the fly.) Is anything like
>> that an option?
>>
>> On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan
>> <aram.mkrtchyan...@gmail.com> wrote:
>> > We're trying to build a recommendation system using Spark MLlib's ALS.
>> >
>> > Currently, we're trying to pre-build recommendations for all users on a
>> > daily basis. We're using simple implicit feedback and ALS.
>> >
>> > The problem is that we have 20M users and 30M products, and to call the
>> > main predict() method we need the cartesian join of users and products,
>> > which is too huge; it may take days just to generate the join. Is there a
>> > way to avoid the cartesian join and make the process faster?
>> >
>> > Currently we have 8 nodes with 64 GB of RAM; I think that should be
>> > enough for the data.
>> >
>> > val users: RDD[Int] = ???           // RDD with 20M userIds
>> > val products: RDD[Int] = ???        // RDD with 30M productIds
>> > val ratings: RDD[Rating] = ???      // RDD with all user->product feedbacks
>> >
>> > val model = new ALS().setRank(10).setIterations(10)
>> >   .setLambda(0.0001).setImplicitPrefs(true)
>> >   .setAlpha(40).run(ratings)
>> >
>> > val usersProducts = users.cartesian(products)
>> > val recommendations = model.predict(usersProducts)
>
>
