Hi,

If you do a cartesian join to predict every user's preference for every
product, I don't think 8 nodes with 64GB of RAM each would be enough:
20M users x 30M products is 6*10^14 user-product pairs to score and store.
Recently I used ALS in a similar situation with just 10M users and 0.1M
products, and the minimum requirement was already 9 nodes with 10GB of RAM
each. Moreover, even if the job completes, it will take a very long time.
Maybe you should try to reduce the set of products you predict for each
client: in practice, you never need to predict a user's preference for
every product to make a recommendation. See the sketch below.
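
For example, here is a minimal sketch (assuming Spark MLlib 1.2+, where
MatrixFactorizationModel has a recommendProducts method; userId below is
an illustrative placeholder, and the ALS parameters are copied from your
snippet):

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Train exactly as in your snippet.
val model: MatrixFactorizationModel = new ALS()
  .setRank(10).setIterations(10)
  .setLambda(0.0001).setImplicitPrefs(true)
  .setAlpha(40).run(ratings)

// Ask for the top 10 products for a single user instead of
// materializing the full 20M x 30M cartesian join.
val userId: Int = 42 // illustrative placeholder
val top10: Array[Rating] = model.recommendProducts(userId, 10)

You can loop over only the users who actually need fresh recommendations,
which is usually a small fraction of all 20M.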

Hope this will be helpful.

Cheers
Gen


On Wed, Mar 18, 2015 at 12:13 PM, Aram Mkrtchyan <
aram.mkrtchyan...@gmail.com> wrote:

> Trying to build a recommendation system using Spark MLlib's ALS.
>
> Currently, we're trying to pre-build recommendations for all users on a
> daily basis. We're using simple implicit feedback and ALS.
>
> The problem is, we have 20M users and 30M products, and to call the main
> predict() method we need the cartesian join of users and products, which
> is huge; it may take days just to generate the join. Is there a way to
> avoid the cartesian join and make the process faster?
>
> Currently we have 8 nodes with 64GB of RAM each; I think that should be
> enough for the data.
>
> val users: RDD[Int] = ???           // RDD with 20M userIds
> val products: RDD[Int] = ???        // RDD with 30M productIds
> val ratings : RDD[Rating] = ???     // RDD with all user->product feedbacks
>
> val model = new ALS().setRank(10).setIterations(10)
>   .setLambda(0.0001).setImplicitPrefs(true)
>   .setAlpha(40).run(ratings)
>
> val usersProducts = users.cartesian(products)
> val recommendations = model.predict(usersProducts)
>
>
