Not just the join: that means computing 20M x 30M = 600 trillion dot
products. It will never finish fast. Basically: don't do this :) In
general you don't compute all recommendations for all users; you
recompute for the small subset of users that were recently active or
are likely to be active soon. (Or compute them on the fly.) Is
anything like that an option?
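
For example, a minimal sketch (activeUsers here is hypothetical; it's
whatever subset you can identify as recently active or likely to be
active soon):

val activeUsers: RDD[Int] = ???   // e.g. users seen in the last day

// Score only the (active user, product) pairs, not all 600 trillion
val activePairs = activeUsers.cartesian(products)
val recommendations = model.predict(activePairs)

// Or compute on the fly, one user at a time, with the model's
// built-in top-N method:
val someUserId = 42
val topN = model.recommendProducts(someUserId, 10)

recommendProducts scores a single user's factor vector against the
product factors and returns the top N, so no giant join is ever
materialized.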

On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan
<aram.mkrtchyan...@gmail.com> wrote:
> Trying to build a recommendation system using Spark MLlib's ALS.
>
> Currently, we're trying to pre-build recommendations for all users on a
> daily basis. We're using simple implicit feedback and ALS.
>
> The problem is that we have 20M users and 30M products, and to call the
> main predict() method we need the cartesian join of users and products,
> which is huge, and generating the join alone may take days. Is there a
> way to avoid the cartesian join and make the process faster?
>
> Currently we have 8 nodes with 64 GB of RAM each; I think that should be
> enough for the data.
>
> val users: RDD[Int] = ???           // RDD with 20M userIds
> val products: RDD[Int] = ???        // RDD with 30M productIds
> val ratings: RDD[Rating] = ???      // RDD with all user->product feedback
>
> val model = new ALS().setRank(10).setIterations(10)
>   .setLambda(0.0001).setImplicitPrefs(true)
>   .setAlpha(40).run(ratings)
>
> val usersProducts = users.cartesian(products)
> val recommendations = model.predict(usersProducts)
