It's not just the join: that means computing 600 trillion dot products, which will never finish quickly. In short: don't do this :) In general you don't compute all recommendations for all users; you recompute for the small subset of users that were recently active or are likely to be active soon. (Or you compute recommendations on the fly.) Is anything like that an option?
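The "recompute only for active users" idea boils down to scoring each candidate product against one user's factor vector and keeping the top N, instead of materializing every (user, product) pair. A minimal plain-Scala sketch of that per-user scoring, with toy rank-2 factor maps standing in for what you'd pull out of the trained model's userFeatures/productFeatures (the IDs and values below are made up for illustration):

```scala
// Hypothetical factor vectors as ALS would produce them (rank 2 here for
// brevity; the question uses rank 10). In a real job these would come from
// the fitted model's user and product factor RDDs, collected or looked up.
val userFeatures: Map[Int, Array[Double]] = Map(
  1 -> Array(0.9, 0.1),
  2 -> Array(0.2, 0.8)
)
val productFeatures: Map[Int, Array[Double]] = Map(
  101 -> Array(1.0, 0.0),
  102 -> Array(0.0, 1.0),
  103 -> Array(0.7, 0.7)
)

// The predicted preference for a (user, product) pair is just the dot
// product of their factor vectors.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Score every product for one user and keep the top n. This is the same
// work predict() does per pair, but restricted to users you actually need.
def topN(userId: Int, n: Int): Seq[(Int, Double)] =
  productFeatures.toSeq
    .map { case (pid, pf) => (pid, dot(userFeatures(userId), pf)) }
    .sortBy(-_._2)
    .take(n)

// Only recommend for users seen recently, not all 20M.
val activeUsers = Seq(1, 2)
val recs = activeUsers.map(u => u -> topN(u, 2))
println(recs)
```

With 20M users and 30M products this is still 30M dot products per active user, so in practice you'd parallelize the product scoring and filter the user set hard; the point is that the cost scales with the number of *active* users rather than all users times all products.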
On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan <aram.mkrtchyan...@gmail.com> wrote:
> Trying to build a recommendation system using Spark MLlib's ALS.
>
> Currently we're trying to pre-build recommendations for all users on a
> daily basis. We're using simple implicit feedback and ALS.
>
> The problem is, we have 20M users and 30M products, and to call the main
> predict() method we need the cartesian join of users and products, which
> is huge, and it may take days to generate the join alone. Is there a way
> to avoid the cartesian join to make the process faster?
>
> Currently we have 8 nodes with 64GB of RAM each; I think that should be
> enough for the data.
>
> val users: RDD[Int] = ???      // RDD with 20M user IDs
> val products: RDD[Int] = ???   // RDD with 30M product IDs
> val ratings: RDD[Rating] = ??? // RDD with all user->product feedback
>
> val model = new ALS().setRank(10).setIterations(10)
>   .setLambda(0.0001).setImplicitPrefs(true)
>   .setAlpha(40).run(ratings)
>
> val usersProducts = users.cartesian(products)
> val recommendations = model.predict(usersProducts)