Trying to build a recommendation system using Spark MLlib's ALS.

Currently, we're trying to pre-build recommendations for all users on a daily
basis, using simple implicit feedback and ALS.
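Roughly, we derive the ratings from interaction counts, something like the
sketch below (the events RDD and its shape are just an illustration of our
setup, not the exact code):

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical input: one (userId, productId) pair per interaction event.
val events: RDD[(Int, Int)] = ???

// Implicit feedback: use the interaction count per (user, product) as the "rating".
val implicitRatings: RDD[Rating] = events
  .map { case (user, product) => ((user, product), 1.0) }
  .reduceByKey(_ + _)
  .map { case ((user, product), count) => Rating(user, product, count) }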

The problem is that we have 20M users and 30M products, and to call the main
predict() method we need the Cartesian join of users and products. That join
is huge (20M x 30M is roughly 6 * 10^14 user-product pairs), and it may take
days just to generate it. Is there a way to avoid the Cartesian join and make
the process faster?

Currently we have 8 nodes with 64 GB of RAM; I think that should be enough for
the data.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

val users: RDD[Int] = ???           // RDD with 20M userIds
val products: RDD[Int] = ???        // RDD with 30M productIds
val ratings: RDD[Rating] = ???      // RDD with all user->product feedback

val model = new ALS().setRank(10).setIterations(10)
  .setLambda(0.0001).setImplicitPrefs(true)
  .setAlpha(40).run(ratings)

// This is the step that blows up: 20M users x 30M products.
val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)
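What we're really after is just the top N products per user, without ever
materializing the full user x product RDD ourselves. Something like the call
below is what we're hoping for; recommendProductsForUsers exists on
MatrixFactorizationModel in more recent Spark releases, but we're not sure
whether it actually avoids the blow-up or just hides the same cost internally:

// Hoped-for shape of the result: top 10 products per user, no explicit
// cartesian join on our side. recommendProductsForUsers(num) returns
// RDD[(Int, Array[Rating])] in Spark 1.4+.
val topRecommendations = model.recommendProductsForUsers(10)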
