I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, `userId` and `productId`. I have **no product ratings**, only information on which products each user has bought. To train ALS I use:

```scala
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
```
(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires a `Rating` object:

```scala
Rating(user: Int, product: Int, rating: Double)
```

On the other hand, the documentation for `trainImplicit` says: *Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, **preference**) pairs.*

When I set the rating / preference to `1`, as in:

```scala
val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

// Split roughly 60% / 20% / 20% using the random number attached to each record
val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()
```

And then train ALS:

```scala
// Train on the training split (`ratings` itself is an RDD of (Int, Rating) pairs)
val model = ALS.trainImplicit(training, rank, numIter)
```

I get an RMSE of 0.9, which is a large error when preferences take only the values 0 or 1:

```scala
val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  // Pair each prediction with the observed rating for the same (user, product)
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
```

So my question is: to what value should I set `rating` in:

```scala
Rating(user: Int, product: Int, rating: Double)
```

for implicit training (in the `ALS.trainImplicit` method)?

**Update**

With:

```scala
val alpha = 40
val lambda = 0.01
```

I get:

```
Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0
testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.
```

This is still a large error, I guess. I also get a strange baseline improvement, where the baseline model is simply the mean (1).
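
The `-Infinity%` figure can be reproduced with plain arithmetic. The following is a minimal sketch, not the code that produced the output above: it assumes the improvement is computed as `(baselineRmse - testRmse) / baselineRmse * 100` (the post does not show this formula), and the names `BaselineCheck`, `constantRmse`, and `improvement` are hypothetical. When every observed rating is 1.0, the mean is 1.0, so a baseline that always predicts the mean has an RMSE of exactly 0.0, and dividing a negative number by 0.0 in floating point yields negative infinity.

```scala
object BaselineCheck {
  /** RMSE of a model that always predicts `constant` for every observation. */
  def constantRmse(observed: Seq[Double], constant: Double): Double =
    math.sqrt(observed.map(r => (r - constant) * (r - constant)).sum / observed.size)

  /** Relative improvement over the baseline in percent (assumed formula, see lead-in). */
  def improvement(baselineRmse: Double, testRmse: Double): Double =
    (baselineRmse - testRmse) / baselineRmse * 100

  def main(args: Array[String]): Unit = {
    // Every implicit "rating" was set to 1.0, so the mean is also 1.0.
    val observed = Seq.fill(5)(1.0)
    val mean = observed.sum / observed.size

    // Predicting the mean is then a perfect fit, so the baseline RMSE is 0.0.
    val baselineRmse = constantRmse(observed, mean)

    // Test RMSE as reported in the post.
    val testRmse = 0.7302343904091481

    println(s"baselineRmse: $baselineRmse")
    println(s"improvement: ${improvement(baselineRmse, testRmse)}%")
  }
}
```

Under these assumptions, a baseline RMSE of 0.0 says only that all observed values are identical, so the improvement percentage carries no information about the quality of the ALS model.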