Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17090

I commented further on the [JIRA](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15898855&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898855).

Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to `RankingEvaluator` is:

### Schema 1
```
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
```

You will notice that `rating` and `prediction` columns can be `null`. This is by design. There are three cases shown above:

1. 1st row indicates a (user-item) pair that occurs in *both* the ground-truth set *and* the top-k predictions;
2. 2nd row indicates a (user-item) pair that occurs in the ground-truth set, *but not* in the top-k predictions;
3. 3rd row indicates a (user-item) pair that occurs in the top-k predictions, *but not* in the ground-truth set.

_Note_ for reference, the input to the current `mllib` `RankingMetrics` is:

### Schema 2
```
RDD[(true labels array, predicted labels array)],
i.e. RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
```

(So actually neither of the above schemas are easily compatible with the return schema here - but I think it is not really necessary to match the `mllib.RankingMetrics` format)

### ALS cross-validation

My proposal for fitting ALS into cross-validation is the `ALSModel.transform` will output a DF of **Schema 1** - *only* when the parameters `k` and `recommendFor` are appropriately set, and the input DF contains both `user` and `item` columns. In practice, this scenario will occur during cross-validation only.
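To make the three **Schema 1** cases concrete, here is a minimal sketch of how such rows could arise from a full outer join of the ground-truth ratings with the top-k predictions on the (user, item) key. It uses plain Scala collections rather than Spark DataFrames, with `None` standing in for `null`. `Schema1Row`, `Schema1Sketch`, `schema1` and `precisionAtK` are hypothetical names for illustration only, not part of this PR or of Spark.

```scala
// Hypothetical row type mirroring Schema 1; None stands in for a SQL null.
final case class Schema1Row(
    userId: Int,
    movieId: Int,
    rating: Option[Double],
    prediction: Option[Double])

object Schema1Sketch {
  // Full outer join of ground truth and top-k predictions on (userId, movieId).
  // A key present on only one side yields None in the other column.
  def schema1(
      truth: Map[(Int, Int), Double],
      topK: Map[(Int, Int), Double]): Seq[Schema1Row] = {
    val keys = (truth.keySet ++ topK.keySet).toSeq.sorted
    keys.map { case key @ (user, item) =>
      Schema1Row(user, item, truth.get(key), topK.get(key))
    }
  }

  // Precision@k for one user: of the k highest-scored predicted items,
  // the fraction that also appear in the ground truth (rating is defined).
  // Assumption: the denominator is k, even if fewer than k items were predicted.
  def precisionAtK(rows: Seq[Schema1Row], userId: Int, k: Int): Double = {
    val predicted = rows
      .filter(r => r.userId == userId && r.prediction.isDefined)
      .sortBy(r => -r.prediction.get)
      .take(k)
    predicted.count(_.rating.isDefined).toDouble / k
  }
}
```

Feeding in the three example rows for user 230 (ground truth for items 318 and 3424, predictions for items 318 and 81191) reproduces the table above. An evaluator can then compute a ranking metric from these rows alone, without a separate ground-truth input: precision@2 for user 230 comes out as 0.5.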
So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from `transform` such that it can be used in a cross-validation as input to the `RankingEvaluator`.

__Concretely:__
```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.TrainValidationSplit

// setRecommendFor, setK and RankingEvaluator are the params / evaluator proposed in this discussion
val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
```

So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):

```scala
// validator.fit returns a TrainValidationSplitModel; the fitted ALSModel is its bestModel
val recommendations = bestModel.bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
```

Alternatively, the `transform` version looks like this:

```scala
val usersDF = ...
// +------+
// |userId|
// +------+
// |     1|
// |     2|
// |     3|
// +------+
val recommendations = bestModel.transform(usersDF)
```

So the questions:

1. should we support the above `transform`-based recommendations? Or only support it for cross-validation purposes as a special case?
2. if we do, what should the output schema of the above `transform` version look like? It must certainly match the output of `recommendX` methods.

The options are:

(1) The schema in this PR:
**Pros**: as you mention above - also more "compact"
**Cons**: doesn't match up so closely with the `transform` "cross-validation" schema above

(2) The schema below. It is basically an "exploded" version of option (1)

```
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
```

**Pros**: matches more closely with the cross-validation / evaluator input format. Perhaps slightly more "dataframe-like".
**Cons**: less compact; lose ordering?; may require more munging to save to external data stores etc.
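The relationship between options (1) and (2), and the "lose ordering?" concern, can be sketched with plain Scala collections. This assumes option (1) is one row per user holding an ordered list of (item, prediction) pairs; the exact nested type in the PR may differ. `Compact`, `Exploded` and `SchemaOptionsSketch` are hypothetical names used only for this illustration.

```scala
// Option (1), assumed shape: one row per user with an ordered list of (movieId, prediction).
final case class Compact(userId: Int, recommendations: Seq[(Int, Double)])
// Option (2): one row per (user, item) pair.
final case class Exploded(userId: Int, movieId: Int, prediction: Double)

object SchemaOptionsSketch {
  // Option (1) -> option (2): emit one row per recommended item.
  def explode(rows: Seq[Compact]): Seq[Exploded] =
    for {
      row <- rows
      (item, score) <- row.recommendations
    } yield Exploded(row.userId, item, score)

  // Option (2) -> option (1): group by user, then re-sort by descending
  // prediction, since the exploded rows carry no ordering guarantee.
  def compact(rows: Seq[Exploded]): Seq[Compact] =
    rows.groupBy(_.userId).toSeq.sortBy(_._1).map { case (user, recs) =>
      Compact(user, recs.sortBy(r => -r.prediction).map(r => (r.movieId, r.prediction)))
    }
}
```

In this sketch the ranking is recoverable from option (2) only because the `prediction` column is kept and re-sorted. That is the extra munging the exploded form would ask of users who want an ordered top-k list per user.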

Anyway, sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.