[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902639#comment-15902639 ]
Nick Pentreath edited comment on SPARK-14409 at 3/9/17 8:05 AM:
----------------------------------------------------------------

I commented on the [PR for SPARK-19535|https://github.com/apache/spark/pull/17090#issuecomment-284648012] and am copying that comment here for future reference, as it contains further detail of the discussion:

=====

Sorry if my other comments here and on JIRA were unclear. The proposed schema for input to RankingEvaluator is:

*Schema 1*
{noformat}
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
{noformat}

You will notice that the rating and prediction columns can be null. This is by design. The rows above illustrate three cases:
# the 1st row is a (user, item) pair that occurs in both the ground-truth set and the top-k predictions;
# the 2nd row is a (user, item) pair that occurs in the ground-truth set, but not in the top-k predictions;
# the 3rd row is a (user, item) pair that occurs in the top-k predictions, but not in the ground-truth set.

For reference, the input to the current mllib RankingMetrics is:

*Schema 2*
{noformat}
RDD[(true labels array, predicted labels array)], i.e. RDD of
([318, 3424, 7139, ...], [81191, 93040, 31, ...])
{noformat}

(So neither of the above schemas is easily compatible with the return schema here - but I don't think it is really necessary to match the mllib.RankingMetrics format.)

*ALS cross-validation*

My proposal for fitting ALS into cross-validation is that ALSModel.transform will output a DataFrame of Schema 1 - but only when the parameters k and recommendFor are set appropriately, and the input DataFrame contains both user and item columns. In practice, this scenario will occur only during cross-validation.
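To make the three cases concrete, here is a minimal sketch of the evaluator-side logic using plain Scala collections (not the proposed Spark API - the Eval class and field names are purely illustrative): split Schema 1 rows into a ground-truth set and a ranked top-k list, then compute precision@k for a single user:

{code}
// null-able columns are modelled as Options
case class Eval(userId: Int, movieId: Int, rating: Option[Double], prediction: Option[Double])

val rows = Seq(
  Eval(230, 318,   Some(5.0), Some(4.2403245)), // in ground truth and in top-k
  Eval(230, 3424,  Some(4.0), None),            // in ground truth only
  Eval(230, 81191, None,      Some(4.317455))   // in top-k only
)

// non-null rating  => the item is in the user's ground-truth set
val groundTruth = rows.filter(_.rating.isDefined).map(_.movieId).toSet

// non-null prediction => the item was in the top-k, ranked by predicted score
val ranked = rows.collect { case Eval(_, m, _, Some(p)) => (p, m) }
  .sortBy(-_._1).map(_._2)

val k = 10
val precisionAtK = ranked.take(k).count(groundTruth.contains).toDouble / k
{code}

A real evaluator would of course group by userId and average the per-user metric, but the null-handling above is the crux of why Schema 1 allows both columns to be nullable.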
So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from transform, such that it can be used in cross-validation as input to the RankingEvaluator. Concretely:

{code}
val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
{code}

So while it is complex under the hood, to users it's simply a case of setting 2 params, and the rest is as normal.

Now we have the best model selected by cross-validation. We can make recommendations using the convenience methods (I think it will need a cast):

{code}
val recommendations = bestModel.asInstanceOf[ALSModel].recommendItemsForUsers(10)
{code}

Alternatively, the transform version looks like this:

{code}
val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+
val recommendations = bestModel.transform(usersDF)
{code}

So the questions:
* Should we support the above transform-based recommendations, or support them only for cross-validation purposes as a special case?
* If we do, what should the output schema of the above transform version look like? It must certainly match the output of the recommendX methods. The options are:

*(1) The schema in this PR*
*Pros*: as you mention above - also more "compact".
*Cons*: doesn't match up as closely with the transform "cross-validation" schema above.

*(2) The schema below, which is basically an "exploded" version of option (1)*
{noformat}
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
{noformat}
*Pros*: matches more closely with the cross-validation / evaluator input format. Perhaps slightly more "dataframe-like".
*Cons*: less compact; loses ordering?; may require more munging to save to external data stores, etc.
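For what it's worth, the relationship between the two candidate schemas can be sketched with plain Scala collections (illustrative types, not the actual PR schema): option (2) is just a flatMap / explode of option (1), which is also where the ordering concern comes from:

{code}
// option (1): one nested row per user, recommendations ordered by score
val compact: Seq[(Int, Seq[(Int, Double)])] = Seq(
  (1, Seq((1, 4.3), (5, 3.2), (9, 2.1)))  // userId -> (movieId, prediction)
)

// option (2): the "exploded" form, one row per (user, item) pair
val exploded: Seq[(Int, Int, Double)] = compact.flatMap { case (userId, recs) =>
  recs.map { case (movieId, score) => (userId, movieId, score) }
}
// here the per-user ordering survives the flatMap, but after a DataFrame
// explode the row order is not guaranteed - hence the "loses ordering?" con
{code}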
Anyway, sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.

> Investigate adding a RankingEvaluator to ML
> -------------------------------------------
>
>          Key: SPARK-14409
>          URL: https://issues.apache.org/jira/browse/SPARK-14409
>      Project: Spark
>   Issue Type: New Feature
>   Components: ML
>     Reporter: Nick Pentreath
>     Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful
> for recommendation evaluation (and can be useful in other settings
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)