[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907063#comment-15907063 ]

Danilo Ascione commented on SPARK-14409:

I updated the [PR|https://github.com/apache/spark/pull/16618] with the ranking metrics computations as UDFs (as suggested [here|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15896933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15896933]). I focused on minimizing changes to the ranking metrics implementation from the mllib package (basically, only the UDF part).

> Investigate adding a RankingEvaluator to ML
> -------------------------------------------
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Nick Pentreath
> Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful for recommendation evaluation (and can be useful in other settings potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.
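For context, a ranking metric expressed as a Spark SQL UDF can look roughly like the sketch below. This is illustrative only and not the code in the PR: the function name, the "predicted_labels"/"true_labels" column names, and the Int id type are assumptions for the example.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, udf}

// Hypothetical precision@k as a UDF over id arrays, so the metric can be computed
// column-wise instead of round-tripping through an RDD[(Array[T], Array[T])].
def precisionAtK(k: Int) = udf { (predicted: Seq[Int], actual: Seq[Int]) =>
  if (predicted == null || actual == null || actual.isEmpty) 0.0
  else {
    val actualSet = actual.toSet
    // hits among the top-k predictions, normalized by k,
    // in the spirit of mllib.evaluation.RankingMetrics.precisionAt
    predicted.take(k).count(actualSet.contains).toDouble / k
  }
}

// Averaging the per-query metric over a DataFrame that has the two array columns.
def meanPrecisionAtK(df: DataFrame, k: Int): Double =
  df.select(avg(precisionAtK(k)(col("predicted_labels"), col("true_labels"))))
    .head().getDouble(0)
{code}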
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902649#comment-15902649 ]

Nick Pentreath commented on SPARK-14409:

[~josephkb] in reference to your [PR comment|https://github.com/apache/spark/pull/17090#issuecomment-284827573]: the input schema for evaluation is really fairly simple - a set of ground-truth ids and a (sorted) set of predicted ids, for each query (/user). The exact format (arrays as in the {{mllib}} version, or the "exploded" version proposed in this JIRA) is not relevant in itself. Rather, the format selected is actually dictated by the {{Pipeline}} API - specifically, a model's prediction output schema from {{transform}} must be compatible with the evaluator's input schema for {{evaluate}}.

The schema proposed above is - I believe - the only one that is compatible with both "linear model" style things such as {{LogisticRegression}} for ad CTR prediction and learning-to-rank settings, as well as recommendation tasks.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902639#comment-15902639 ]

Nick Pentreath commented on SPARK-14409:

I commented on the [PR for SPARK-19535|https://github.com/apache/spark/pull/17090#issuecomment-284648012] and am copying that comment here for future reference as it contains further detail of the discussion:

{noformat}
Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to RankingEvaluator is:

Schema 1
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+

You will notice that the rating and prediction columns can be null. This is by design. There are three cases shown above:
1. The 1st row is a (user, item) pair that occurs in both the ground-truth set and the top-k predictions.
2. The 2nd row is a (user, item) pair that occurs in the ground-truth set, but not in the top-k predictions.
3. The 3rd row is a (user, item) pair that occurs in the top-k predictions, but not in the ground-truth set.

Note for reference, the input to the current mllib RankingMetrics is:

Schema 2
RDD[(true labels array, predicted labels array)], i.e. an RDD of
([318, 3424, 7139, ...], [81191, 93040, 31, ...])

(So actually neither of the above schemas is easily compatible with the return schema here - but I think it is not really necessary to match the mllib.RankingMetrics format.)

ALS cross-validation

My proposal for fitting ALS into cross-validation is that ALSModel.transform will output a DF of Schema 1 - only when the parameters k and recommendFor are appropriately set, and the input DF contains both user and item columns. In practice, this scenario will occur during cross-validation only.

So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from transform such that it can be used in cross-validation as input to the RankingEvaluator. Concretely:

val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)

So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):

val recommendations = bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)

Alternatively, the transform version looks like this:

val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+

val recommendations = bestModel.transform(usersDF)

So the questions:
* Should we support the above transform-based recommendations? Or only support it for cross-validation purposes as a special case?
* If we do, what should the output schema of the above transform version look like? It must certainly match the output of the recommendX methods. The options are:

(1) The schema in this PR.
Pros: as you mention above - also more "compact".
Cons: doesn't match up so closely with the transform "cross-validation" schema above.

(2) The schema below. It is basically an "exploded" version of option (1):
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
Pros: matches more closely with the cross-validation / evaluator input format. Perhaps slightly more "dataframe-like".
Cons: less compact; lose ordering?; may require more munging to save to external data stores etc.

Anyway, sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.
{noformat}
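To make the Schema 1 semantics concrete: the nullable rating/prediction columns fall out naturally from an outer join of the held-out ratings with the top-k recommendations. The snippet below is only an illustrative sketch with assumed inputs ({{heldOutRatings}} and {{topKRecs}} are hypothetical DataFrames, not existing API).

{code}
// heldOutRatings: (userId, movieId, rating); topKRecs: (userId, movieId, prediction).
// Rows only in heldOutRatings end up with a null prediction (in ground truth, not recommended);
// rows only in topKRecs end up with a null rating (recommended, not in ground truth).
val evaluatorInput = heldOutRatings
  .join(topKRecs, Seq("userId", "movieId"), "outer")
  .select("userId", "movieId", "rating", "prediction")
{code}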
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898855#comment-15898855 ]

Nick Pentreath commented on SPARK-14409:

[~josephkb] the proposed input schema above encompasses that - the {{labelCol}} is the true relevance score (rating, confidence, etc), while the {{predictionCol}} is the predicted relevance (rating, confidence, etc). Note we can name these columns something more specific ({{labelCol}} and {{predictionCol}} are really re-used from the other evaluators).

This also allows "weighted" forms of ranking metric later (e.g. some metrics can incorporate the true relevance score into the computation, which serves as a form of weighting of the metric) - the metrics we currently have don't do that. So for now the true relevance can serve as a filter - for example, when computing the ranking metric for recommendation, we *don't* want to include negative ratings in the "ground truth set of relevant documents" - hence the {{goodThreshold}} param above (I would rather call it something like {{relevanceThreshold}} myself).

*Note* that there are 2 formats I detail in my comment above - the first is the actual schema of the {{DataFrame}} used as input to the {{RankingEvaluator}} - this must therefore be the schema of the DF output by {{model.transform}} (whether that is ALS for recommendation, a logistic regression for ad prediction, or whatever). The second format mentioned is simply illustrating the *intermediate internal transformation* that the evaluator will do in the {{evaluate}} method. You can see a rough draft of it in Danilo's PR [here|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95fR93].
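For illustration only, the filtering role of the true relevance score could look like the sketch below, assuming {{dataset}} is the evaluator input in the schema above; the column names and the {{relevanceThreshold}} value (which would come from an evaluator Param) are assumptions.

{code}
import org.apache.spark.sql.functions.{col, collect_list}

val relevanceThreshold = 0.0  // hypothetical Param; pairs at or below it are "irrelevant"

// The true relevance score is only used to decide membership of the ground-truth set;
// it does not otherwise enter the current (unweighted) metric computations.
val groundTruth = dataset
  .filter(col("rating") > relevanceThreshold)
  .groupBy("userId")
  .agg(collect_list("movieId").as("true_labels"))
{code}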
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898823#comment-15898823 ]

Joseph K. Bradley commented on SPARK-14409:

Thanks [~nick.pentre...@gmail.com]! I like this general approach. A few initial thoughts:

Schema for evaluator:
* Some evaluators will take rating or confidence values as well. Will those be appended as extra columns?
* If a recommendation model like ALSModel returns top K recommendations for each user, that will not fit the RankingEvaluator input. Do you plan to have RankingEvaluator or CrossValidator handle efficient calculation of top K recommendations?
* Relatedly, I'll comment on the schema in [https://github.com/apache/spark/pull/17090] directly in that PR in case we want to make changes in a quick follow-up.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897198#comment-15897198 ]

Danilo Ascione commented on SPARK-14409:

Thank you [~mlnick] for taking the time to think about this. I like the generalization for the most common scenarios. The Evaluator approach is already implemented in [#16618|https://github.com/apache/spark/pull/16618]. I'll find time to update the PR with the proposed generalization and the ranking metrics computations as UDFs.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896933#comment-15896933 ]

Nick Pentreath commented on SPARK-14409:

I've thought about this a lot over the past few days, and I think the approach should be in line with that suggested by [~roberto.mirizzi] & [~danilo.ascione].

*Goal*

Provide a DataFrame-based ranking evaluator that is general enough to handle common scenarios such as recommendations (ALS), search ranking, and ad click prediction using ranking metrics (e.g. recent Kaggle competitions for illustration: [Outbrain Ad Clicks using MAP|https://www.kaggle.com/c/outbrain-click-prediction#evaluation], [Expedia Hotel Search Ranking using NDCG|https://www.kaggle.com/c/expedia-personalized-sort#evaluation]).

*RankingEvaluator input format*

{{evaluate}} would take a {{DataFrame}} with columns:
* {{queryCol}} - the column containing the "query id" (e.g. "query" for cases such as search ranking; "user" for recommendations; "impression" for ad click prediction/ranking, etc);
* {{documentCol}} - the column containing the "document id" (e.g. "document" in search, "item" in recommendation, "ad" in ad ranking, etc);
* {{labelCol}} (or maybe {{relevanceCol}} to be more precise) - the column containing the true relevance score for a query-document pair (e.g. in recommendations this would be the "rating"). This column will only be used for filtering out "irrelevant" documents from the ground-truth set (see the Param {{goodThreshold}} mentioned [above|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15826901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15826901]);
* {{predictionCol}} - the column containing the predicted relevance score for a query-document pair. The predicted ids will be ordered by this column for computing ranking metrics (order matters for predictions, but generally not for the ground truth, which is treated as a set).

The reasoning is that this format is flexible & generic enough to encompass the diverse use cases mentioned above. Here is an illustrative example from recommendations as a special case:
{code}
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
{code}
You will notice that the {{rating}} and {{prediction}} columns can be {{null}}. This is by design. There are three cases shown above:
# 1st row indicates a query-document (user-item) pair that occurs in *both* the ground-truth set and the top-k predictions;
# 2nd row indicates a user-item pair that occurs in the ground-truth set, but *not* in the top-k predictions;
# 3rd row indicates a user-item pair that *does not* occur in the ground-truth set, but *does* occur in the top-k predictions.

*Note* that while technically the input allows both these columns to be {{null}}, in practice that won't occur, since a query-document pair must occur in at least one of the ground-truth set or the predictions. If it does occur for some reason, it can be ignored.

*Evaluator approach*

The evaluator will perform a window function over {{queryCol}} and order by {{predictionCol}} within each query. Then, {{collect_list}} can be used to arrive at the following intermediate format:
{code}
+------+--------------------+--------------------+
|userId|         true_labels|    predicted_labels|
+------+--------------------+--------------------+
|   230|[318, 3424, 7139,...|[81191, 93040, 31...|
+------+--------------------+--------------------+
{code}

*Relationship to RankingMetrics*

Technically the intermediate format above is the same format as used for {{RankingMetrics}}, and perhaps we could simply wrap the {{mllib}} version. *Note* however that the {{mllib}} class is parameterized by the type of "document": {code}RankingMetrics[T]{code}

I believe for the generic case we must support both {{NumericType}} and {{StringType}} for id columns (rather than restricting to {{Int}} as in Danilo's & Roberto's versions above). So either:
# the evaluator must be similarly parameterized; or
# we will need to re-write the ranking metrics computations as UDFs, as follows:
{code}
udf { (predicted: Seq[Any], actual: Seq[Any]) => ... }
{code}
I strongly prefer option #2 as it is more flexible and in keeping with the DataFrame style of Spark ML components (as a side note, this will give us a chance to review the implementations & naming of the metrics, since there are some issues with a few of them).

That is my proposal (sorry Yong, this is quite different now from the work you've done in your PR). If Yong or Danilo has time to update his PR in this direction, let me know. Thanks!
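A rough sketch of that intermediate transformation, using the recommendation column names from the example above and assuming {{dataset}} holds the proposed input schema. This is illustrative only: {{k}} is assumed to be an evaluator Param, and collecting (rank, id) structs and sorting them is just one way to keep the prediction ordering stable through the aggregation, not the final implementation.

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val k = 10  // assumed evaluator Param

// Top-k predicted ids per query, ordered by predicted relevance.
val byPrediction = Window.partitionBy("userId").orderBy(col("prediction").desc)
val predicted = dataset
  .where(col("prediction").isNotNull)
  .withColumn("rank", row_number().over(byPrediction))
  .where(col("rank") <= k)
  .groupBy("userId")
  // collect (rank, id) structs and sort them, since collect_list alone does not guarantee order
  .agg(sort_array(collect_list(struct(col("rank"), col("movieId")))).as("ranked"))
  .select(col("userId"), col("ranked.movieId").as("predicted_labels"))

// Ground-truth ids per query (a set, so no ordering is needed).
val actual = dataset
  .where(col("rating").isNotNull)
  .groupBy("userId")
  .agg(collect_list("movieId").as("true_labels"))

// Outer join keeps queries that appear on only one side.
val intermediate = actual.join(predicted, Seq("userId"), "outer")
{code}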
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895619#comment-15895619 ]

Danilo Ascione commented on SPARK-14409:

I can help with both PRs. Please consider that the solution in [PR 16618|https://github.com/apache/spark/pull/16618] is a DataFrame-API-based version of the one in [PR 12461|https://github.com/apache/spark/pull/12461]. Anyway, I'd like to help review an alternative solution. Thanks!
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883431#comment-15883431 ]

Roberto Mirizzi commented on SPARK-14409:

[~mlnick] my implementation was conceptually close to what we already have in the existing mllib. If you look at the example in http://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#ranking-systems, they do exactly what I do with the goodThreshold parameter. As you can see in my approach, I am using collect_list and windowing, and I simply pass the Dataset to the evaluator, similar to what we have for the other evaluators in ml. IMO, that's the approach that has continuity with the other existing evaluators. However, if you think we should also support array columns, we can add that too.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883417#comment-15883417 ]

Yong Tang commented on SPARK-14409:

Thanks [~mlnick] for the reminder. I will take a look and update the PR as needed. (I am on the road until next Wednesday. Will try to get it by the end of next week.)
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882174#comment-15882174 ]

Nick Pentreath commented on SPARK-14409:

The other option is to work with [~danilo.ascione]'s PR here: https://github.com/apache/spark/pull/16618, if Yong does not have time to update.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882163#comment-15882163 ]

Nick Pentreath commented on SPARK-14409:

[~roberto.mirizzi] the {{goodThreshold}} param seems pretty reasonable in this context to exclude irrelevant items. I think it can be a good {{expertParam}} addition.

Ok, I think that a first pass at this should just aim to replicate what we have exposed in {{mllib}} and wrap {{RankingMetrics}}. Initially we can look at: (a) supporting numeric columns and doing the windowing & {{collect_list}} approach to feed into {{RankingMetrics}}; (b) supporting array columns and feeding directly into {{RankingMetrics}}; or (c) supporting both.

[~yongtang] already did a PR here: https://github.com/apache/spark/pull/12461. It is fairly complete and also includes MRR. [~yongtang], are you able to work on reviving that PR? If so, [~roberto.mirizzi] [~danilo.ascione], are you able to help review that PR?
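Either way, the wrapping step itself is small. Below is a sketch under stated assumptions (Int id arrays and hypothetical "predicted_labels"/"true_labels" column names; not the final evaluator API), showing how per-query array columns could be handed to the existing mllib {{RankingMetrics}}.

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.DataFrame

// withArrayCols is assumed to already hold two array<int> columns:
// "predicted_labels" (ordered top-k) and "true_labels" (ground-truth set).
def wrapMllibMetrics(withArrayCols: DataFrame): RankingMetrics[Int] = {
  val predictionAndLabels = withArrayCols
    .select("predicted_labels", "true_labels")
    .rdd
    .map { r =>
      // guard against nulls from an outer join between predictions and ground truth
      val predicted = Option(r.getSeq[Int](0)).getOrElse(Seq.empty).toArray
      val actual = Option(r.getSeq[Int](1)).getOrElse(Seq.empty).toArray
      (predicted, actual)
    }
  new RankingMetrics(predictionAndLabels)
}

// e.g. wrapMllibMetrics(df).meanAveragePrecision, .precisionAt(5), .ndcgAt(10)
{code}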
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880324#comment-15880324 ]

Nick Pentreath commented on SPARK-14409:

[~roberto.mirizzi] If using the current {{ALS.transform}} output as input to the {{RankingEvaluator}}, as envisaged here, the model will predict a score for each {{user-item}} pair in the evaluation set. For each user, the ground truth is exactly this distinct set of items. By definition the top-k items ranked by predicted score will be in the ground-truth set, since {{ALS}} is only scoring {{user-item}} pairs *that already exist in the evaluation set*. So how is it possible *not* to get a perfect score, since all top-k recommended items will be "relevant"? Unless you are cutting off the ground-truth set at {{k}} too - in which case that does not sound like a correct computation to me.

By contrast, if {{ALS.transform}} output a set of top-k items for each user, where the items are scored from *the entire set of possible candidate items*, then computing the ranking metric of that top-k set against the actual ground truth for each user is correct.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880312#comment-15880312 ]

Nick Pentreath commented on SPARK-14409:

[~danilo.ascione] Yes, your solution is generic, assuming the input {{DataFrame}} is {{| user | item | predicted_score | actual_score |}}, and that either of {{predicted_score}} or {{actual_score}} could be missing.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826901#comment-15826901 ]

Roberto Mirizzi commented on SPARK-14409:

[~srowen] I've updated the code to generalize K. I've also added a couple of lines to deal with NaN (it probably could be further generalized, but it's a good start).

In the code I propose, I simply re-use the class *org.apache.spark.mllib.evaluation.RankingMetrics*, already available in Spark since 1.2.0. The class only offers *p@k*, *ndcg@k* and *map* (as you can also see here: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html#ranking-systems). That's why they are the only ones available in my implementation. AUC and ROC are under *BinaryClassificationMetrics*. I haven't wrapped them yet, but I could do that too later.

The motivation behind *goodThreshold* is that the ground truth may also contain items that the user doesn't like. However, when you compute an accuracy metric, you want to make sure you compare only against the set of items that the user likes. As you can see in my code, it's set to 0 by default, so unless specified, everything in the user profile will be considered.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826309#comment-15826309 ]

Danilo Ascione commented on SPARK-14409:

[~srowen] [~mlnick] Also, about the top-k problem ("You are comparing the top-k items as predicted by the model to the top-k items as defined by the label."): my solution is different in this respect - it evaluates each label (from the user-item pair) against the top-k items as predicted by the model (for each user). Does this make sense to you?
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826285#comment-15826285 ]

Danilo Ascione commented on SPARK-14409:

[~mlnick] This is a snippet to illustrate how I have dealt with the "null" problem:
{code}
val predictionAndLabels: DataFrame = dataset
  .join(topAtk, Seq($(queryCol)), "outer") // outer join to deal with nulls in the "label" column
  .withColumn("topAtk", coalesce(col("topAtk"), mapToEmptyArray_())) // coalesce to deal with nulls in the "prediction" column
  .select($(labelCol), "topAtk")
{code}
From line 111 of [RankingEvaluator|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95f] (I opened a PR for better readability).
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826248#comment-15826248 ]

Apache Spark commented on SPARK-14409:

User 'daniloascione' has created a pull request for this issue: https://github.com/apache/spark/pull/16618
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826186#comment-15826186 ]

Nick Pentreath commented on SPARK-14409:

Yes, to be more clear, I would expect that the {{k}} param would be specified as in Danilo's version, for example. I do like the use of windowing to achieve the sort within each user.

This approach would also not work well with purely implicit data (unweighted). If everything is relevant in the ground truth, then the model would score perfectly each time. It sort of works for the explicit rating case, or the implicit case with "preference weights", since the ground truth then has an inherent ordering.

Still, I think the evaluator must be able to deal with the case of generating recommendations from the full item set. This means that the "label" and "prediction" columns could contain nulls: where an item exists in the ground truth but is not recommended (hence no score), the "prediction" column would be null, while if an item is recommended but is not in the ground truth, the "label" column would be null. See my comments in SPARK-13857 for details.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826129#comment-15826129 ]

Sean Owen commented on SPARK-14409:

BTW [~roberto.mirizzi] there are much simpler ways to write your match statements with regexes if needed, and there is no reason to arbitrarily support only k <= 10. We usually move to a pull request with [WIP] in the title if trying to review substantial code, but maybe we're not there yet.

What's the need for goodThreshold? All of the ranking metrics supported here are, typically, a function of the top k predictions and the top k "ground truth" relevant items from a held-out set. I think this is also implementable as a top-k per user/query, but based on the label rather than the prediction. This is probably a workable design to support precision, recall and MAP, but I don't think it's a design that will support more general ranking metrics like AUC. Hm, I haven't thought this through, but maybe the existing, separate support for AUC would cover this case? I know it exists in MLlib.

I agree that this would have to be applied to the original data set, and not to a subset as picked out by ALS. You are comparing the top-k items as predicted by the model to the top-k items as defined by the label. I'm accustomed to actually holding out those top-k from training too. I don't know how easy that is to work into this design, and at some scale, it probably won't skew the evaluation too much. But if the model is given all the answers, including all the top-k best items, then we're really just testing its ability to reconstruct the input, and a model that trivially returns answers based on the input data directly would score perfectly.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826078#comment-15826078 ]

Nick Pentreath commented on SPARK-14409:

[~danilo.ascione] [~roberto.mirizzi] thanks for the code examples. Both seem reasonable and I like the DataFrame-based solutions here. The ideal solution would likely take a few elements from each design.

One aspect that concerns me is: how are you generating recommendations from ALS? It appears that you will be using the current output of {{ALS.transform}}. So you're computing a ranking metric in a scenario where you only recommend the subset of user-item combinations that occur in the evaluation data set. So it is sort of like a "re-ranking" evaluation metric, in a sense. I'd expect the ranking metric here to quite dramatically overestimate true performance, since in the real world you would generate recommendations from the complete set of available items. cc [~srowen] - thoughts?
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825707#comment-15825707 ]

Danilo Ascione commented on SPARK-14409:

I have implemented a DataFrame-API-based RankingEvaluator that can be used in a model selection pipeline (cross-validation). The approach is similar to that of [~roberto.mirizzi]. I posted some usage code in [SPARK-13857|https://issues.apache.org/jira/browse/SPARK-13857?focusedCommentId=15822021&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15822021] for discussion. The code is here: https://github.com/daniloascione/spark/commit/c93ab86d35984e9f70a3b4f543fb88f5541333f0

I would appreciate some feedback. Thanks!
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824774#comment-15824774 ]

Roberto Mirizzi commented on SPARK-14409:

I implemented the RankingEvaluator to be used with ALS. Here's the code:

{code:java}
package org.apache.spark.ml.evaluation

import org.apache.spark.annotation.Experimental
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.{Params, Param, ParamMap, ParamValidators}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, DoubleType, FloatType}

/**
 * Created by Roberto Mirizzi on 12/5/16.
 */
/**
 * :: Experimental ::
 * Evaluator for ranking, which expects two input columns: prediction and label.
 */
@Experimental
final class RankingEvaluator(override val uid: String)
  extends Evaluator with HasUserCol with HasItemCol with HasPredictionCol with HasLabelCol
    with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("rankEval"))

  /**
   * Param for metric name in evaluation. Supports:
   *  - `"map"` (default): mean average precision
   *  - `"p@k"`: precision@k (1 <= k <= 10)
   *  - `"ndcg@k"`: normalized discounted cumulative gain@k (1 <= k <= 10)
   *
   * @group param
   */
  val metricName: Param[String] = {
    val allowedParams = ParamValidators.inArray(Array("map",
      "p@1", "p@2", "p@3", "p@4", "p@5", "p@6", "p@7", "p@8", "p@9", "p@10",
      "ndcg@1", "ndcg@2", "ndcg@3", "ndcg@4", "ndcg@5", "ndcg@6", "ndcg@7", "ndcg@8", "ndcg@9", "ndcg@10"))
    new Param(this, "metricName", "metric name in evaluation " +
      "(map|p@1|p@2|p@3|p@4|p@5|p@6|p@7|p@8|p@9|p@10|" +
      "ndcg@1|ndcg@2|ndcg@3|ndcg@4|ndcg@5|ndcg@6|ndcg@7|ndcg@8|ndcg@9|ndcg@10)", allowedParams)
  }

  val goodThreshold: Param[String] = {
    new Param(this, "goodThreshold", "threshold for good labels")
  }

  /** @group getParam */
  def getMetricName: String = $(metricName)

  /** @group setParam */
  def setMetricName(value: String): this.type = set(metricName, value)

  /** @group getParam */
  def getGoodThreshold: Double = $(goodThreshold).toDouble

  /** @group setParam */
  def setGoodThreshold(value: Double): this.type = set(goodThreshold, value.toString)

  /** @group setParam */
  def setUserCol(value: String): this.type = set(userCol, value)

  /** @group setParam */
  def setItemCol(value: String): this.type = set(itemCol, value)

  /** @group setParam */
  def setLabelCol(value: String): this.type = set(labelCol, value)

  /** @group setParam */
  def setPredictionCol(value: String): this.type = set(predictionCol, value)

  setDefault(metricName -> "map")
  setDefault(goodThreshold -> "0")

  override def evaluate(dataset: Dataset[_]): Double = {
    val spark = dataset.sparkSession
    import spark.implicits._

    val schema = dataset.schema
    SchemaUtils.checkNumericType(schema, $(userCol))
    SchemaUtils.checkNumericType(schema, $(itemCol))
    SchemaUtils.checkColumnTypes(schema, $(labelCol), Seq(DoubleType, FloatType))
    SchemaUtils.checkColumnTypes(schema, $(predictionCol), Seq(DoubleType, FloatType))

    val windowByUserRankByPrediction = Window.partitionBy(col($(userCol))).orderBy(col($(predictionCol)).desc)
    val windowByUserRankByRating = Window.partitionBy(col($(userCol))).orderBy(col($(labelCol)).desc)

    val predictionDataset = dataset.select(col($(userCol)).cast(IntegerType), col($(itemCol)).cast(IntegerType),
      col($(predictionCol)).cast(FloatType), row_number().over(windowByUserRankByPrediction).as("rank"))
      .where(s"rank <= 10")
      .groupBy(col($(userCol)))
      .agg(collect_list(col($(itemCol))).as("prediction_list"))
      .withColumnRenamed($(userCol), "predicted_userId")
      .as[(Int, Array[Int])]
    predictionDataset.show()

    //// alternative to the above query
    //dataset.createOrReplaceTempView("sortedRanking")
    //spark.sql("SELECT _1 AS predicted_userId, collect_list(_2) AS prediction_list FROM " +
    //  "(SELECT *, row_number() OVER (PARTITION BY _1 ORDER BY _4 DESC) AS rank FROM sortedRanking) x " +
    //  "WHERE rank <= 10 " +
    //  "GROUP BY predicted_userId").as[(Int, Array[Int])]

    val actualDataset = dataset.select(col($(userCol)).cast(IntegerType), col($(itemCol)).cast(IntegerType),
      row_number().over(windowByUserRankByRating))
      .where(col($(labelCol)).cast(DoubleType) > $(goodThreshold))
      .groupBy(col($(userCol)))
      .agg(collect_list(col($(itemCol))).as("actual_list"))
      .withColumnRenamed($(userCol), "actual_userId")
      .as[(Int, Array[Int])]
    actualDataset.show()

    val predictionAndLabels =
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084 ]

Yong Tang commented on SPARK-14409:

Thanks [~mlnick] for the references. I will take a look at those and see what we could do with it. By the way, initially I thought I could simply call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble with the implementation, because the {{Dataset[_]}} passed to {{@Since("2.0.0") override def evaluate(dataset: Dataset[_]): Double}} in {{RankingEvaluator}} is not so easy to convert into RankingMetrics's {{RDD[(Array[T], Array[T])]}}. I will do some further investigation. If I cannot find an easy way to convert the dataset into the generic {{RDD[(Array[T], Array[T])]}}, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245077#comment-15245077 ]

Apache Spark commented on SPARK-14409:

User 'yongtang' has created a pull request for this issue: https://github.com/apache/spark/pull/12461
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240797#comment-15240797 ]

Nick Pentreath commented on SPARK-14409:

[~yongtang] [~josephkb] it would also be useful to try to ensure that the {{RankingEvaluator}} can handle more general ranking problems than recommendations, e.g. https://www.kaggle.com/c/expedia-personalized-sort/details/evaluation, https://www.kaggle.com/c/yandex-personalized-web-search-challenge and http://research.microsoft.com/en-us/projects/mslr/. Perhaps we can use some of these datasets to decide on the input data schema semantics etc.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240607#comment-15240607 ]

Yong Tang commented on SPARK-14409:

Thanks [~mlnick] [~josephkb]. Yes, I think wrapping RankingMetrics could be the first step, and reimplementing all RankingEvaluator methods in ML using DataFrames would be good after that. I will work on the reimplementation in several follow-up PRs.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240252#comment-15240252 ]

Joseph K. Bradley commented on SPARK-14409:

Thanks for writing this! I just made a few comments too. Wrapping RankingMetrics seems fine to me, though later on it would be worth re-implementing it using DataFrames and testing performance changes. The initial PR should not add new metrics, but follow-up ones can. Also, we'll need to follow up this issue with one to think about how to use ALS with CrossValidator. I'll comment on the linked JIRA for that.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238707#comment-15238707 ]

Nick Pentreath commented on SPARK-14409:

Given the amount of existing code in mllib RankingMetrics, I would go with your suggested approach of adding to RankingMetrics and wrapping that. It can also be useful for users of the old mllib API.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238462#comment-15238462 ] Yong Tang commented on SPARK-14409: --- Thanks [~mlnick] for the review. I was planning to add MRR to RankingMetrics and then wrap that as a first step. But if you think it makes sense, I can reimplement from scratch. Please let me know which way would be better and I will move forward with it. Thanks.
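For reference, the kind of MRR computation that could sit alongside the existing RankingMetrics methods might look roughly like the sketch below (the standalone function form and the Double ids are illustrative only; the real addition would presumably be a method on RankingMetrics itself):

{noformat}
import org.apache.spark.rdd.RDD

// Mean reciprocal rank over (ranked predictions, ground-truth ids) pairs:
// the reciprocal rank of the first relevant item per query, averaged over
// all queries; 0.0 when no predicted item is relevant.
def meanReciprocalRank(predictionAndLabels: RDD[(Array[Double], Array[Double])]): Double = {
  predictionAndLabels.map { case (pred, lab) =>
    val labSet = lab.toSet
    val firstHit = pred.indexWhere(labSet.contains)
    if (firstHit >= 0) 1.0 / (firstHit + 1) else 0.0
  }.mean()
}
{noformat}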
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236784#comment-15236784 ] Nick Pentreath commented on SPARK-14409: Thanks for working up the design doc. I made a few comments. Overall I think this makes sense - do you plan to reimplement from scratch, or add MRR to RankingMetrics and wrap that?
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236541#comment-15236541 ] Yong Tang commented on SPARK-14409: --- [~mlnick] [~josephkb] I added a short doc on Google Drive with comments enabled: https://docs.google.com/document/d/1YEvf5eEm2vRcALJs39yICWmUx6xFW5j8DvXFWbRbStE/edit?usp=sharing Please let me know if there is any feedback. Thanks.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228741#comment-15228741 ] Yong Tang commented on SPARK-14409: --- [~josephkb] Sure. Let me do some investigation into other libraries, and then I will add a design doc.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228719#comment-15228719 ] Joseph K. Bradley commented on SPARK-14409: --- If you do work on it, it would be useful to post a short design doc, since there are more types of options for ranking evaluation than for classification and regression. This could include looking at what other libraries support and what is commonly used in the literature.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227743#comment-15227743 ] Yong Tang commented on SPARK-14409: --- [~mlnick] I can work on this issue if no one has started yet. Thanks.