[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-13 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907063#comment-15907063
 ] 

Danilo Ascione commented on SPARK-14409:


I updated the [PR|https://github.com/apache/spark/pull/16618] with the ranking 
metrics computations as UDFs (as suggested 
[here|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15896933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15896933]).
I focused on minimizing changes to the ranking metrics implementation from the 
mllib package (basically, only the UDF part).
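
For illustration, a minimal sketch of what a metric-as-a-UDF looks like when 
applied to a per-query DataFrame of label/prediction arrays ({{perQueryLabels}}, 
the column names, and the metric shown are illustrative assumptions, not 
necessarily what the PR does):

{code}
import org.apache.spark.sql.functions.{avg, col, udf}

// Sketch only: precision@10 per query, then averaged into a single metric.
// perQueryLabels is assumed to hold one row per query with two array columns.
val precisionAt10 = udf { (predicted: Seq[Any], actual: Seq[Any]) =>
  if (predicted == null || actual == null || actual.isEmpty) 0.0
  else predicted.take(10).count(actual.toSet).toDouble / 10
}

val metric = perQueryLabels
  .withColumn("p@10", precisionAt10(col("predicted_labels"), col("true_labels")))
  .agg(avg("p@10")).head().getDouble(0)
{code}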

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902649#comment-15902649
 ] 

Nick Pentreath commented on SPARK-14409:


[~josephkb] in reference to your [PR 
comment|https://github.com/apache/spark/pull/17090#issuecomment-284827573]:

Really the input schema for evaluation is fairly simple - a set of ground-truth 
ids and a (sorted) set of predicted ids for each query (/user). The exact 
format (arrays as in the {{mllib}} version, the "exploded" version proposed in 
this JIRA) is not relevant in itself. Rather, the format selected is actually 
dictated by the {{Pipeline}} API - specifically, a model's prediction output 
schema from {{transform}} must be compatible with the evaluator's input schema 
for {{evaluate}}.

The schema proposed above is - I believe - the only one that is compatible with 
both "linear model" style models such as {{LogisticRegression}} for ad CTR 
prediction and learning-to-rank settings, as well as recommendation tasks.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902639#comment-15902639
 ] 

Nick Pentreath commented on SPARK-14409:


I commented on the [PR for 
SPARK-19535|https://github.com/apache/spark/pull/17090#issuecomment-284648012] 
and am copying that comment here for future reference as it contains further 
detail of the discussion:

=
{noformat}
Sorry if my other comments here and on JIRA were unclear. But the proposed 
schema for input to RankingEvaluator is:

Schema 1

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
You will notice that rating and prediction columns can be null. This is by 
design. There are three cases shown above:

1st row indicates a (user-item) pair that occurs in both the ground-truth set 
and the top-k predictions;
2nd row indicates a (user-item) pair that occurs in the ground-truth set, but 
not in the top-k predictions;
3rd row indicates a (user-item) pair that occurs in the top-k predictions, but 
not in the ground-truth set.
Note for reference, the input to the current mllib RankingMetrics is:

Schema 2

RDD[(true labels array, predicted labels array)],
i.e.
RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
(So actually neither of the above schemas is easily compatible with the return 
schema here - but I think it is not really necessary to match the 
mllib.RankingMetrics format.)

ALS cross-validation

My proposal for fitting ALS into cross-validation is that ALSModel.transform 
will output a DF of Schema 1 - only when the parameters k and recommendFor are 
appropriately set, and the input DF contains both user and item columns. In 
practice, this scenario will occur only during cross-validation.

So what I am saying is that ALS itself (not the evaluator) must know how to 
return the correct DataFrame output from transform such that it can be used in 
a cross-validation as input to the RankingEvaluator.

Concretely:

val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
So while it is complex under the hood - to users it's simply a case of setting 
two params, and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make 
recommendations using these convenience methods (I think it will need a cast):

val recommendations = 
bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
Alternatively, the transform version looks like this:

val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+
val recommendations = bestModel.transform(usersDF)
So the questions:

- should we support the above transform-based recommendations, or only support
  it for cross-validation purposes as a special case?
- if we do, what should the output schema of the above transform version look
  like? It must certainly match the output of the recommendX methods.

The options are:

(1) The schema in this PR:
Pros: as you mention above - also more "compact"
Cons: doesn't match up so closely with the transform "cross-validation" schema 
above

(2) The schema below. It is basically an "exploded" version of option (1)

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
Pros: matches more closely with the cross-validation / evaluator input format. 
Perhaps slightly more "dataframe-like".
Cons: less compact; lose ordering?; may require more munging to save to 
external data stores etc.

Anyway sorry for hijacking this PR discussion - but as I think you can see, the 
evaluator / ALS transform interplay is a bit subtle and requires some thought 
to get the right approach.
{noformat}



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898855#comment-15898855
 ] 

Nick Pentreath commented on SPARK-14409:


[~josephkb] the proposed input schema above encompasses that - the {{labelCol}} 
is the true relevance score (rating, confidence, etc), while the 
{{predictionCol}} is the predicted relevance (rating, confidence, etc). Note we 
can name these columns something more specific ({{labelCol}} and 
{{predictionCol}} are really just re-used from the other evaluators).

This also allows "weighted" forms of ranking metrics later (e.g. some metrics 
can incorporate the true relevance score into the computation, which serves as 
a form of weighting of the metric) - the metrics we currently have don't do 
that. So for now the true relevance can serve as a filter - for example, when 
computing the ranking metric for recommendation, we *don't* want to include 
negative ratings in the "ground truth set of relevant documents" - hence the 
{{goodThreshold}} param above (I would rather call it something like 
{{relevanceThreshold}} myself).
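
As a rough sketch (with illustrative column names, and whatever the param ends 
up being called), the filter amounts to:

{code}
import org.apache.spark.sql.functions.col

// Sketch only: keep just the "relevant" query-document pairs as ground truth.
val relevanceThreshold = 0.0 // hypothetical param value
val groundTruth = dataset
  .where(col("rating").isNotNull && col("rating") > relevanceThreshold)
  .select("userId", "movieId")
{code}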

*Note* that there are 2 formats I detail in my comment above - the first is 
the actual schema of the {{DataFrame}} used as input to the 
{{RankingEvaluator}} - this must therefore be the schema of the DF output by 
{{model.transform}} (whether that is ALS for recommendation, a logistic 
regression for ad prediction, or whatever).

The second format mentioned is simply illustrating the *intermediate internal 
transformation* that the evaluator will do in the {{evaluate}} method. You can 
see a rough draft of it in Danilo's PR 
[here|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95fR93].




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898823#comment-15898823
 ] 

Joseph K. Bradley commented on SPARK-14409:
---

Thanks [~nick.pentre...@gmail.com]!  I like this general approach.  A few 
initial thoughts:

Schema for evaluator:
* Some evaluators will take rating or confidence values as well.  Will those be 
appended as extra columns?
* If a recommendation model like ALSModel returns top K recommendations for 
each user, that will not fit the RankingEvaluator input.  Do you plan to have 
RankingEvaluator or CrossValidator handle efficient calculation of top K 
recommendations?
* Relatedly, I'll comment on the schema in 
[https://github.com/apache/spark/pull/17090] directly in that PR in case we 
want to make changes in a quick follow-up.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897198#comment-15897198
 ] 

Danilo Ascione commented on SPARK-14409:


Thank you [~mlnick] for taking the time to think about this. 

I like the generalization for the most common scenarios. 

The Evaluator approach is already implemented in 
[#16618|https://github.com/apache/spark/pull/16618]. I'll find time to update 
the PR with the proposed generalization and the ranking metrics computations as 
UDFs.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896933#comment-15896933
 ] 

Nick Pentreath commented on SPARK-14409:


I've thought about this a lot over the past few days, and I think the approach 
should be in line with that suggested by [~roberto.mirizzi] & [~danilo.ascione].

*Goal*

Provide a DataFrame-based ranking evaluator that is general enough to handle 
common scenarios such as recommendations (ALS), search ranking, ad click 
prediction using ranking metrics (e.g. recent Kaggle competitions for 
illustration: [Outbrain Ad Clicks using 
MAP|https://www.kaggle.com/c/outbrain-click-prediction#evaluation], [Expedia 
Hotel Search Ranking using 
NDCG|https://www.kaggle.com/c/expedia-personalized-sort#evaluation]).

*RankingEvaluator input format*

{{evaluate}} would take a {{DataFrame}} with columns:

* {{queryCol}} - the column containing "query id" (e.g. "query" for cases such 
as search ranking; "user" for recommendations; "impression" for ad click 
prediction/ranking, etc);
* {{documentCol}} - the column containing "document id" (e.g. "document" in 
search, "item" in recommendation, "ad" in ad ranking, etc);
* {{labelCol}} (or maybe {{relevanceCol}} to be more precise) - the column 
containing the true relevance score for a query-document pair (e.g. in 
recommendations this would be the "rating"). This column will only be used for 
filtering out "irrelevant" documents from the ground-truth set (see Param 
{{goodThreshold}} mentioned 
[above|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15826901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15826901]);
* {{predictionCol}} - the column containing the predicted relevance score for a 
query-document pair. The predicted ids will be ordered by this column for 
computing ranking metrics (for which order matters in predictions but generally 
not for ground-truth which is treated as a set).

The reasoning is that this format is flexible & generic enough to encompass the 
diverse use cases mentioned above.

Here is an illustrative example from recommendations as a special case:

{code}
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
{code}

You will notice that {{rating}} and {{prediction}} columns can be {{null}}. 
This is by design. There are three cases shown above:

# 1st row indicates a query-document (user-item) pair that occurs in *both* the 
ground-truth set and the top-k predictions;
# 2nd row indicates a user-item pair that occurs in the ground-truth set, but 
*not* in the top-k predictions;
# 3rd row indicates a user-item pair that *does not* occur in the ground-truth 
set, but *does* occur in the top-k predictions.

*Note* that while technically the input allows both these columns to be 
{{null}}, in practice that won't occur, since a query-document pair must occur 
in at least one of the ground-truth set or the predictions. If it does occur 
for some reason, it can be ignored.

*Evaluator approach*

The evaluator will perform a window function over {{queryCol}} and order by 
{{predictionCol}} within each query. Then, {{collect_list}} can be used to 
arrive at the following intermediate format:

{code}
+------+--------------------+--------------------+
|userId|         true_labels|    predicted_labels|
+------+--------------------+--------------------+
|   230|[318, 3424, 7139,...|[81191, 93040, 31...|
+------+--------------------+--------------------+
{code}
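
A rough sketch of that transformation, using the column names from the example 
above (a real implementation would also need to pin down the ordering of the 
collected lists):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val k = 10 // the evaluator's k param

// Top-k predicted ids per query, ranked by predicted relevance.
val byPrediction = Window.partitionBy("userId").orderBy(col("prediction").desc)
val predicted = dataset
  .where(col("prediction").isNotNull)
  .withColumn("rank", row_number().over(byPrediction))
  .where(col("rank") <= k)
  .groupBy("userId")
  .agg(collect_list("movieId").as("predicted_labels"))

// Ground-truth ids per query (treated as a set, so ordering is irrelevant).
val actual = dataset
  .where(col("rating").isNotNull)
  .groupBy("userId")
  .agg(collect_list("movieId").as("true_labels"))

val intermediate = actual.join(predicted, Seq("userId"), "outer")
{code}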

*Relationship to RankingMetrics*

Technically the intermediate format above is the same format as used for 
{{RankingMetrics}}, and perhaps we could simply wrap the {{mllib}} version. 
*Note* however that the {{mllib}} class is parameterized by the type of 
"document": {{RankingMetrics[T]}}.

I believe for the generic case we must support both {{NumericType}} and 
{{StringType}} for id columns (rather than restricting to {{Int}} as in 
Danilo's & Roberto's versions above). So either:
# the evaluator must be similarly parameterized; or
# we will need to re-write the ranking metrics computations as UDFs, along the 
lines of: {code}udf { (predicted: Seq[Any], actual: Seq[Any]) => ... }{code} 

I strongly prefer option #2 as it is more flexible and in keeping with the 
DataFrame style of Spark ML components (as a side note, this will give us a 
chance to review the implementations & naming of metrics, since there are some 
issues with a few of the metrics).
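
To make option #2 concrete, here is a sketch of one such UDF; the division by 
k mirrors the {{mllib}} {{precisionAt}} semantics, and the null handling for 
the outer-join case is an assumption:

{code}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Sketch: precision@k over generic id sequences, so Int and String ids both work.
def precisionAtK(k: Int): UserDefinedFunction = udf { (predicted: Seq[Any], actual: Seq[Any]) =>
  if (predicted == null || actual == null || actual.isEmpty) 0.0
  else {
    val actualSet = actual.toSet
    predicted.take(k).count(actualSet.contains).toDouble / k
  }
}
{code}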


That is my proposal (sorry Yong, this is quite different now from the work 
you've done in your PR). If Yong or Danilo has time to update their PR in this 
direction, let me know.

Thanks!


[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-04 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895619#comment-15895619
 ] 

Danilo Ascione commented on SPARK-14409:


I can help with both PRs. Please consider that the solution in [PR 
16618|https://github.com/apache/spark/pull/16618] is a DataFrame-API-based 
version of the one in [PR 12461|https://github.com/apache/spark/pull/12461]. 
Anyway, I'd like to help review an alternative solution. Thanks!




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-24 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883431#comment-15883431
 ] 

Roberto Mirizzi commented on SPARK-14409:
-

[~mlnick] my implementation was conceptually close to what we already have in 
the existing mllib. If you look at the example in 
http://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#ranking-systems
 they do exactly what I do with the goodThreshold parameter.
As you can see in my approach, I am using collect_list and windowing, and I 
simply pass the Dataset to the evaluator, similar to what we have for the other 
evaluators in ml.
IMO, that's the approach that has continuity with the other existing 
evaluators. However, if you think we should also support array columns, we can 
add that too.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883417#comment-15883417
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the reminder. I will take a look and update the PR as 
needed. (I am on the road until next Wednesday. Will try to get it by the end 
of next week.)




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882174#comment-15882174
 ] 

Nick Pentreath commented on SPARK-14409:


The other option is to work with [~danilo.ascione]'s PR here: 
https://github.com/apache/spark/pull/16618 if Yong does not have time to update.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882163#comment-15882163
 ] 

Nick Pentreath commented on SPARK-14409:


[~roberto.mirizzi] the {{goodThreshold}} param seems pretty reasonable in this 
context to exclude irrelevant items. I think it can be a good {{expertParam}} 
addition.

Ok, I think that a first pass at this should just aim to replicate what we have 
exposed in {{mllib}} and wrap {{RankingMetrics}}. Initially we can look at: (a) 
supporting numeric columns and doing the windowing & {{collect_list}} approach 
to feed into {{RankingMetrics}}; (b) supporting Array columns and feeding 
directly into {{RankingMetrics}}; or (c) supporting both.
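
For (a)/(b), the wrapping step itself is small. A sketch, assuming integer ids 
and array columns named {{true_labels}}/{{predicted_labels}} (the names are 
illustrative):

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.DataFrame

// Sketch: bridge a DataFrame of id arrays to the existing mllib RankingMetrics.
def meanAveragePrecision(df: DataFrame): Double = {
  val rdd = df.select("predicted_labels", "true_labels").rdd.map { row =>
    (row.getSeq[Int](0).toArray, row.getSeq[Int](1).toArray)
  }
  new RankingMetrics(rdd).meanAveragePrecision
}
{code}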

[~yongtang] already did a PR here: https://github.com/apache/spark/pull/12461. 
It is fairly complete and also includes MRR. [~yongtang] are you able to work 
on reviving that PR? If so, [~roberto.mirizzi] [~danilo.ascione] are you able 
to help review that PR?




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880324#comment-15880324
 ] 

Nick Pentreath commented on SPARK-14409:


[~roberto.mirizzi] If using the current {{ALS.transform}} output as input to 
the {{RankingEvaluator}}, as envisaged here, the model will predict a score for 
each {{user-item}} pair in the evaluation set. For each user, the ground truth 
is exactly this distinct set of items. By definition the top-k items ranked by 
predicted score will be in the ground-truth set, since {{ALS}} is only scoring 
{{user-item}} pairs *that already exist in the evaluation set*. So how is it 
possible *not* to get a perfect score, since all top-k recommended items will 
be "relevant"?

Unless you are cutting off the ground truth set at {{k}} too - in which case 
that does not sound like a correct computation to me.

By contrast, if {{ALS.transform}} output a set of top-k items for each user, 
where the items are scored from *the entire set of possible candidate items*, 
then computing the ranking metric of that top-k set against the actual ground 
truth for each user is correct.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880312#comment-15880312
 ] 

Nick Pentreath commented on SPARK-14409:


[~danilo.ascione] Yes, your solution is generic, assuming the input 
{{DataFrame}} is {{| user | item | predicted_score | actual_score |}} and that 
either of {{predicted_score}} or {{actual_score}} could be missing.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826901#comment-15826901
 ] 

Roberto Mirizzi commented on SPARK-14409:
-

[~srowen] I've updated the code to generalize K. 
I've also added a couple of lines to deal with NaN (it probably could be 
further generalized, but it's a good start).

In the code I propose I simply re-use the class 
*org.apache.spark.mllib.evaluation.RankingMetrics*, already available in Spark 
since 1.2.0. The class only offers *p@k*, *ndcg@k* and *map* (as you can also 
see here: 
https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html#ranking-systems).
 That's why they are the only ones available in my implementation. 
AUC and ROC are under *BinaryClassificationMetrics*. I haven't wrapped them 
yet, but I could do that too later. 

The motivation behind *goodThreshold* is that the ground truth may also contain 
items that the user doesn't like. However, when you compute an accuracy metric, 
you want to make sure you compare only against the set of items that the user 
likes. As you can see in my code it's set to 0 by default, so unless specified, 
everything in the user profile will be considered.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826309#comment-15826309
 ] 

Danilo Ascione commented on SPARK-14409:


[~srowen] [~mlnick] Also, about the top-k problem ("You are comparing the top-k 
items as predicted by the model to the top-k items as defined by the label."): 
my solution is different in this respect - it evaluates each label (from the 
user-item pair) against the top-k items as predicted by the model (for each 
user). Does this make sense to you?




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826285#comment-15826285
 ] 

Danilo Ascione commented on SPARK-14409:


[~mlnick] This is a snippet to illustrate how I have dealt with the "null" 
problem:
{code}
val predictionAndLabels: DataFrame = dataset
  .join(topAtk, Seq($(queryCol)), "outer") // outer join to deal with nulls in the "label" column
  .withColumn("topAtk", coalesce(col("topAtk"), mapToEmptyArray_())) // coalesce to deal with nulls in the "prediction" column
  .select($(labelCol), "topAtk")
{code}

From line 111 of 
[RankingEvaluator|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95f]
 (I opened a PR for better readability).




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826248#comment-15826248
 ] 

Apache Spark commented on SPARK-14409:
--

User 'daniloascione' has created a pull request for this issue:
https://github.com/apache/spark/pull/16618




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826186#comment-15826186
 ] 

Nick Pentreath commented on SPARK-14409:


Yes, to be more clear, I would expect that the {{k}} param would be specified 
as in Danilo's version, for example. I do like the use of windowing to achieve 
the sort within each user.

This approach would also not work well with purely implicit data (unweighted). 
If everything is relevant in the ground truth then the model would score 
perfectly each time. It sort of works for the explicit rating case or the 
implicit case with "preference weights" since the ground truth then has an 
inherent ordering. 

Still, I think the evaluator must be able to deal with the case of generating 
recommendations from the full item set. This means that the "label" and 
"prediction" columns could contain nulls.
E.g. where an item exists in the ground truth but is not recommended (hence no 
score), the "prediction" column would be null, while if an item is recommended 
but is not in the ground truth, the "label" column would be null. See my 
comments in SPARK-13857 for details.
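
A tiny sketch of how those nulls arise (the numbers match the illustrative 
example used elsewhere in this thread; an outer join is one way to produce 
them, not necessarily how an implementation would):

{code}
import spark.implicits._ // assumes a SparkSession named spark

// Ground truth vs. top-k recommendations, aligned on (user, item).
val groundTruth = Seq((230, 318, 5.0), (230, 3424, 4.0))
  .toDF("userId", "movieId", "rating")
val topK = Seq((230, 318, 4.2403245), (230, 81191, 4.317455))
  .toDF("userId", "movieId", "prediction")

// Rows present on only one side get null in the other side's column.
val evalInput = groundTruth.join(topK, Seq("userId", "movieId"), "outer")
{code}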




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826129#comment-15826129
 ] 

Sean Owen commented on SPARK-14409:
---

BTW [~roberto.mirizzi] there are much simpler ways to write your match 
statements with regexes if needed, and no reason to arbitrarily support only k 
<= 10. We usually move to a pull request with [WIP] in the title if trying to 
review substantial code but maybe we're not there yet.

What's the need for goodThreshold? All of the ranking metrics supported here 
are a function of the top k predictions and the top k "ground truth" relevant 
items from a held-out set, typically. I think this is also implementable as 
top-k per user query, but based on label rather than prediction.

This is probably a workable design to support precision and recall and MAP, but 
I don't think it's a design that will support more general ranking metrics like 
AUC. Hm, I haven't thought this through, but maybe the existing, separate 
support for AUC would cover this case? I know it exists in MLlib.

I agree that this would have to be applied to the original data set, and not to 
a subset as picked out by ALS. You are comparing the top-k items as predicted 
by the model to the top-k items as defined by the label. I'm accustomed to 
actually holding out those top-k from training too. I don't know how easy that 
is to work into this design, and at some scale, it probably won't skew the 
evaluation too much. But if the model is given all the answers, including all 
the top-k best items, then we're really just testing its ability to reconstruct 
the input, and a model that trivially returns answers based on the input data 
directly would score perfectly.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826078#comment-15826078
 ] 

Nick Pentreath commented on SPARK-14409:


[~danilo.ascione] [~roberto.mirizzi] thanks for the code examples. Both seem 
reasonable and I like the DataFrame-based solutions here. The ideal solution 
would likely take a few elements from each design.

One aspect that concerns me is how you are generating recommendations from 
ALS. It appears that you will be using the current output of {{ALS.transform}}. 
So you're computing a ranking metric in a scenario where you only recommend the 
subset of user-item combinations that occur in the evaluation data set - sort 
of a "re-ranking" evaluation metric, in a sense. I'd expect the ranking metric 
here to quite dramatically overestimate true performance, since in the real 
world you would generate recommendations from the complete set of available 
items.

cc [~srowen] thoughts?




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825707#comment-15825707
 ] 

Danilo Ascione commented on SPARK-14409:


I have implemented a DataFrame-API-based RankingEvaluator that can be used in 
a model selection pipeline (cross-validation). 
The approach is similar to that of [~roberto.mirizzi].
I posted some usage code in 
[SPARK-13857|https://issues.apache.org/jira/browse/SPARK-13857?focusedCommentId=15822021&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15822021]
 for discussion. 

Code is here 
https://github.com/daniloascione/spark/commit/c93ab86d35984e9f70a3b4f543fb88f5541333f0

I would appreciate some feedback. Thanks!




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-16 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824774#comment-15824774
 ] 

Roberto Mirizzi commented on SPARK-14409:
-

I implemented the RankingEvaluator to be used with ALS. Here's the code

{code:scala}
package org.apache.spark.ml.evaluation

import org.apache.spark.annotation.Experimental
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.{Params, Param, ParamMap, ParamValidators}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, DoubleType, FloatType}

/**
  * Created by Roberto Mirizzi on 12/5/16.
  */
/**
  * :: Experimental ::
  * Evaluator for ranking, which expects two input columns: prediction and label.
  */
@Experimental
final class RankingEvaluator(override val uid: String)
  extends Evaluator with HasUserCol with HasItemCol with HasPredictionCol
    with HasLabelCol with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("rankEval"))

  /**
    * Param for metric name in evaluation. Supports:
    * - `"map"` (default): mean average precision
    * - `"p@k"`: precision@k (1 <= k <= 10)
    * - `"ndcg@k"`: normalized discounted cumulative gain@k (1 <= k <= 10)
    *
    * @group param
    */
  val metricName: Param[String] = {
    val allowedParams = ParamValidators.inArray(Array("map",
      "p@1", "p@2", "p@3", "p@4", "p@5", "p@6", "p@7", "p@8", "p@9", "p@10",
      "ndcg@1", "ndcg@2", "ndcg@3", "ndcg@4", "ndcg@5",
      "ndcg@6", "ndcg@7", "ndcg@8", "ndcg@9", "ndcg@10"))
    new Param(this, "metricName", "metric name in evaluation " +
      "(map|p@1|p@2|p@3|p@4|p@5|p@6|p@7|p@8|p@9|p@10|" +
      "ndcg@1|ndcg@2|ndcg@3|ndcg@4|ndcg@5|ndcg@6|ndcg@7|ndcg@8|ndcg@9|ndcg@10)",
      allowedParams)
  }

  val goodThreshold: Param[String] =
    new Param(this, "goodThreshold", "threshold for good labels")

  /** @group getParam */
  def getMetricName: String = $(metricName)

  /** @group setParam */
  def setMetricName(value: String): this.type = set(metricName, value)

  /** @group getParam */
  def getGoodThreshold: Double = $(goodThreshold).toDouble

  /** @group setParam */
  def setGoodThreshold(value: Double): this.type = set(goodThreshold, value.toString)

  /** @group setParam */
  def setUserCol(value: String): this.type = set(userCol, value)

  /** @group setParam */
  def setItemCol(value: String): this.type = set(itemCol, value)

  /** @group setParam */
  def setLabelCol(value: String): this.type = set(labelCol, value)

  /** @group setParam */
  def setPredictionCol(value: String): this.type = set(predictionCol, value)

  setDefault(metricName -> "map")
  setDefault(goodThreshold -> "0")

  override def evaluate(dataset: Dataset[_]): Double = {
    val spark = dataset.sparkSession
    import spark.implicits._

    val schema = dataset.schema
    SchemaUtils.checkNumericType(schema, $(userCol))
    SchemaUtils.checkNumericType(schema, $(itemCol))
    SchemaUtils.checkColumnTypes(schema, $(labelCol), Seq(DoubleType, FloatType))
    SchemaUtils.checkColumnTypes(schema, $(predictionCol), Seq(DoubleType, FloatType))

    val windowByUserRankByPrediction =
      Window.partitionBy(col($(userCol))).orderBy(col($(predictionCol)).desc)
    val windowByUserRankByRating =
      Window.partitionBy(col($(userCol))).orderBy(col($(labelCol)).desc)

    val predictionDataset = dataset.select(col($(userCol)).cast(IntegerType),
      col($(itemCol)).cast(IntegerType),
      col($(predictionCol)).cast(FloatType),
      row_number().over(windowByUserRankByPrediction).as("rank"))
      .where(s"rank <= 10")
      .groupBy(col($(userCol)))
      .agg(collect_list(col($(itemCol))).as("prediction_list"))
      .withColumnRenamed($(userCol), "predicted_userId")
      .as[(Int, Array[Int])]

    predictionDataset.show()

    //// alternative to the above query
    //dataset.createOrReplaceTempView("sortedRanking")
    //spark.sql("SELECT _1 AS predicted_userId, collect_list(_2) AS prediction_list FROM " +
    //  "(SELECT *, row_number() OVER (PARTITION BY _1 ORDER BY _4 DESC) AS rank FROM sortedRanking) x " +
    //  "WHERE rank <= 10 " +
    //  "GROUP BY predicted_userId").as[(Int, Array[Int])]

    val actualDataset = dataset.select(col($(userCol)).cast(IntegerType),
      col($(itemCol)).cast(IntegerType),
      row_number().over(windowByUserRankByRating))
      .where(col($(labelCol)).cast(DoubleType) > $(goodThreshold))
      .groupBy(col($(userCol)))
      .agg(collect_list(col($(itemCol))).as("actual_list"))
      .withColumnRenamed($(userCol), "actual_userId")
      .as[(Int, Array[Int])]

    actualDataset.show()

    val predictionAndLabels = 

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, initially I thought I could easily call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble in the implementation, because the
{code}
@Since("2.0.0")
override def evaluate(dataset: Dataset[_]): Double
{code}
in {{RankingEvaluator}} is not so easy to convert into RankingMetrics's input 
({{RDD[(Array[T], Array[T])]}}).

I will do some further investigation. If I cannot find an easy way to convert 
the data set into a generic {{RDD[(Array[T], Array[T])]}}, I will implement the 
methods directly in the new ml.evaluation (instead of calling 
mllib.evaluation).




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245077#comment-15245077
 ] 

Apache Spark commented on SPARK-14409:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12461




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240797#comment-15240797
 ] 

Nick Pentreath commented on SPARK-14409:


[~yongtang] [~josephkb] it would also be useful to try to ensure that the 
{{RankingEvaluator}} can handle more general ranking problems than 
recommendations, e.g. 
https://www.kaggle.com/c/expedia-personalized-sort/details/evaluation, 
https://www.kaggle.com/c/yandex-personalized-web-search-challenge and 
http://research.microsoft.com/en-us/projects/mslr/. Perhaps we can use some of 
these datasets to decide on the input data schema semantics etc.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-13 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240607#comment-15240607
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] [~josephkb]. Yes, I think wrapping RankingMetrics could be the 
first step, and reimplementing all RankingEvaluator methods in ML using 
DataFrames would be good after that. I will work on the reimplementation in 
several follow-up PRs.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240252#comment-15240252
 ] 

Joseph K. Bradley commented on SPARK-14409:
---

Thanks for writing this!  I just made a few comments too.  Wrapping 
RankingMetrics seems fine to me, though later on it would be worth 
re-implementing it using DataFrames and measuring the performance difference.  
The initial PR should not add new metrics, but follow-up ones can.

Also, we'll need a follow-up issue to think about how to use ALS with 
CrossValidator.  I'll comment on the linked JIRA for that.
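
Purely as a sketch of that end goal: once a ranking evaluator exists, ALS 
model selection could look like the following. Note that {{RankingEvaluator}} 
and its setters are hypothetical here - only ALS, ParamGridBuilder and 
CrossValidator are existing APIs:

{code:scala}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val als = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")

// Hypothetical evaluator under discussion in this JIRA; not a real class yet.
val evaluator = new RankingEvaluator()
  .setMetricName("map")
  .setK(10)

val grid = new ParamGridBuilder()
  .addGrid(als.rank, Array(10, 20))
  .addGrid(als.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
{code}

One design constraint here is that ALS's transform() output schema must be 
compatible with whatever input schema the evaluator expects.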




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238707#comment-15238707
 ] 

Nick Pentreath commented on SPARK-14409:


Given the amount of existing code in mllib's RankingMetrics, I would go with 
your suggested approach of adding MRR to RankingMetrics and wrapping that. 
That way the new metric is also available to users of the old mllib API.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-12 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238462#comment-15238462
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the review. I was planning to add MRR to RankingMetrics 
and then wrap that as a first step, but if you think it makes sense I can 
reimplement it from scratch instead. Please let me know which way would be 
better and I will move forward with it. Thanks.
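
For reference, MRR is the mean over queries of the reciprocal rank of the 
first relevant document. A minimal sketch of that computation, over the same 
(predicted documents, relevant documents) pairs that RankingMetrics consumes - 
the method name is illustrative:

{code:scala}
import org.apache.spark.rdd.RDD

// Sketch only: mean reciprocal rank over per-query ranking pairs.
def meanReciprocalRank[T](predictionAndLabels: RDD[(Array[T], Array[T])]): Double = {
  predictionAndLabels.map { case (predicted, relevant) =>
    val relevantSet = relevant.toSet
    val firstHit = predicted.indexWhere(relevantSet.contains)
    // 1 / rank of the first relevant document, or 0 if none was retrieved.
    if (firstHit >= 0) 1.0 / (firstHit + 1) else 0.0
  }.mean()
}
{code}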




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236784#comment-15236784
 ] 

Nick Pentreath commented on SPARK-14409:


Thanks for working up the design doc; I made a few comments. Overall I think 
this makes sense - do you plan to reimplement from scratch, or add MRR to 
RankingMetrics and wrap that?




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236541#comment-15236541
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] [~josephkb] I added a short doc on Google Drive with commenting enabled:
https://docs.google.com/document/d/1YEvf5eEm2vRcALJs39yICWmUx6xFW5j8DvXFWbRbStE/edit?usp=sharing
Please let me know if you have any feedback. Thanks.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-06 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228741#comment-15228741
 ] 

Yong Tang commented on SPARK-14409:
---

[~josephkb] Sure. Let me do some investigation into what other libraries do, 
then I will add a design doc.




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228719#comment-15228719
 ] 

Joseph K. Bradley commented on SPARK-14409:
---

If you do work on it, it would be useful to post a short design doc, since 
there are more metric options for ranking evaluation than for classification 
and regression.  This could include looking at what other libraries support 
and what is commonly used in the literature.
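
For reference, the ranking metrics most commonly used in the literature, in 
one standard formulation (mllib's RankingMetrics already covers the first 
three; MRR is the notable addition under discussion):

\[
\mathrm{Prec}@k = \frac{1}{|Q|}\sum_{q \in Q}\frac{|\mathrm{rel}_q \cap \mathrm{top}_k(q)|}{k},
\qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{|\mathrm{rel}_q|}\sum_{i=1}^{n_q}\mathbf{1}[d_{q,i} \in \mathrm{rel}_q]\,\mathrm{Prec}_q@i
\]
\[
\mathrm{NDCG}@k = \frac{1}{|Q|}\sum_{q \in Q}\frac{\mathrm{DCG}_q@k}{\mathrm{IDCG}_q@k},
\quad
\mathrm{DCG}_q@k = \sum_{i=1}^{k}\frac{\mathrm{rel}_q(i)}{\log_2(i+1)},
\qquad
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\mathrm{rank}_q}
\]

where \(\mathrm{rank}_q\) is the position of the first relevant result for 
query \(q\) and \(\mathrm{IDCG}_q@k\) is the DCG of the ideal ordering.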




[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-05 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227743#comment-15227743
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] I can work on this issue if no one has started yet. Thanks.
