[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-17 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822021#comment-15822021
 ] 

Danilo Ascione edited comment on SPARK-13857 at 1/17/17 4:12 PM:
-

I have a pipeline similar to [~abudd2014]'s. I have implemented a DataFrame-API-based RankingEvaluator that takes care of getting the top K recommendations at the evaluation phase of the pipeline, and it can be used in a model selection pipeline (cross-validation).
Sample usage code:
{code}
val als = new ALS() //input dataframe (userId, itemId, clicked)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("clicked")
  .setImplicitPrefs(true)

val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.01, 0.1))
  .addGrid(als.alpha, Array(40.0, 1.0))
  .build()

val evaluator = new RankingEvaluator()
  .setMetricName("mpr") // Mean Percentile Rank
  .setLabelCol("itemId")
  .setPredictionCol("prediction")
  .setQueryCol("userId")
  .setK(5) // top K
val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val crossValidatorModel = cv.fit(inputDF)

// Print the average metrics per ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)
{code}

Then the resulting {{bestModel}} from the cross-validation model is used to generate the top K recommendations in batches.
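For illustration, retrieving that best model could look like this (a sketch only; the batch-recommendation method shown in the comment is an assumption, not part of this PR):
{code}
// Sketch: retrieve the best ALS model found by cross-validation
// and use it for batch top-K recommendation.
val bestModel = crossValidatorModel.bestModel.asInstanceOf[ALSModel]
// Hypothetical batch top-K call (method name assumed):
// val topK = bestModel.recommendForAllUsers(5)
{code}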

RankingEvaluator code is here 
[https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95f]

I would appreciate any feedback. Thanks!





> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-12-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15777650#comment-15777650
 ] 

Debasish Das edited comment on SPARK-13857 at 12/26/16 5:57 AM:


Item->item and user->user similarity was done in an old PR of mine; if there is interest I can resend it. It would be nice to see how it compares with the approximate nearest neighbor work from Uber:
https://github.com/apache/spark/pull/6213





[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241360#comment-15241360
 ] 

Nick Pentreath edited comment on SPARK-13857 at 4/14/16 3:33 PM:
-

For now I won't do this, but later we could add a Param {{recommendType}} for the type of recommendation strategy: {{recommend}} for recommendAll, or {{similar}} for item-item or user-user similarity.





[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237074#comment-15237074
 ] 

Nick Pentreath edited comment on SPARK-13857 at 4/12/16 12:28 PM:
--

Do we want to support user-user and item-item similarity computation too? It's expensive in general (for a small item set one can broadcast the item vectors, or use {{columnSimilarities}} on a transposed {{RowMatrix}}, but this is not feasible in large-scale cases). But it's not necessarily more expensive than top-k items for each user (depending on the user and item set sizes involved). Or at least, if we offer user-item top-k, is there a reason _not_ to offer item-item top-k similar items?
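For a small item set, the broadcast approach could be sketched as follows (illustrative brute-force cosine similarity over factor vectors already collected to the driver; not a proposed API):
{code}
// Brute-force top-k similar items over broadcast item factor vectors.
// Only feasible when the item set is small enough to collect.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

def topKSimilar(itemVecs: Map[Int, Array[Double]], itemId: Int, k: Int): Seq[(Int, Double)] =
  itemVecs.toSeq
    .collect { case (id, v) if id != itemId => (id, cosine(itemVecs(itemId), v)) }
    .sortBy(-_._2)
    .take(k)
{code}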





[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237067#comment-15237067
 ] 

Nick Pentreath edited comment on SPARK-13857 at 4/12/16 12:22 PM:
--

My main point is that in cross-validation, essentially the "problem" is that we need the input dataset to contain the ground-truth "actual" column for each unique user (not for each row in the original input DF). The format of the input DF for {{fit}} is not compatible with that of (the proposed) {{RankingEvaluator.evaluate}}, and {{TrainValidationSplit.fit}} takes only one DF (which is passed to both {{Estimator.fit}} and {{Evaluator.evaluate}}), e.g.

{code}
// input DF for ALS.fit
+----+----+------+
|user|item|rating|
+----+----+------+
|   1|  10|   1.0|
|   1|  20|   3.0|
|   1|  30|   5.0|
|   2|  20|   5.0|
|   2|  40|   3.0|
|   3|  10|   5.0|
|   3|  30|   4.0|
|   3|  40|   1.0|
+----+----+------+
// input DF for RankingEvaluator.evaluate
+----+----------+---------+
|user|      topk|   actual|
+----+----------+---------+
|   1|  [10, 20]| [40, 20]|
|   2|  [30, 40]| [30, 10]|
|   3|  [20, 30]| [20, 30]|
+----+----------+---------+
{code}

My point in #2 above was that we could have ALS handle it:
{code}
val input: DataFrame = ...
+----+----+------+
|user|item|rating|
+----+----+------+
|   1|  10|   1.0|
|   1|  20|   3.0|
|   1|  30|   5.0|
|   2|  20|   5.0|
|   2|  40|   3.0|
|   3|  10|   5.0|
|   3|  30|   4.0|
|   3|  40|   1.0|
+----+----+------+
val model = als.fit(input)
val predictions =
  model.setK(2).setUserTopKCol("user").setWithActual(true).transform(input)
+----+----------+---------+
|user|      topk|   actual|
+----+----------+---------+
|   1|  [10, 20]| [40, 20]|
|   2|  [30, 40]| [30, 10]|
|   3|  [20, 30]| [20, 30]|
+----+----------+---------+
evaluator.setLabelCol("actual").setPredictionCol("topk").evaluate(predictions)
{code}

... but this requires the input DF to {{transform}} to be the same as for {{fit}}, and requires some processing of that DF which adds overhead (e.g. grouping by user to get the ground-truth items for each user id, and {{input.select("user").distinct}}). However, this overhead is unavoidable for evaluation at least, as one does need to compute the ground truth and the unique user set for making recommendations. This is not "natural" for the case when you just want to make recommendations (e.g. using the best model from evaluation), since you'd normally just want to pass in a DF of users to top-k:
{code}
val input: DataFrame = ...
+----+
|user|
+----+
|   1|
|   3|
|   2|
+----+
model.setK(2).setUserTopKCol("user").transform(input).show
+----+----------+
|user|      topk|
+----+----------+
|   1|  [10, 20]|
|   2|  [30, 40]|
|   3|  [20, 30]|
+----+----------+
{code}
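The grouping overhead described above (deriving the ground-truth items and the unique user set from the rating DF) could be sketched with standard Spark SQL functions (column names assumed):
{code}
import org.apache.spark.sql.functions._
// Ground-truth items per unique user, for the evaluator's "actual" column
val actuals = input.groupBy("user").agg(collect_list("item").as("actual"))
// Unique user set, for making recommendations
val users = input.select("user").distinct()
{code}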

So overall it just feels a little clunky. It feels like it will be somewhat tricky for users to tweak the correct param settings to get it to work, but perhaps it's the best approach, combined with a good example in the docs. Also, [~josephkb] was concerned about having a different type for the prediction column depending on params - but I'd propose we have a separate column for top-k and set the column in the evaluator accordingly (as in the example above).


[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236796#comment-15236796
 ] 

Nick Pentreath edited comment on SPARK-13857 at 4/12/16 9:24 AM:
-

[~mengxr] [~josephkb] [~srowen] I would like to get your thoughts on this.

In an ideal world, this is what train-validation split with ALS with ranking 
evaluation would look like:

{code}
// Prepare training and test data.
val ratings = ...
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// set up ALS with top-k prediction
val als = new ALS()
  .setMaxIter(5)
  .setImplicitPrefs(true)
  .setK(10)
  .setTopKInputCol("user")
  .setTopKOutputCol("topk")

// build param grid
val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.01, 0.05, 0.1))
  .addGrid(als.alpha, Array(1.0, 10.0, 20.0))
  .build()
// ranking evaluator with appropriate prediction column
val evaluator = new RankingEvaluator()
  .setPredictionCol("topk")
  .setMetricName("mapk")
  .setK(10)
  .setLabelCol("actual")
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)

// Run train validation split, and choose the best set of parameters.
val model = trainValidationSplit.fit(training)

// Make predictions on test data. model is the model with the combination
// of parameters that performed best.
model.transform(test)
  .select("user", "actual", "topk")
  .show()
{code}

The issue is that the input dataset to ALS {{fit}} is a DF of {{(userId, itemId, rating)}} rows. The input to {{transform}} with the top-k option enabled is a DF of {{userId}} rows, while the input to {{evaluate}} is a DF of {{(userId, actual)}} rows, where {{actual}} is an array of ground-truth item ids {{(id1, id2, ...)}}. So it doesn't work out of the box.

I see three solutions:
# have {{RankingEvaluator}} and/or the cross-validation classes handle this in 
some generic way (it would be good to understand how other ranking evaluation 
use cases could look in order to also support them).
# have {{ALS}} handle it in {{transform}} - perhaps an option to output a 
{{topk}} column and an {{actual}} column. This would require that the input DF 
to {{transform}} with the top-k option is in the same form as for {{transform}} 
normally. It would require a distinct on the {{userId}} column to only predict 
for unique user ids, and may be a bit convoluted to make it work. In general 
this code path would only really be used in cross-validation (as for predicting 
with the final model, one normally just wants to pass in a DF of user ids).
# have specialized versions of {{TrainValidationSplit}} and {{CrossValidator}} 
to handle the recommendation case.

#3 is not actually as crazy as it may sound - the recommendation case is a little different (the same might be true for, say, learning-to-rank on queries), and even the way to split the dataset into train/test is different (e.g. in recommender systems, data is often sampled by userId, such as splitting a fraction of each user's ratings into the train and test sets).
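A per-user split of the kind mentioned could be approximated as follows (a sketch; {{rand()}} holds out roughly the stated fraction of each user's ratings in expectation rather than exactly per user):
{code}
import org.apache.spark.sql.functions._
// Approximate per-user split: each rating row is independently assigned
// to train (~80%) or test (~20%), so every user's ratings end up split
// in roughly that proportion in expectation.
val withRand = ratings.withColumn("r", rand(seed = 42))
val training = withRand.filter(col("r") <= 0.8).drop("r")
val test = withRand.filter(col("r") > 0.8).drop("r")
{code}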


[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227126#comment-15227126
 ] 

Joseph K. Bradley edited comment on SPARK-13857 at 4/5/16 9:04 PM:
---

I'd prefer to have a consistent schema for a given output column.

I think it would be hard to extend transform() since the number of rows may not 
match.  If transform() is outputting 1 row per training/test instance (a (user, 
item) pair), then it cannot also output 1 row per user or 1 row per item.

I'd prefer to add {{recommendItems}} / {{recommendUsers}} methods for now. If a user has a need for them in a Pipeline, we could later add support within {{transform()}}. I haven't yet thought through how this would interact with model selection/evaluation, though.

How does that sound?





[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696 ]

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:45 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.
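To make the row-count issue with Option 1 concrete: transform() yields one prediction row per (user, item) pair, while a top-K output has one row per user. A plain-Scala sketch of that reshaping (hypothetical data, no Spark):

```scala
object OutputShapes {
  // Hypothetical transform() output: one row per (user, item) pair with a score
  val predictions: Seq[(Int, Int, Double)] =
    Seq((1, 10, 0.9), (1, 20, 0.4), (2, 10, 0.2), (2, 20, 0.8))

  // Option 1 would reshape this into one row per user holding its top-k items,
  // so the output row count no longer matches the input row count
  def topKPerUser(k: Int): Map[Int, Seq[Int]] =
    predictions
      .groupBy(_._1)
      .map { case (user, rows) => user -> rows.sortBy(-_._3).take(k).map(_._2) }
}
```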


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2 - requires (re)specifying the user / item input col in the input DF
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195702#comment-15195702 ]

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM:
-

Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setUserTopKCol("userTopK").transform(users)
{code}
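To make that subset behaviour concrete, a plain-Scala sketch (hypothetical data and names, no Spark) of restricting recommendations to a filtered user set, as SPARK-10802 asks for:

```scala
object SubsetRecs {
  // Hypothetical user metadata and factor data; all values are illustrative only
  val userAges: Map[Int, Int] = Map(1 -> 25, 2 -> 18, 3 -> 40)
  val userFactors: Map[Int, Array[Double]] =
    Map(1 -> Array(1.0, 0.0), 2 -> Array(0.0, 1.0), 3 -> Array(0.2, 0.8))
  val itemFactors: Map[Int, Array[Double]] =
    Map(10 -> Array(0.9, 0.1), 20 -> Array(0.1, 0.9))

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // SPARK-10802 semantics: recommend only for the filtered subset of users,
  // here each user's single best item
  def topItemForUsersOver(age: Int): Map[Int, Int] =
    userFactors
      .filter { case (u, _) => userAges(u) > age }
      .map { case (u, uf) => u -> itemFactors.maxBy { case (_, vf) => dot(uf, vf) }._1 }
}
```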


was (Author: mlnick):
Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setTopKCol("userId").transform(users)
{code}




[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696 ]

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2 - requires (re)specifying the user / item input col in the input DF
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.




[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696 ]

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:41 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.




[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696 ]

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.




[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696 ]

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{predictTopK}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and save the resulting DF - so 
perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1
val topKItemsForUsers = model.setK(10).setTopKCol("userId").transform(df)

// Option 2
val topKItemsForUsers = model.predictTopK("userId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit more neatly 
into the {{Transformer}} API, even though it's a little more clunky.
