[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982695#comment-15982695
 ] 

Nick Pentreath commented on SPARK-13857:


I'm going to close this as superseded by SPARK-19535. 

However, the discussion here should still serve as a reference for making 
{{ALS.transform}} able to support ranking metrics for cross-validation in 
SPARK-14409.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860108#comment-15860108
 ] 

Joseph K. Bradley commented on SPARK-13857:
---

Hi all, catching up on these many ALS discussions now.  This work to support 
evaluation and tuning for recommendation is great, but I'm worried about it not 
being resolved in time for 2.2.  I've heard a lot of requests for the plain 
functionality available in spark.mllib for recommendUsers/Products, so I'd 
recommend we just add those methods for now as a short-term solution.  Let's 
keep working on the evaluation/tuning plans too.  I'll create a JIRA for adding 
basic recommendUsers/Products methods.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-13 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822021#comment-15822021
 ] 

Danilo Ascione commented on SPARK-13857:


I have a pipeline similar to [~abudd2014]'s. I have implemented a 
DataFrame-API-based RankingEvaluator that computes the top K recommendations 
during the evaluation phase of the pipeline, so it can be used in a 
model-selection pipeline (cross-validation). 
Sample usage code:
{code}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// RankingEvaluator is not part of Spark; it is the class from the commit linked below.

// Input DataFrame columns: (userId, itemId, clicked)
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("clicked")
  .setImplicitPrefs(true)

val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.01, 0.1))
  .addGrid(als.alpha, Array(40.0, 1.0))
  .build()

val evaluator = new RankingEvaluator()
  .setMetricName("mpr") // Mean Percentile Rank
  .setLabelCol("itemId")
  .setPredictionCol("prediction")
  .setQueryCol("userId")
  .setK(5) // Top K

val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val crossValidatorModel = cv.fit(inputDF)

// Average metric per ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Pair each param map with its metric to see how the params affect the result
val combined = paramGrid.zip(avgMetricsParamGrid)
{code}

The resulting {{bestModel}} of the cross-validation model is then used to 
generate the top K recommendations in batches.
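
As a rough illustration of the "mpr" metric named above, here is a minimal plain-Scala sketch. It assumes the usual implicit-feedback definition (preference-weighted average of the percentile position of each held-out item, where 0.0 is the top of the list); the linked commit may compute it differently. The names {{Observation}}, {{percentile}} and {{mpr}} are hypothetical and not taken from the commit.

```scala
object MeanPercentileRank {
  /** One held-out interaction. `weight` is the observed preference (e.g. a click count);
    * `rankedItems` is the model's ordering of candidate items for that user, best first.
    * This type is hypothetical and exists only for this sketch. */
  final case class Observation(item: Int, weight: Double, rankedItems: Seq[Int])

  /** Percentile position of `item` in `ranked`: 0.0 for the top slot, 1.0 for the last.
    * Items missing from the list are treated as ranked last (an assumption of this sketch). */
  def percentile(item: Int, ranked: Seq[Int]): Double = {
    val idx = ranked.indexOf(item)
    if (idx < 0) 1.0
    else if (ranked.size == 1) 0.0
    else idx.toDouble / (ranked.size - 1)
  }

  /** Weighted mean of the percentile positions; lower is better. */
  def mpr(observations: Seq[Observation]): Double = {
    val totalWeight = observations.map(_.weight).sum
    require(totalWeight > 0.0, "need at least one positively weighted observation")
    observations.map(o => o.weight * percentile(o.item, o.rankedItems)).sum / totalWeight
  }
}
```

Under this definition a value near 0.5 is what a random ordering would give, so useful models should score well below that.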

RankingEvaluator code is here 
[https://github.com/daniloascione/spark/commit/c93ab86d35984e9f70a3b4f543fb88f5541333f0]

I would appreciate any feedback. Thanks!





[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Alan Budd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821070#comment-15821070
 ] 

Alan Budd commented on SPARK-13857:
---

I just had a short email conversation with [~mlnick] about this JIRA issue. 
I'm very interested in this functionality for my project, which is an 
implicit-feedback ALS recommendation engine for a website, using URLs as the 
items. Essentially, my pipeline will consist of:

# A DataFrame consisting of:
*# user id column (userID)
*# URL column (URLid)
*# an aggregate count for each id/url pair, for the rating/preference column (count)
# Creating a {{ParamGridBuilder()}} to optimize the regularization parameter, {{regParam}}.
# Training the model using {{ALS()}}, with the following:
{code}
val als = new ALS()
  .setMaxIter(5)
  .setImplicitPrefs(true)
  .setUserCol("userID")
  .setItemCol("URLid")
  .setRatingCol("count")
{code}
# Optimizing the {{regParam}} hyperparameter using the {{CrossValidator()}} functionality.

When the ML Pipeline is built using the above steps, resulting in a 
{{org.apache.spark.ml.PipelineModel}} object, the final step will be to use 
this pipeline model to generate the top K recommendations for every user in the 
model (in batches) and export that DataFrame for use in real-time calls.
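
The final step described above (top K per user, produced in batches) can be sketched without Spark as follows. This is only a brute-force illustration over in-memory factor vectors, not how {{ALSModel}} does it; {{BatchTopK}}, {{topK}} and {{recommendInBatches}} are hypothetical names, and the factor maps stand in for the model's user and item factors.

```scala
object BatchTopK {
  // id -> latent factor vector; a stand-in for the model's user/item factors
  type Factors = Map[Int, Array[Double]]

  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  /** Top-k item ids for one user vector, highest score first (ties broken by item id). */
  def topK(user: Array[Double], items: Factors, k: Int): Seq[Int] =
    items.toSeq
      .map { case (id, vec) => (id, dot(user, vec)) }
      .sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1))
      .take(k)
      .map(_._1)

  /** Scores users in fixed-size batches so that each batch can be exported on its own. */
  def recommendInBatches(users: Factors, items: Factors, k: Int,
                         batchSize: Int): Iterator[Map[Int, Seq[Int]]] =
    users.toSeq.sortBy(_._1).grouped(batchSize).map { batch =>
      batch.map { case (userId, vec) => userId -> topK(vec, items, k) }.toMap
    }
}
```

Each batch scores every item for every user in it, so the cost grows with users times items; that is the brute-force behaviour discussed elsewhere in this thread.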

[~mlnick], I hope that this provides some insight into a desired production 
use case and can help drive this issue towards production. On a last note, I 
would definitely encourage plenty of documentation with examples of how to use 
it in an ML Pipeline (or a stand-alone ALS model, i.e. a 
{{org.apache.spark.ml.recommendation.ALSModel}} object) for people who want to 
use it in a production environment. Let me know if you need me to elaborate on 
any further details!




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820747#comment-15820747
 ] 

Nick Pentreath commented on SPARK-13857:


My view is that in practice brute force is never going to be efficient enough 
for any problem of non-trivial size. So I'd definitely like to incorporate the 
ANN stuff into the top-k recommendation here. Once SignRandomProjection is in 
I'll take a deeper look at item-item / user-user sim for DF-API.

We could also add a form of LSH appropriate for dot product space for user-item 
recs.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-12-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15777650#comment-15777650
 ] 

Debasish Das commented on SPARK-13857:
--

item->item and user->user was done in an old PR I had...if there is interest 
I can resend it...nice to see how it compares with approximate nearest neighbor 
work from Uber:
https://github.com/apache/spark/pull/6213




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251765#comment-15251765
 ] 

Apache Spark commented on SPARK-13857:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/12574




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241360#comment-15241360
 ] 

Nick Pentreath commented on SPARK-13857:


For now I won't do this, but later we could add a Param `recommendType` - type 
of recommendation strategy. Could be `recommend` for recommendAll or 
`similarity` for item-item or user-user similarity.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237074#comment-15237074
 ] 

Nick Pentreath commented on SPARK-13857:


Do we want to support user-user and item-item similarity computation too? It's 
expensive in general (in the case of a small item set, one can broadcast the 
item vectors, or use {{columnSimilarities}} on a transposed {{RowMatrix}}, but this is 
not that feasible in large-scale cases). But it's not necessarily more 
expensive than top-k items for each user (depending on the user and item sizes 
involved). Or at least, if we offer user-item top-k, then is there a reason 
_not_ to offer item-item top-k similar items?
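
To make the cost concrete, here is a minimal brute-force sketch of item-item top-k similarity over in-memory item vectors, roughly what the small-item-set case amounts to once the vectors are available on every worker. It uses plain Scala collections rather than Spark, and {{ItemSimilarity}}, {{cosine}} and {{topKSimilar}} are hypothetical names.

```scala
object ItemSimilarity {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  /** For every item, the k most similar other items.
    * Brute force: every item is compared with every other item, so the work is quadratic. */
  def topKSimilar(items: Map[Int, Array[Double]], k: Int): Map[Int, Seq[Int]] =
    items.map { case (id, vec) =>
      val ranked = items.toSeq
        .filter { case (other, _) => other != id }
        .map { case (other, otherVec) => (other, cosine(vec, otherVec)) }
        .sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1))
        .take(k)
        .map(_._1)
      id -> ranked
    }
}
```

The comparison count is items squared here versus users times items for user-item top-k, which is the sense in which neither is necessarily the more expensive one.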




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237067#comment-15237067
 ] 

Nick Pentreath commented on SPARK-13857:


My main point is that in cross-validation, essentially the "problem" is we need 
the input dataset to contain the ground truth "actual" column for each unique 
user (not each row in the original input DF). The format of the input DF for 
{{fit}} is not compatible with that of (the proposed) 
{{RankingEvaluator.evaluate}}, and {{TrainValidateSplit.fit}} takes only one DF 
(which is passed to both {{Estimator.fit}} and {{Evaluator.evaluate}}), e.g.

{code}
// input DF for ALS.fit
+----+----+------+
|user|item|rating|
+----+----+------+
|   1|  10|   1.0|
|   1|  20|   3.0|
|   1|  30|   5.0|
|   2|  20|   5.0|
|   2|  40|   3.0|
|   3|  10|   5.0|
|   3|  30|   4.0|
|   3|  40|   1.0|
+----+----+------+
// input DF for RankingEvaluator.evaluate
+----+----------+---------+
|user|   topk   |  actual |
+----+----------+---------+
|   1|  [10, 20]| [40, 20]|
|   2|  [30, 40]| [30, 10]|
|   3|  [20, 30]| [20, 30]|
+----+----------+---------+
{code}

My point in #2 above was that we could have ALS handle it:
{code}
val input: DataFrame = ...
+----+----+------+
|user|item|rating|
+----+----+------+
|   1|  10|   1.0|
|   1|  20|   3.0|
|   1|  30|   5.0|
|   2|  20|   5.0|
|   2|  40|   3.0|
|   3|  10|   5.0|
|   3|  30|   4.0|
|   3|  40|   1.0|
+----+----+------+
val model = als.fit(input)
val predictions = model.setK(2).setUserTopKCol("user").setWithActual(true).transform(input)
+----+----------+---------+
|user|   topk   |  actual |
+----+----------+---------+
|   1|  [10, 20]| [40, 20]|
|   2|  [30, 40]| [30, 10]|
|   3|  [20, 30]| [20, 30]|
+----+----------+---------+
evaluator.setLabelCol("actual").setPredictionCol("topk").evaluate(predictions)
{code}

.. but this requires the input DF to {{transform}} to be the same as for 
{{fit}} , and requires some processing of that DF which adds some overhead 
(e.g. grouping by user to get the ground truth items for each user id, and 
{{input.select("user").distinct}}). However, this overhead is unavoidable for 
evaluation at least, as one does need to compute the ground truth and the 
unique user set for making recommendations. This is not "natural" for the case 
when you just want to make recommendations (e.g. using the best model from 
evaluation), since you'd normally just want to pass in a DF of users to top-k:
{code}
val input: DataFrame = ...
+----+
|user|
+----+
|   1|
|   3|
|   2|
+----+
model.setK(2).setUserTopKCol("user").transform(input).show
+----+----------+
|user|   topk   |
+----+----------+
|   1|  [10, 20]|
|   2|  [30, 40]|
|   3|  [20, 30]|
+----+----------+
{code}

So overall it just feels a little clunky. It will probably be somewhat tricky 
for users to find the correct param settings to get it to work, but perhaps 
it's the best approach, combined with a good example in the docs. Also, 
[~josephkb] was concerned about the prediction column having a different type 
depending on params, but I'd propose we have a separate column for top-k and 
set the column in the evaluator accordingly (as in the example above).
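
The preprocessing overhead mentioned above (grouping by user for the ground truth, plus the distinct user set) can be sketched on plain Scala collections as follows. This is only an analogue of the DataFrame operations, with hypothetical names ({{Rating}}, {{distinctUsers}}, {{actualItems}}); it groups the rating rows themselves, so its output is not meant to reproduce the illustrative {{actual}} column in the tables.

```scala
object GroundTruth {
  final case class Rating(user: Int, item: Int, rating: Double)

  /** Distinct user ids: the plain-collections analogue of input.select("user").distinct. */
  def distinctUsers(rows: Seq[Rating]): Seq[Int] = rows.map(_.user).distinct.sorted

  /** Items each user interacted with, ordered by descending rating (ties by item id). */
  def actualItems(rows: Seq[Rating]): Map[Int, Seq[Int]] =
    rows.groupBy(_.user).map { case (user, rs) =>
      val ordered = rs.sortWith((a, b) =>
        a.rating > b.rating || (a.rating == b.rating && a.item < b.item))
      user -> ordered.map(_.item)
    }

  // The ratings rows from the example tables in this comment.
  val example: Seq[Rating] = Seq(
    Rating(1, 10, 1.0), Rating(1, 20, 3.0), Rating(1, 30, 5.0),
    Rating(2, 20, 5.0), Rating(2, 40, 3.0),
    Rating(3, 10, 5.0), Rating(3, 30, 4.0), Rating(3, 40, 1.0))
}
```

Both steps need a pass over all rating rows, and the grouping implies a shuffle in a distributed setting, which is the overhead that is hard to avoid when evaluating.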




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237001#comment-15237001
 ] 

Sean Owen commented on SPARK-13857:
---

Yeah, the semantics of a recommender are different from a simple supervised 
learning problem. There are two key operations: recommend items to users, and 
make a point estimate for a user-item pair. (Recommending users to items is 
analogous.) These require different prediction and evaluation semantics. To 
make it work the behavior must vary according to the structure of the input DF. 

If the DF only has a user column, then output a recommended item column 
containing a list of item IDs. If the DF has a user and item column, output an 
estimated rating/strength column. This then implies that for evaluation, the 
input DF has to have these output columns, respectively, for comparison.

I think this is just restating what's above, but is this possible and then is 
this not the most direct way to solve this?
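
A minimal sketch of that dispatch rule, assuming the decision is made purely from the set of input column names. Nothing here is an existing Spark API; {{TransformMode}}, {{infer}} and the default column names are hypothetical.

```scala
object TransformMode {
  sealed trait Mode
  case object PointEstimate extends Mode  // user and item columns: one score per pair
  case object RecommendItems extends Mode // only a user column: one item list per user
  case object RecommendUsers extends Mode // only an item column: one user list per item

  /** Picks the prediction mode from the input schema, or reports why none applies. */
  def infer(columns: Set[String],
            userCol: String = "user",
            itemCol: String = "item"): Either[String, Mode] =
    (columns.contains(userCol), columns.contains(itemCol)) match {
      case (true, true)   => Right(PointEstimate)
      case (true, false)  => Right(RecommendItems)
      case (false, true)  => Right(RecommendUsers)
      case (false, false) => Left(s"expected a '$userCol' and/or '$itemCol' column")
    }
}
```

One consequence of choosing the mode from the schema is that a DF of held-out ratings, which has both columns, always lands in the point-estimate mode, so ranking evaluation would need a differently shaped input.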




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236796#comment-15236796
 ] 

Nick Pentreath commented on SPARK-13857:


[~mengxr] [~josephkb]

In an ideal world, this is what train-validation split with ALS would look like:

{code}
// Prepare training and test data.
val ratings = ...
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// set up ALS with top-k prediction
val als = new ALS()
  .setMaxIter(5)
  .setImplicitPrefs(true)
  .setK(10)
  .setTopKInputCol("user")
  .setTopKOutputCol("topk")

// build param grid
val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.01, 0.05, 0.1))
  .addGrid(als.alpha, Array(1.0, 10.0, 20.0))
  .build()
// ranking evaluator with appropriate prediction column
val evaluator = new RankingEvaluator()
  .setPredictionCol("topk")
  .setMetricName("mapk")
  .setK(10)
  .setLabelCol("actual")
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for
  // validation.
  .setTrainRatio(0.8)

// Run train validation split, and choose the best set of parameters.
val model = trainValidationSplit.fit(training)

// Make predictions on test data. model is the model with the combination of
// parameters that performed best.
model.transform(test)
  .select("user", "actual", "topk")
  .show()
{code}
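
For reference, the "mapk" metric name above presumably stands for mean average precision at k. A minimal plain-Scala sketch follows, assuming the convention that normalizes each user's score by min(|actual|, k); other conventions exist, and a real {{RankingEvaluator}} may choose differently. {{MapAtK}} and its methods are hypothetical names.

```scala
object MapAtK {
  /** Average precision over the first k predictions for one user.
    * Assumption of this sketch: the score is normalized by min(|actual|, k). */
  def averagePrecisionAtK(predicted: Seq[Int], actual: Set[Int], k: Int): Double =
    if (actual.isEmpty || k <= 0) 0.0
    else {
      var hits = 0
      var score = 0.0
      predicted.distinct.take(k).zipWithIndex.foreach { case (item, i) =>
        if (actual.contains(item)) {
          hits += 1
          score += hits.toDouble / (i + 1) // precision at this cut-off
        }
      }
      score / math.min(actual.size, k)
    }

  /** Mean of the per-user average precision; rows are (topk, actual) pairs. */
  def meanAveragePrecisionAtK(rows: Seq[(Seq[Int], Set[Int])], k: Int): Double =
    if (rows.isEmpty) 0.0
    else rows.map { case (p, a) => averagePrecisionAtK(p, a, k) }.sum / rows.size
}
```

The input is one row per user holding a predicted list and a ground-truth set, which is the same shape as the {{topk}} and {{actual}} columns selected at the end of the example.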




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228869#comment-15228869
 ] 

Nick Pentreath commented on SPARK-13857:


[~josephkb] an {{Evaluator}} might work with ALS by itself, allowing some basic 
manual evaluation metrics on the recommender model, but it won't work as part 
of {{TrainValidationSplit}} or {{CrossValidator}}, as these work through 
{{transform}}.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-05 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227818#comment-15227818
 ] 

Nick Pentreath commented on SPARK-13857:


I was thinking the behaviour of {{transform}} can depend on the input params. 
By default, it takes a DF with cols {{userId}} and {{itemId}}, and outputs 
predictions for each {{user, item}} pair. If {{setK(10).setTopKCol("userId")}} 
is called, then it takes a DF with only one input col {{"userId"}} and outputs 
top-k predictions for each user (and vice-versa for items).

The input DF would be different for each approach (the former is likely to be 
say the "test set" in evaluation for something like RMSE, while the latter is 
the set of userIds for ranking-style evaluation or making the type of 
predictions actually useful in practice).

In fact, the current {{transform}} is in practice only useful for evaluating 
RMSE. If users try to use this method for top-k style predictions it is then 
very inefficient.
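
To illustrate the two points above, here is a minimal plain-Scala sketch of what a per-pair prediction and its RMSE amount to, plus a count of how many pairs a per-pair {{transform}} would have to score to rank every item for every user. It is an illustration under the usual matrix-factorization assumption that a prediction is the dot product of the user and item factors; {{PointwiseEvaluation}} and its methods are hypothetical names.

```scala
object PointwiseEvaluation {
  /** Predicted preference for one (user, item) pair: dot product of the two factor vectors. */
  def predict(user: Array[Double], item: Array[Double]): Double =
    user.zip(item).map { case (u, i) => u * i }.sum

  /** Root mean squared error over (prediction, rating) pairs. */
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, rating) pair")
    math.sqrt(pairs.map { case (p, r) => (p - r) * (p - r) }.sum / pairs.size)
  }

  /** Rows a per-pair transform must score to rank every item for every user. */
  def pairsForTopK(numUsers: Long, numItems: Long): Long = numUsers * numItems
}
```

With, say, a million users and a hundred thousand items, {{pairsForTopK}} comes to 10^11 rows that would have to be materialized as input before any ranking happens.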

I will dig into evaluators a bit and see if there is a way to accommodate 
evaluation using the other methods {{recommendItems}} etc.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227126#comment-15227126
 ] 

Joseph K. Bradley commented on SPARK-13857:
---

I'd prefer to have a consistent schema for a given output column.

I think it would be hard to extend transform() since the number of rows may not 
match.  If transform() is outputting 1 row per training/test instance (a (user, 
item) pair), then it cannot also output 1 row per user or 1 row per item.

I'd prefer to add recommendItems, recommendUsers methods for now.  If a user 
has a need for them in a Pipeline, we could later add support within 
transform().

How does that sound?




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227107#comment-15227107
 ] 

Joseph K. Bradley commented on SPARK-13857:
---

I like the RegressionEvaluator + RankingEvaluator option too.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227103#comment-15227103
 ] 

Joseph K. Bradley commented on SPARK-13857:
---

Linking [SPARK-14412], which is the only other missing item I know of for ALS.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203268#comment-15203268
 ] 

Nick Pentreath commented on SPARK-13857:


Another option here is just to use {{predictionCol}}. What is the general 
opinion of {{transform}} returning a different schema for {{predictionCol}} 
depending on the params? e.g. {{val predictions = model.transform(df)}} would 
return a {{Double}} col, while {{val predictions = 
model.setK(10).setTopKCol("userId").transform(df)}} would return a column of 
{{Array}} for the top-k predictions.


> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203264#comment-15203264
 ] 

Nick Pentreath commented on SPARK-13857:


I'll work up something - I think an evaluator for recommendation could support 
"regression" metrics (for explicit) or ranking (for both cases). We could (a) 
create a special {{RecommendationEvaluator}}, or (b) simply re-use 
{{RegressionEvaluator}} and create a {{RankingEvaluator}} - we should probably 
do (b), but I guess we could add a convenience {{RecommendationEvaluator}}.




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202240#comment-15202240
 ] 

Xiangrui Meng commented on SPARK-13857:
---

+1. We need to figure out the semantics in a pipeline context and how to 
connect with evaluation. Could you suggest some example code for 
cross-validating ALS with a ranking metric like NDCG?
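
As a rough illustration of what such a ranking metric computes, here is a minimal plain-Scala sketch of NDCG@k with binary relevance, averaged over users. It is not the Spark evaluator API; the object and function names are hypothetical, and it assumes each user comes with a ranked list of predicted items and a held-out set of relevant items.

```scala
object NdcgSketch {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // NDCG@k with binary relevance: an item is relevant iff it is in `relevant`.
  def ndcgAt(predicted: Seq[Int], relevant: Set[Int], k: Int): Double =
    if (relevant.isEmpty || k <= 0) 0.0
    else {
      // Discounted gain of the hits among the first k predictions (rank is 0-based).
      val dcg = predicted.take(k).zipWithIndex.collect {
        case (item, rank) if relevant.contains(item) => 1.0 / log2(rank + 2.0)
      }.sum
      // Ideal DCG: all relevant items (up to k) placed at the top of the list.
      val idealHits = math.min(k, relevant.size)
      val idcg = (0 until idealHits).map(rank => 1.0 / log2(rank + 2.0)).sum
      dcg / idcg
    }

  // Mean NDCG@k; each element pairs a ranked list with the held-out relevant set.
  def meanNdcgAt(perUser: Seq[(Seq[Int], Set[Int])], k: Int): Double =
    if (perUser.isEmpty) 0.0
    else perUser.map { case (pred, rel) => ndcgAt(pred, rel, k) }.sum / perUser.size
}
```

A cross-validation loop would compute this mean on each held-out fold and pick the parameter setting with the highest value, which requires one ranked list per user rather than one prediction per rating row.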




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195702#comment-15195702
 ] 

Nick Pentreath commented on SPARK-13857:


Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setTopKCol("userId").transform(users)
{code}
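
The same idea in plain Scala (no Spark), as a sketch only: restricting recommendations to a subset of users amounts to filtering the input before scoring. The `User` case class, the `score` function and `topKFor` are hypothetical stand-ins, not model API.

```scala
object SubsetTopK {
  final case class User(id: Int, age: Int)

  // Hypothetical scorer standing in for the model's predicted rating.
  def score(userId: Int, itemId: Int): Double =
    ((userId * 7 + itemId * 3) % 10).toDouble

  // Rank all candidate items for each given user and keep the k best.
  def topKFor(users: Seq[User], items: Seq[Int], k: Int): Map[Int, Seq[Int]] =
    users.map(u => u.id -> items.sortBy(i => -score(u.id, i)).take(k)).toMap
}
```

Calling `topKFor(users.filter(_.age > 21), items, 10)` scores only the users of interest, so no separate subset-specific method is needed.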




[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696
 ] 

Nick Pentreath commented on SPARK-13857:


There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{predictTopK}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and save the resulting DF - so 
perhaps not that important.

e.g.
{code}
import org.apache.spark.ml.recommendation.ALS

val model = new ALS().fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each (user, item) combination
val predictions = model.transform(df)

// Option 1: proposed params on transform
val topKViaTransform = model.setK(10).setTopKCol("userId").transform(df)

// Option 2: proposed new method
val topKViaMethod = model.predictTopK("userId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit more neatly 
into the {{Transformer}} API, even though it's a little more clunky.
