[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

MLnick Mon, 06 Mar 2017 23:54:40 -0800

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    I commented further on the 
[JIRA](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15898855&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898855).
    
    Sorry if my other comments here and on JIRA were unclear. The proposed 
schema for input to `RankingEvaluator` is:
    
    ### Schema 1
    ```
    +------+-------+------+----------+
    |userId|movieId|rating|prediction|
    +------+-------+------+----------+
    |   230|    318|   5.0| 4.2403245|
    |   230|   3424|   4.0|      null|
    |   230|  81191|  null|  4.317455|
    +------+-------+------+----------+
    ```
    
    You will notice that the `rating` and `prediction` columns can be `null`. 
This is by design. There are three cases shown above:
    1. 1st row indicates a (user-item) pair that occurs in *both* the 
ground-truth set *and* the top-k predictions;
    2. 2nd row indicates a (user-item) pair that occurs in the ground-truth 
set, *but not* in the top-k predictions;
    3. 3rd row indicates a (user-item) pair that occurs in the top-k 
predictions, *but not* in the ground-truth set.
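    
    The three cases above amount to a full outer join of the ground-truth set 
and the top-k predictions on the (user, item) key. Below is a minimal sketch of 
that idea using plain Scala collections rather than DataFrames. The `Row` case 
class, the `build` helper, and the use of `Option` in place of `null` are 
assumptions made for illustration, not part of the proposal or of any Spark API.
    
    ```scala
    object Schema1Sketch {
      // Hypothetical row type mirroring Schema 1; None stands in for null.
      final case class Row(userId: Int, movieId: Int,
                           rating: Option[Double], prediction: Option[Double])

      // Full outer join of ground truth and top-k predictions on (user, item).
      def build(truth: Map[(Int, Int), Double],
                topK: Map[(Int, Int), Double]): Seq[Row] =
        (truth.keySet ++ topK.keySet).toSeq.sorted.map { case (u, i) =>
          Row(u, i, truth.get((u, i)), topK.get((u, i)))
        }

      def main(args: Array[String]): Unit = {
        val truth = Map((230, 318) -> 5.0, (230, 3424) -> 4.0)
        val topK  = Map((230, 318) -> 4.2403245, (230, 81191) -> 4.317455)
        // Prints one row per (user, item) pair seen in either input.
        build(truth, topK).foreach(println)
      }
    }
    ```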
    
    _Note_: for reference, the input to the current `mllib` `RankingMetrics` is:
    
    ### Schema 2
    ```
    RDD[(true labels array, predicted labels array)],
    i.e.
    RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
    ```
    
    (So actually neither of the above schemas is easily compatible with the 
return schema here - but I think it is not really necessary to match the 
`mllib.RankingMetrics` format.)
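    
    For comparison, Schema 1 rows can be reduced to the Schema 2 shape by 
grouping per user. The sketch below again uses plain Scala collections. The 
`Row` type and `toLabelArrays` helper are hypothetical names, and ordering the 
predicted items by descending score is an assumption about how the top-k list 
would be ranked.
    
    ```scala
    object Schema2Sketch {
      // Hypothetical row type mirroring Schema 1; None stands in for null.
      final case class Row(userId: Int, movieId: Int,
                           rating: Option[Double], prediction: Option[Double])

      // Per user: (items that have a rating,
      //            items that have a prediction, highest score first).
      def toLabelArrays(rows: Seq[Row]): Map[Int, (Array[Int], Array[Int])] =
        rows.groupBy(_.userId).map { case (user, rs) =>
          val truth = rs.filter(_.rating.isDefined).map(_.movieId).toArray
          val predicted = rs
            .collect { case Row(_, item, _, Some(score)) => (item, score) }
            .sortBy { case (_, score) => -score }
            .map(_._1)
            .toArray
          (user, (truth, predicted))
        }
    }
    ```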
    
    ### ALS cross-validation
    
    My proposal for fitting ALS into cross-validation is that 
`ALSModel.transform` will output a DF of **Schema 1** *only* when the 
parameters `k` and `recommendFor` are appropriately set and the input DF 
contains both `user` and `item` columns. In practice, this scenario will occur 
during cross-validation only. 
    
    So what I am saying is that ALS itself (not the evaluator) must know how to 
return the correct DataFrame output from `transform` such that it can be used 
in a cross-validation as input to the `RankingEvaluator`.
    
    __Concretely:__
    ```scala
    val als = new ALS().setRecommendFor("user").setK(10)
    val validator = new TrainValidationSplit()
      .setEvaluator(new RankingEvaluator().setK(10))
      .setEstimator(als)
      .setEstimatorParamMaps(...)
    val bestModel = validator.fit(ratings)
    ```
    
    So while it is complex under the hood - to users it's simply a case of 
setting 2 params and the rest is as normal.
    
    Now, we have the best model selected by cross-validation. We can make 
recommendations using these convenience methods (I think it will need a cast):
    
    ```scala
    val recommendations =
      bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
    ```
    
    Alternatively, the `transform` version looks like this:
    ```scala
    val usersDF = ...
    // usersDF.show() would print:
    // +------+
    // |userId|
    // +------+
    // |     1|
    // |     2|
    // |     3|
    // +------+
    val recommendations = bestModel.transform(usersDF)
    ```
    
    So the questions:
    1. Should we support the above `transform`-based recommendations? Or only 
support it for cross-validation purposes as a special case?
    2. If we do, what should the output schema of the above `transform` version 
look like? It must certainly match the output of the `recommendX` methods.
    
    The options are:
    
    (1) The schema in this PR: 
    **Pros**: as you mention above - also more "compact"
    **Cons**: doesn't match up so closely with the `transform` 
"cross-validation" schema above
    
    (2) The schema below. It is basically an "exploded" version of option (1)
    
    ```
    +------+-------+----------+
    |userId|movieId|prediction|
    +------+-------+----------+
    |     1|      1|       4.3|
    |     1|      5|       3.2|
    |     1|      9|       2.1|
    +------+-------+----------+
    ```
    
    **Pros**: matches more closely with the cross-validation / evaluator input 
format. Perhaps slightly more "dataframe-like".
    **Cons**: less compact; lose ordering?; may require more munging to save to 
external data stores etc. 
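    
    To make the trade-off concrete, here is a small sketch of moving between 
the two shapes with plain Scala collections. It assumes the compact form of 
option (1) is one entry per user holding an ordered list of (item, score) 
pairs. The type aliases and the `explode` / `regroup` helpers are hypothetical. 
Note that regrouping re-sorts by score, since exploded rows in a DataFrame would 
carry no guaranteed order.
    
    ```scala
    object ExplodeSketch {
      // Assumed compact form: one entry per user, ordered (item, score) pairs.
      type Compact = Seq[(Int, Seq[(Int, Double)])]
      // Exploded form: one (user, item, score) row per recommendation.
      type Exploded = Seq[(Int, Int, Double)]

      // Option (1) -> option (2): flatten each user's list into rows.
      def explode(compact: Compact): Exploded =
        for ((user, recs) <- compact; (item, score) <- recs)
          yield (user, item, score)

      // Option (2) -> option (1): group by user and restore the ranking
      // by sorting on descending score.
      def regroup(rows: Exploded): Map[Int, Seq[(Int, Double)]] =
        rows.groupBy(_._1).map { case (user, rs) =>
          (user, rs.map { case (_, item, score) => (item, score) }.sortBy(-_._2))
        }
    }
    ```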
    
    Anyway sorry for hijacking this PR discussion - but as I think you can see, 
the evaluator / ALS transform interplay is a bit subtle and requires some 
thought to get the right approach.


