Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/8112#issuecomment-209806115
  
    @sethah could you take a look at the discussion in 
[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489), 
[SPARK-14409](https://issues.apache.org/jira/browse/SPARK-14409) and 
[SPARK-13857](https://issues.apache.org/jira/browse/SPARK-13857) as it might 
relate to this PR?
    
    Essentially, here we want to sample by class label. When evaluating a 
recommender, we may wish to sample the data set by, say, the user id column, so 
that the ratings for each user are distributed across the folds (or the 
train/test split).
    
    From a quick pass, it looks like this will work for that use case, except 
for places such as 
https://github.com/apache/spark/pull/8112/files#diff-0069187abc8ca287bd4000fadc6f5de6R298
 because one might have many millions of users. So we would need a way to 
simply use the same sampling ratio for every key (which seems to be what you're 
actually doing in `CrossValidator` and `TrainValidationSplit`?)
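
    To make the recommender use case concrete, here is a rough sketch in plain 
Python (not Spark, and not the code in this PR). The function name 
`assign_folds_by_key` and the `(user, item, rating)` tuple layout are made up 
for illustration. It spreads each user's ratings across the folds without 
needing a per-key fraction map, which is the property that matters once there 
are millions of keys.

```python
import random
from collections import defaultdict


def assign_folds_by_key(rows, key, num_folds, seed=0):
    """Spread the rows of every key group across `num_folds` folds.

    Hypothetical helper for illustration only: all keys share the same
    implicit ratio (1 / num_folds), so no per-key fraction map is built.
    """
    rng = random.Random(seed)

    # Group rows by key (e.g. by user id).
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    folds = [[] for _ in range(num_folds)]
    for group in groups.values():
        # Shuffle within the group, then deal rows out round-robin.
        rng.shuffle(group)
        # A random starting fold avoids always favouring fold 0
        # when a group's size is not a multiple of num_folds.
        offset = rng.randrange(num_folds)
        for i, row in enumerate(group):
            folds[(i + offset) % num_folds].append(row)
    return folds


# Example: 3 users with 6 ratings each, split into 3 folds.
ratings = [(user, item, 1.0) for user in range(3) for item in range(6)]
folds = assign_folds_by_key(ratings, key=lambda r: r[0], num_folds=3)
```

    Note this sketch groups everything in memory, so it only shows the intended 
semantics; a distributed version would have to get the same effect without 
collecting the groups or the set of keys on the driver.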

