Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/8112#issuecomment-209806115

@sethah could you take a look at the discussion in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489), [SPARK-14409](https://issues.apache.org/jira/browse/SPARK-14409) and [SPARK-13857](https://issues.apache.org/jira/browse/SPARK-13857), as it might relate to this PR?

Essentially, here we want to sample by class label. When evaluating a recommender, we may wish to sample the data set by, say, the user id column, so that the ratings for each user are distributed across the folds (or the train/test split). From a quick pass, it looks like this will work for that use case, except for e.g. https://github.com/apache/spark/pull/8112/files#diff-0069187abc8ca287bd4000fadc6f5de6R298, because one might have many millions of users. So we would need a way to simply apply the same sampling ratio to each key (which, it seems, is actually what you're doing in `CrossValidator` and `TrainValidationSplit`?)
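
To make the recommender use case concrete, here is a minimal plain-Python sketch of spreading each user's ratings across folds with the same ratio for every key, so that no per-key map of fractions is needed. It is not the PR's implementation and does not use the Spark API; the function name `assign_folds` and the `(user_id, item_id, rating)` tuple layout are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def assign_folds(ratings, num_folds, seed=42):
    """Spread each user's ratings across folds as evenly as possible.

    ratings: iterable of (user_id, item_id, rating) tuples (hypothetical layout).
    Returns a list of (fold_index, rating_tuple) pairs.
    """
    # Group the ratings by the key we want to stratify on (the user id).
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    rng = random.Random(seed)
    assigned = []
    for user in sorted(by_user):  # sorted so the result is reproducible
        rows = by_user[user]
        rng.shuffle(rows)
        # A random starting fold per user keeps fold 0 from collecting
        # the ratings of every user who has fewer ratings than folds.
        start = rng.randrange(num_folds)
        for i, row in enumerate(rows):
            assigned.append(((start + i) % num_folds, row))
    return assigned
```

Every user is handled with the same rule, so the per-fold share is roughly `1 / num_folds` for each key regardless of how many distinct users exist. This sketch groups in memory; a distributed version would need a different grouping strategy.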