GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/14640
[SPARK-17055] add labelKFold to CrossValidator

## What changes were proposed in this pull request?

This patch improves CrossValidator by adding a new training/validation split method, labelKFold, which splits data based on data labels and makes sure that the same label is not in both the testing and training sets. This is necessary, for example, when data is gathered from different subjects: testing and training on different subjects (i.e., learning cat-specific features) can avoid over-fitting.

## How was this patch tested?

A unit test was added to MLUtilsSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark labelKFold2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14640.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14640

----
commit cbb78bce4022bfc46f570264de4087a01a84b281
Author: Vincent Xie <vincent....@intel.com>
Date:   2016-08-08T13:28:08Z

    Add labelKFold to cross validation

    Currently, only KFold is supported in cross validation. But there are cases when data is gathered from different subjects and we want to avoid over-fitting. labelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets.

    A unit test, 'test labelKFold', is also added to MLUtilsSuite.

    Signed-off-by: Vincent Xie <vincent....@intel.com>
    Signed-off-by: VinceShieh <vincent....@intel.com>

commit 461d696aa6aa41818be31dc1628e3282e560854a
Author: VinceShieh <vincent....@intel.com>
Date:   2016-08-15T01:53:51Z

    Merge remote-tracking branch 'origin/master' into labelKFold2

----
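
For readers unfamiliar with the idea, the split described in this pull request can be illustrated with a minimal, library-free Python sketch. It is not the code in the patch and does not use any Spark API: the function name `label_k_fold`, its parameters, and the greedy balancing rule (largest label group goes to the currently smallest fold) are assumptions chosen for illustration. The only property taken from the description is that no label appears in both the training and the testing side of a fold.

```python
from collections import Counter


def label_k_fold(labels, num_folds):
    """Yield (train_indices, test_indices) pairs for a label-aware k-fold.

    Hypothetical helper for illustration only; not the API added by the patch.
    Every row sharing a label lands in the same fold, so a label never
    appears on both the training and the testing side of a split.
    """
    counts = Counter(labels)
    if num_folds < 2:
        raise ValueError("num_folds must be at least 2")
    if num_folds > len(counts):
        raise ValueError("need at least as many distinct labels as folds")

    fold_sizes = [0] * num_folds
    fold_of = {}
    # Assumed balancing rule: place the largest label groups first,
    # each into whichever fold currently holds the fewest rows.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for label, size in ordered:
        target = min(range(num_folds), key=lambda f: fold_sizes[f])
        fold_of[label] = target
        fold_sizes[target] += size

    for fold in range(num_folds):
        test = [i for i, lab in enumerate(labels) if fold_of[lab] == fold]
        train = [i for i, lab in enumerate(labels) if fold_of[lab] != fold]
        yield train, test


if __name__ == "__main__":
    subjects = ["a", "a", "b", "b", "b", "c", "d", "d"]
    for train_idx, test_idx in label_k_fold(subjects, 3):
        print("train:", train_idx, "test:", test_idx)
```

Here each entry of `subjects` plays the role of the label (for example, the subject a row was collected from). A plain k-fold would shuffle rows without regard to that label, so rows from one subject could end up on both sides of a split.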