GitHub user VinceShieh opened a pull request:

    https://github.com/apache/spark/pull/14640

    [SPARK-17055] add labelKFold to CrossValidator

    ## What changes were proposed in this pull request?
    
    This patch improves the CrossValidator by adding a new training/validation
    split method, labelKFold, which splits data by label and ensures that the
    same label never appears in both the training and testing sets.
    
    This is necessary, for example, when data is gathered from different
    subjects. Training and testing on different subjects keeps the model from
    learning subject-specific features (e.g., features of one particular cat),
    which helps avoid over-fitting.
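
A minimal sketch of the label-based split described above, using plain Scala collections rather than Spark's RDD or Dataset API. The object name `LabelKFoldSketch`, the `labelKFold` signature, and the round-robin assignment of labels to folds are assumptions made for illustration; the actual method added by this patch may differ.

```scala
// Illustrative only: hypothetical names, plain collections, not the patch's code.
object LabelKFoldSketch {

  /** Builds `numFolds` (training, validation) pairs such that all records
    * sharing a label end up on the same side of every split.
    * `labelOf` extracts the grouping label (e.g. a subject id) from a record.
    */
  def labelKFold[T, L](data: Seq[T], numFolds: Int)(
      labelOf: T => L): Seq[(Seq[T], Seq[T])] = {
    require(numFolds >= 2, "numFolds must be at least 2")
    // Distinct labels in order of first appearance.
    val labels = data.map(labelOf).distinct
    require(labels.size >= numFolds, "need at least numFolds distinct labels")

    // Assumed strategy: assign each distinct label to a fold round-robin.
    val foldOf: Map[L, Int] =
      labels.zipWithIndex.map { case (label, i) => label -> (i % numFolds) }.toMap

    (0 until numFolds).map { fold =>
      // Records whose label belongs to this fold are held out for validation.
      val (validation, training) =
        data.partition(record => foldOf(labelOf(record)) == fold)
      (training, validation)
    }
  }
}
```

For records of the form `(subjectId, value)`, calling `LabelKFoldSketch.labelKFold(records, 3)(_._1)` would yield three splits in which no subject id appears in both the training and the validation part.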
    
    ## How was this patch tested?
    
    A unit test was added to MLUtilsSuite.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark labelKFold2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14640
    
----
commit cbb78bce4022bfc46f570264de4087a01a84b281
Author: Vincent Xie <vincent....@intel.com>
Date:   2016-08-08T13:28:08Z

    Add labelKFold to cross validation
    
    Currently, only KFold is supported in cross validation. However, when
    data is gathered from different subjects, we want to avoid over-fitting
    to individual subjects. labelKFold is a variation of k-fold which ensures
    that the same label is not in both testing and training sets.
    
    A unit test, 'test labelKFold', is also added in MLUtilsSuite.
    
    Signed-off-by: Vincent Xie <vincent....@intel.com>
    Signed-off-by: VinceShieh <vincent....@intel.com>

commit 461d696aa6aa41818be31dc1628e3282e560854a
Author: VinceShieh <vincent....@intel.com>
Date:   2016-08-15T01:53:51Z

    Merge remote-tracking branch 'origin/master' into labelKFold2

----


