[jira] [Commented] (MADLIB-1119) Train-test split

ASF GitHub Bot (JIRA) Mon, 14 Aug 2017 10:33:45 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126041#comment-16126041
 ]


ASF GitHub Bot commented on MADLIB-1119:
----------------------------------------

GitHub user cooper-sloan opened a pull request:

    https://github.com/apache/incubator-madlib/pull/166

    Sample: test_train_split

    JIRA: MADLIB-1119
    
    Add utility to sample test and train
    data from an input table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cooper-sloan/incubator-madlib test_train_split

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #166
    
----
commit beafde5d3d218ba1d0b9f95843ee2186eb621edb
Author: Cooper Sloan <cooper.sl...@gmail.com>
Date:   2017-08-11T22:19:43Z

    Sample: test_train_split
    
    JIRA: MADLIB-1119
    
    Add utility to sample test and train
    data from an input table.

----


> Train-test split
> ----------------
>
>                 Key: MADLIB-1119
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1119
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>             Fix For: v1.12
>
>
> Context
> See related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets, 
> including grouping support, so that I can use the result sets for model 
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate 
> tables
> Proposed Interface
> {code}
> train_test_split ( 
>                                    source_table,    
>                                    output_table,
>                                    train_proportion,
>                                    test_proportion, -- optional
>                                    grouping_col, -- optional
>                                    with_replacement, -- optional
>                                    target_cols, -- optional
>                                    separate_output_tables -- optional
>                                 )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table.   A new INTEGER column on the right 
> called 'split' will identify 1 for train set and 0 for test set,
> unless the 'separate_output_tables' parameter below is TRUE, 
> in which case two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> The output table contains all the  columns present in the source 
> table unless otherwise specified  in the 'target_cols' parameter below. 
> train_proportion
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the train split.  If the 'grouping_col' parameter is specified below, 
> each group will be sampled independently using the 
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the test split.  Default is the complement to the train
> proportion (1-'train_proportion').  If the 'grouping_col' 
> parameter is specified below,  each group will be sampled 
> independently using the test proportion, 
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> the train-test split is not stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
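
To make the proposed semantics concrete, here is a minimal in-memory sketch in Python. It is not the MADlib implementation and not the code in the pull request: the function below only mirrors the parameter names from the proposal, uses a list of dicts in place of a table, covers sampling without replacement only, and rounds per-group sizes to the nearest row (the rounding rule is an assumption, since the proposal does not specify one).

```python
import random
from collections import defaultdict


def train_test_split(rows, train_proportion, test_proportion=None,
                     grouping_col=None, target_cols=None, seed=None):
    """Hypothetical analogue of the proposed interface (illustration only).

    rows is a list of dicts standing in for the source table. The result
    is a list of dicts with an extra 'split' key: 1 for train, 0 for test.
    Sampling is without replacement.
    """
    if test_proportion is None:
        # Default from the proposal: the complement of the train proportion.
        test_proportion = 1.0 - train_proportion
    rng = random.Random(seed)

    # One bucket per group; a single bucket when grouping_col is None.
    groups = defaultdict(list)
    for row in rows:
        key = row[grouping_col] if grouping_col is not None else None
        groups[key].append(row)

    output = []
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Assumed rounding rule: nearest row, capped at the group size.
        n_train = round(len(shuffled) * train_proportion)
        n_test = min(round(len(shuffled) * test_proportion),
                     len(shuffled) - n_train)
        # Train and test are disjoint slices of the same shuffle.
        labelled = ([(r, 1) for r in shuffled[:n_train]] +
                    [(r, 0) for r in shuffled[n_train:n_train + n_test]])
        for row, split in labelled:
            cols = target_cols if target_cols is not None else list(row)
            out = {c: row[c] for c in cols}
            out["split"] = split
            output.append(out)
    return output


if __name__ == "__main__":
    table = [{"id": i, "g": "a" if i < 10 else "b"} for i in range(20)]
    for r in train_test_split(table, 0.6, 0.2, grouping_col="g", seed=0):
        print(r)
```

Filtering the result on split = 1 and split = 0 would give the equivalent of the two '_train' and '_test' tables described for 'separate_output_tables'.
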
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [1]. 
>  
> 2) From Rahul Iyer: "The goal of having both train and test is to provide 
> subsample and train/test split in one function. 
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input 
> data will be output. This is tremendously useful in situations where a user 
> wants to prototype/evaluate a couple of models on smaller iid data before 
> running it on whole dataset. 
> Under no circumstances would the train_size + test_size be allowed to be more 
> than 1. The implementation will also ensure that there are no "leaks" (leak = 
> same data occurring in both train and test) as that defeats the whole purpose 
> of building an independent dataset for model evaluation. 
> Of course, the interface does get a little complex and could confuse users. 
> Explanatory documentation with examples is the only solution to that problem. 
> The alternative to having both sizes in one function is to run a subsample 
> function (using various sampling methods) and then perform the train_test 
> split. The downside to this approach is it requires writing an intermediate 
> table to disk (inefficient). "
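
The two constraints in the note above (train and test sizes summing to at most 1, and no row appearing in both sets) can be sketched in a few lines of Python. This is an illustration under assumptions, not the planned implementation: the function name, the index-based representation, and the small floating-point tolerance are all hypothetical. Taking both sets as disjoint slices of a single shuffle is one simple way to rule out leaks when sampling without replacement.

```python
import random


def subsample_split(n_rows, train_size, test_size, seed=None):
    """Return disjoint (train, test) lists of row indices (hypothetical helper)."""
    if not (0 < train_size < 1 and 0 < test_size < 1):
        raise ValueError("sizes must lie in (0, 1)")
    # Small tolerance (an assumption) so that pairs such as 0.7 + 0.3 are
    # not rejected because of floating-point error.
    if train_size + test_size > 1 + 1e-9:
        raise ValueError("train_size + test_size must not exceed 1")

    order = list(range(n_rows))
    random.Random(seed).shuffle(order)

    n_train = round(n_rows * train_size)
    n_test = min(round(n_rows * test_size), n_rows - n_train)
    # Both sets come from one shuffle, so no index can land in both.
    return order[:n_train], order[n_train:n_train + n_test]


train, test = subsample_split(100, 0.4, 0.1, seed=42)
print(len(train), len(test))  # 40 10
```

With train_size = 0.4 and test_size = 0.1 on 100 rows, 50 rows are returned in total, matching the "only half the data" example in the note.
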
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported dbs.
> References
> [1] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html
> [2] Related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> [3] General
> https://en.wikipedia.org/wiki/Test_set
> [4] scikit-learn
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
