[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126041#comment-16126041 ]
ASF GitHub Bot commented on MADLIB-1119:
----------------------------------------

GitHub user cooper-sloan opened a pull request:

    https://github.com/apache/incubator-madlib/pull/166

    Sample: test_train_split

    JIRA: MADLIB-1119

    Add utility to sample test and train data from an input table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cooper-sloan/incubator-madlib test_train_split

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #166

----
commit beafde5d3d218ba1d0b9f95843ee2186eb621edb
Author: Cooper Sloan <cooper.sl...@gmail.com>
Date:   2017-08-11T22:19:43Z

    Sample: test_train_split

    JIRA: MADLIB-1119

    Add utility to sample test and train data from an input table.

----

> Train-test split
> ----------------
>
>                 Key: MADLIB-1119
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1119
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>             Fix For: v1.12
>
>
> Context
> See related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets
> including grouping support, so that I can use the result sets for model
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate
> tables
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,        -- optional
>     grouping_col,           -- optional
>     with_replacement,       -- optional
>     target_cols,            -- optional
>     separate_output_tables  -- optional
> )
> source_table
>     TEXT. The name of the table containing the input data.
> output_table
>     TEXT. Name of output table. A new INTEGER column on the right
>     called 'split' will identify 1 for train set and 0 for test set,
>     unless the 'separate_output_tables' parameter below is TRUE,
>     in which case two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
>     The output table contains all the columns present in the source
>     table unless otherwise specified in the 'target_cols' parameter below.
> train_proportion
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the train split. If the 'grouping_col' parameter is specified below,
>     each group will be sampled independently using the
>     train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the test split. Default is the complement to the train
>     proportion (1-'train_proportion'). If the 'grouping_col'
>     parameter is specified below, each group will be sampled
>     independently using the train proportion,
>     i.e., in a stratified fashion.
> grouping_col (optional)
>     TEXT, default: NULL. A single column or a list of comma-separated columns
>     that define how to stratify. When this parameter is NULL,
>     the train-test split is not stratified.
> with_replacement (optional)
>     BOOLEAN, default FALSE.
>     Determines whether to sample with replacement
>     or without replacement (default).
> target_cols (optional)
>     TEXT, default NULL. A comma-separated list of columns to appear in the
>     'output_table'.
>     If NULL, all columns from the 'source_table' will appear in the
>     'output_table'.
> separate_output_tables (optional)
>     BOOLEAN, default FALSE. If TRUE, two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [1].
>
> 2) From Rahul Iyer: "The goal of having both train and test is to provide
> subsample and train/test split in one function.
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input
> data will be output. This is tremendously useful in situations where a user
> wants to prototype/evaluate a couple of models on smaller iid data before
> running it on the whole dataset.
> Under no circumstances would the train_size + test_size be allowed to be more
> than 1. The implementation will also ensure that there are no "leaks" (leak =
> same data occurring in both train and test) as that defeats the whole purpose
> of building an independent dataset for model evaluation.
> Of course, the interface does get a little complex and could confuse users.
> Explanatory documentation with examples is the only solution to that problem.
> The alternative to having both sizes in one function is to run a subsample
> function (using various sampling methods) and then perform the train_test
> split. The downside to this approach is that it requires writing an
> intermediate table to disk (inefficient)."
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported dbs.
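
To make the sampling semantics in the notes above concrete, here is a
minimal Python sketch of a split without replacement. It is only an
illustration, not the MADlib implementation from the pull request: the
function name sketch_train_test_split, its parameters, and the rounding
rule for per-group sizes are all assumptions. It enforces the rule that
train and test proportions may not sum to more than 1, samples each group
independently when a grouping key is given, and takes train and test rows
from disjoint slices of one shuffle so that no row can leak into both sets.

```python
import random
from collections import defaultdict


def sketch_train_test_split(rows, train_proportion, test_proportion=None,
                            grouping_key=None, seed=None):
    """Hypothetical sketch: split rows into (train, test) without replacement.

    grouping_key is an optional function mapping a row to its group; when
    given, every group is sampled independently (stratified split).
    """
    if test_proportion is None:
        # Default is the complement of the train proportion.
        test_proportion = 1.0 - train_proportion
    if not (0.0 < train_proportion < 1.0 and 0.0 < test_proportion < 1.0):
        raise ValueError("proportions must lie in the open interval (0, 1)")
    # Small tolerance so that 0.7 + (1 - 0.7) is not rejected by rounding error.
    if train_proportion + test_proportion > 1.0 + 1e-9:
        raise ValueError("train_proportion + test_proportion must not exceed 1")

    rng = random.Random(seed)

    # Bucket rows by group; a single bucket when no grouping key is given.
    groups = defaultdict(list)
    for row in rows:
        key = grouping_key(row) if grouping_key is not None else None
        groups[key].append(row)

    train, test = [], []
    for members in groups.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        n_train = int(round(train_proportion * len(shuffled)))
        # Never take more test rows than remain after the train slice.
        n_test = min(int(round(test_proportion * len(shuffled))),
                     len(shuffled) - n_train)
        # Disjoint slices of the same shuffle: no row lands in both sets.
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:n_train + n_test])
    return train, test
```

With train_proportion = 0.4 and test_proportion = 0.1, as in the quoted
example, only half of the input rows are returned, split between the two
sets. Sampling with replacement would need a different approach and is not
covered by this sketch.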
> References
> [1] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html
> [2] Related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> [3] General
> https://en.wikipedia.org/wiki/Test_set
> [4] scikit-learn
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)