[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135498#comment-16135498 ]
Frank McQuillan commented on MADLIB-1119:
-----------------------------------------

3) The user doc changes that I attached above have not been incorporated.

4) The left panel of the user docs is still called "Test Train Split" but should be called "Train-Test Split". Also, in the main page file:
* v1.11 is missing from http://madlib.incubator.apache.org/docs/latest/
* the v1.10.0 link does not work

5) Train and test proportions are reversed in the case where `separate_output_tables = FALSE`, or are the user docs wrong?

6) For the same train and test proportions with no stratification, why are the counts different (i.e., 2 vs 3)?
{code}
DROP TABLE IF EXISTS out;
SELECT madlib.train_test_split(
    'test',      -- Source table
    'out',       -- Output table
    0.1,         -- Train proportion
    0.1,         -- Test proportion
    NULL,        -- Strata definition
    'id1,id2',   -- Columns to output
    FALSE,       -- Sample without replacement
    FALSE);      -- No separate output tables
SELECT * FROM out;
{code}
produces
{code}
 id1 | id2 | split
-----+-----+-------
  70 |  70 |     0
  10 |  10 |     1
  60 |  60 |     1
   9 |   0 |     0
   9 |   0 |     0
{code}

7) Please do some detailed testing on the functionality of this module.

> Train-test split
> ----------------
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: Orhan Kislal
> Fix For: v1.12
>
> Attachments: test_train_split.sql_in
>
>
> Context
> See related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets
> including grouping support, so that I can use the result sets for model
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate
> tables
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,        -- optional
>     grouping_col,           -- optional
>     with_replacement,       -- optional
>     target_cols,            -- optional
>     separate_output_tables  -- optional
> )
> source_table
>     TEXT. The name of the table containing the input data.
> output_table
>     TEXT. Name of output table. A new INTEGER column on the right
>     called 'split' will identify 1 for train set and 0 for test set,
>     unless the 'separate_output_tables' parameter below is TRUE,
>     in which case two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
>     The output table contains all the columns present in the source
>     table unless otherwise specified in the 'target_cols' parameter below.
> train_proportion
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the train split. If the 'grouping_col' parameter is specified below,
>     each group will be sampled independently using the
>     train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the test split. Default is the complement to the train
>     proportion (1-'train_proportion'). If the 'grouping_col'
>     parameter is specified below, each group will be sampled
>     independently using the test proportion,
>     i.e., in a stratified fashion.
> grouping_col (optional)
>     TEXT, default: NULL. A single column or a list of comma-separated columns
>     that defines how to stratify. When this parameter is NULL,
>     the train-test split is not stratified.
> with_replacement (optional)
>     BOOLEAN, default FALSE.
>     Determines whether to sample with replacement
>     or without replacement (default).
> target_cols (optional)
>     TEXT, default NULL. A comma-separated list of columns to appear in the
>     'output_table'.
>     If NULL, all columns from the 'source_table' will appear in the
>     'output_table'.
> separate_output_tables (optional)
>     BOOLEAN, default FALSE. If TRUE, two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [1].
>
> 2) From Rahul Iyer: "The goal of having both train and test is to provide
> subsample and train/test split in one function.
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input
> data will be output. This is tremendously useful in situations where a user
> wants to prototype/evaluate a couple of models on smaller iid data before
> running it on the whole dataset.
> Under no circumstances would the train_size + test_size be allowed to be more
> than 1. The implementation will also ensure that there are no "leaks" (leak =
> same data occurring in both train and test) as that defeats the whole purpose
> of building an independent dataset for model evaluation.
> Of course, the interface does get a little complex and could confuse users.
> Explanatory documentation with examples is the only solution to that problem.
> The alternative to having both sizes in one function is to run a subsample
> function (using various sampling methods) and then perform the train_test
> split. The downside to this approach is it requires writing an intermediate
> table to disk (inefficient)."
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported dbs.
> References
> [1] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html
> [2] Related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> [3] General
> https://en.wikipedia.org/wiki/Test_set
> [4] scikit-learn
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
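
One way to follow up on items 5 and 6 of the comment is to count rows per 'split' value and to look for rows that land in both splits. The queries below are only a sketch in plain PostgreSQL, not part of the module or its tests. They reuse the 'test' and 'out' tables and the 'id1', 'id2' and 'split' columns from the example in the comment. Treating (id1, id2) as a row identifier is an assumption, since the issue does not say whether that pair is unique in 'test'.
{code}
-- Size of the source table, to compare against the requested proportions.
SELECT count(*) AS n_source FROM test;

-- Rows per split value (per the proposed interface: 1 = train, 0 = test).
SELECT split, count(*) AS n_rows
FROM out
GROUP BY split
ORDER BY split;

-- Leak check: any (id1, id2) pair returned here occurs in both splits.
-- ASSUMPTION: (id1, id2) identifies a source row; the issue does not confirm this.
SELECT id1, id2
FROM out
GROUP BY id1, id2
HAVING count(DISTINCT split) > 1;
{code}
With equal train and test proportions and sampling without replacement, item 6 expects the two counts from the second query to match. The "no leaks" note in the issue description implies that the third query should return no rows, provided the identifier assumption holds.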