[jira] [Commented] (MADLIB-1146) Elastic Net fails when used without normalization with grouping
[ https://issues.apache.org/jira/browse/MADLIB-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133934#comment-16133934 ]

ASF GitHub Bot commented on MADLIB-1146:

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/172

> Elastic Net fails when used without normalization with grouping
> ---
>
> Key: MADLIB-1146
> URL: https://issues.apache.org/jira/browse/MADLIB-1146
> Project: Apache MADlib
> Issue Type: Bug
> Reporter: Cooper Sloan
> Assignee: Nandish Jayaram
> Priority: Minor
>
> ```
> DROP TABLE IF EXISTS house_en,house_en_summary;
> SELECT madlib.elastic_net_train(
>     'lin_housing_wi',
>     'house_en',
>     'y',
>     'x',
>     'gaussian',
>     0.5,
>     0.5,
>     False,
>     'grp_by_col',
>     'fista',
>     '',
>     NULL,
>     1,
>     1e-6
> );
> psql:/Users/csloan/elastic_net.sql:1: NOTICE: table "house_en" does not exist, skipping
> psql:/Users/csloan/elastic_net.sql:1: NOTICE: table "house_en_summary" does not exist, skipping
> DROP TABLE
> psql:/Users/csloan/elastic_net.sql:17: ERROR: KeyError: 'select_grp'
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "elastic_net_train", line 27, in <module>
>     excluded, max_iter, tolerance)
>   PL/Python function "elastic_net_train", line 467, in elastic_net_train
>   PL/Python function "elastic_net_train", line 502, in _internal_elastic_net_train
>   PL/Python function "elastic_net_train", line 24, in _elastic_net_gaussian_fista_train
>   PL/Python function "elastic_net_train", line 171, in _elastic_net_fista_train
>   PL/Python function "elastic_net_train", line 297, in _elastic_net_fista_train_compute
>   PL/Python function "elastic_net_train", line 83, in _elastic_net_generate_result
> PL/Python function "elastic_net_train"
> ```

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MADLIB-1119) Train-test split
[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133806#comment-16133806 ]

Orhan Kislal commented on MADLIB-1119:

Created a PR to address both of these issues:
https://github.com/apache/incubator-madlib/pull/174

> Train-test split
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: Orhan Kislal
> Fix For: v1.12
>
> Context
> See related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
>
> Story
> As a data scientist, I want to split a data table into training and test sets,
> including grouping support, so that I can use the result sets for model
> development in the usual way.
>
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate tables
>
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,          -- optional
>     grouping_col,             -- optional
>     with_replacement,         -- optional
>     target_cols,              -- optional
>     separate_output_tables    -- optional
> )
>
> source_table
>     TEXT. The name of the table containing the input data.
>
> output_table
>     TEXT. Name of output table. A new INTEGER column on the right
>     called 'split' will identify 1 for train set and 0 for test set,
>     unless the 'separate_output_tables' parameter below is TRUE,
>     in which case two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
>     The output table contains all the columns present in the source
>     table unless otherwise specified in the 'target_cols' parameter below.
>
> train_proportion
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the train split. If the 'grouping_col' parameter is specified below,
>     each group will be sampled independently using the
>     train proportion, i.e., in a stratified fashion.
>
> test_proportion (optional)
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the test split. Default is the complement to the train
>     proportion (1-'train_proportion'). If the 'grouping_col'
>     parameter is specified below, each group will be sampled
>     independently using the test proportion,
>     i.e., in a stratified fashion.
>
> grouping_col (optional)
>     TEXT, default: NULL. A single column or a list of comma-separated columns
>     that defines how to stratify. When this parameter is NULL,
>     the train-test split is not stratified.
>
> with_replacement (optional)
>     BOOLEAN, default FALSE. Determines whether to sample with replacement
>     or without replacement (default).
>
> target_cols (optional)
>     TEXT, default NULL. A comma-separated list of columns to appear in the
>     'output_table'. If NULL, all columns from the 'source_table' will
>     appear in the 'output_table'.
>
> separate_output_tables (optional)
>     BOOLEAN, default FALSE. If TRUE, two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
>
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2].
>
> 2) From Rahul Iyer: "The goal of having both train and test is to provide
> subsample and train/test split in one function.
> E.g., if train_size = 0.4 and test_size = 0.1, then only half the input
> data will be output. This is tremendously useful in situations where a user
> wants to prototype/evaluate a couple of models on smaller iid data before
> running it on the whole dataset.
> Under no circumstances would the train_size + test_size be allowed to be more
> than 1. The implementation will also ensure that there are no "leaks" (leak =
> same data occurring in both train and test) as that defeats the whole purpose
> of building an independent dataset for model evaluation.
> Of course, the interface does get a little complex and could confuse users.
> Explanatory documentation with examples is the only solution to that problem.
> The alternative to having both sizes in one function is to run a subsample
> function (using various sampling methods) and then perform the train_test
> split. The downside to this approach is it requires writing an intermediate
> table
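The sampling rules in the description above (per-group sampling, train + test never exceeding 1, no leaks) can be sketched in a few lines of Python. This is only an in-memory illustration, not MADlib's implementation: the function name mirrors the proposed interface, but the `rows`, `group_key` and `seed` arguments are assumptions made for the example, and only the without-replacement case is covered.

```python
import random
from collections import defaultdict


def train_test_split(rows, train_proportion, test_proportion=None,
                     group_key=None, seed=0):
    """Hypothetical in-memory analogue of the proposed interface.

    Samples without replacement. `group_key` is an optional callable
    that returns the stratification key for a row.
    """
    if test_proportion is None:
        # Default: the complement of the train proportion.
        test_proportion = 1.0 - train_proportion
    if not 0 < train_proportion < 1 or not 0 < test_proportion < 1:
        raise ValueError("proportions must be in the range (0, 1)")
    if train_proportion + test_proportion > 1 + 1e-12:
        raise ValueError("train + test proportions must not exceed 1")

    # Bucket rows by group; a single bucket when no grouping is requested.
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row) if group_key else None].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # The small epsilon guards against float error such as 28.999999...
        n_train = int(len(shuffled) * train_proportion + 1e-9)
        n_test = int(len(shuffled) * test_proportion + 1e-9)
        # Disjoint slices of one shuffled list: a row cannot land in both sets.
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:n_train + n_test])
    return train, test
```

With train_proportion = 0.4 and test_proportion = 0.1, each group contributes 40% and 10% of its rows and the remaining half is left out, which is the subsampling behaviour described in the notes.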
[jira] [Commented] (MADLIB-1073) Graph - Phase 1 measures
[ https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133772#comment-16133772 ]

ASF GitHub Bot commented on MADLIB-1073:

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/173

    Measures: Use outer join for in-out degrees computation

    JIRA: MADLIB-1073

    Commit 06788cc added the graph measure functions described in the JIRA.
    This commit fixes a bug from that commit in the graph_vertex_degrees
    function. The bug led to results not containing vertices that either had
    0 in-degree or out-degree.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib bugfix/in_out_degrees

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #173

commit f3697fdaaebeb851dfa23a0503c2c143c54f7f69
Author: Rahul Iyer
Date:   2017-08-18T23:19:39Z

    Measures: Use outer join for in-out degrees computation

    JIRA: MADLIB-1073

    Commit 06788cc added the graph measure functions described in the JIRA.
    This commit fixes a bug from that commit in the graph_vertex_degrees
    function. The bug led to results not containing vertices that either had
    0 in-degree or out-degree.

> Graph - Phase 1 measures
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Graph
> Reporter: Frank McQuillan
> Assignee: Rahul Iyer
> Fix For: v1.12
> Attachments: Graph Measures Interfaces - JIRA.pdf
>
> Follow on from https://issues.apache.org/jira/browse/MADLIB-1072. Given that
> this story is complete, what measures can we compute from APSP?
>
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
>
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
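To illustrate why the join type matters for the in/out degree fix described in the pull request, here is a small Python sketch. It is not the MADlib SQL: the function names and the toy graph are made up for the example. Keeping only vertices that appear in both the in-degree and out-degree counts behaves like an inner join and drops any vertex with a zero on one side, whereas driving the result from the vertex list behaves like an outer join.

```python
from collections import Counter


def degrees_inner(edges):
    """Inner-join shape: only vertices with both in- and out-edges survive."""
    indeg = Counter(dest for _, dest in edges)
    outdeg = Counter(src for src, _ in edges)
    return {v: (indeg[v], outdeg[v]) for v in indeg.keys() & outdeg.keys()}


def degrees_outer(vertices, edges):
    """Outer-join shape: every vertex is reported, missing counts become 0."""
    indeg = Counter(dest for _, dest in edges)
    outdeg = Counter(src for src, _ in edges)
    return {v: (indeg.get(v, 0), outdeg.get(v, 0)) for v in vertices}


# Hypothetical toy graph: vertex 3 has one incoming edge and no outgoing edge.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

print(sorted(degrees_inner(edges)))            # vertex 3 is absent
print(degrees_outer(vertices, edges)[3])       # (indegree, outdegree) of vertex 3
```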
[jira] [Resolved] (MADLIB-1073) Graph - Phase 1 measures
[ https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan resolved MADLIB-1073.

    Resolution: Fixed
[jira] [Reopened] (MADLIB-1119) Train-test split
[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan reopened MADLIB-1119:

> Train-test split
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: Orhan Kislal
> Fix For: v1.12
>
> Context
> See related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
>
> Story
> As a data scientist, I want to split a data table into training and test sets,
> including grouping support, so that I can use the result sets for model
> development in the usual way.
>
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate tables
>
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,          -- optional
>     grouping_col,             -- optional
>     with_replacement,         -- optional
>     target_cols,              -- optional
>     separate_output_tables    -- optional
> )
>
> source_table
>     TEXT. The name of the table containing the input data.
>
> output_table
>     TEXT. Name of output table. A new INTEGER column on the right
>     called 'split' will identify 1 for train set and 0 for test set,
>     unless the 'separate_output_tables' parameter below is TRUE,
>     in which case two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
>     The output table contains all the columns present in the source
>     table unless otherwise specified in the 'target_cols' parameter below.
>
> train_proportion
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the train split. If the 'grouping_col' parameter is specified below,
>     each group will be sampled independently using the
>     train proportion, i.e., in a stratified fashion.
>
> test_proportion (optional)
>     FLOAT8 in the range (0,1). Proportion of the dataset to include
>     in the test split. Default is the complement to the train
>     proportion (1-'train_proportion'). If the 'grouping_col'
>     parameter is specified below, each group will be sampled
>     independently using the test proportion,
>     i.e., in a stratified fashion.
>
> grouping_col (optional)
>     TEXT, default: NULL. A single column or a list of comma-separated columns
>     that defines how to stratify. When this parameter is NULL,
>     the train-test split is not stratified.
>
> with_replacement (optional)
>     BOOLEAN, default FALSE. Determines whether to sample with replacement
>     or without replacement (default).
>
> target_cols (optional)
>     TEXT, default NULL. A comma-separated list of columns to appear in the
>     'output_table'. If NULL, all columns from the 'source_table' will
>     appear in the 'output_table'.
>
> separate_output_tables (optional)
>     BOOLEAN, default FALSE. If TRUE, two output tables will be created using
>     the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
>
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2].
>
> 2) From Rahul Iyer: "The goal of having both train and test is to provide
> subsample and train/test split in one function.
> E.g., if train_size = 0.4 and test_size = 0.1, then only half the input
> data will be output. This is tremendously useful in situations where a user
> wants to prototype/evaluate a couple of models on smaller iid data before
> running it on the whole dataset.
> Under no circumstances would the train_size + test_size be allowed to be more
> than 1. The implementation will also ensure that there are no "leaks" (leak =
> same data occurring in both train and test) as that defeats the whole purpose
> of building an independent dataset for model evaluation.
> Of course, the interface does get a little complex and could confuse users.
> Explanatory documentation with examples is the only solution to that problem.
> The alternative to having both sizes in one function is to run a subsample
> function (using various sampling methods) and then perform the train_test
> split. The downside to this approach is it requires writing an intermediate
> table to disk (inefficient)."
>
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported
[jira] [Commented] (MADLIB-1119) Train-test split
[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133739#comment-16133739 ]

Frank McQuillan commented on MADLIB-1119:

Re-opening

1) Please rename this module train-test split, not test-train split, and
update user docs. This is what scikit-learn does and it is a more common
term in the industry.

2) For test_proportion:
* it should be an optional param (it is currently mandatory)
* the default should be the complement to the train proportion
  (1-'train_proportion'); this does not seem to be implemented

Thanks
[jira] [Commented] (MADLIB-413) Neural Networks - MLP - Phase 1
[ https://issues.apache.org/jira/browse/MADLIB-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133665#comment-16133665 ]

Frank McQuillan commented on MADLIB-413:

(2)
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text'      -- Label
);
{code}
works now.

On-line help looks good, though I did not examine all of the details.

> Neural Networks - MLP - Phase 1
> ---
>
> Key: MADLIB-413
> URL: https://issues.apache.org/jira/browse/MADLIB-413
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Neural Networks
> Reporter: Caleb Welton
> Assignee: Cooper Sloan
> Fix For: v1.12
> Attachments: mlp.sql_in, screenshot-1.png
>
> Multilayer perceptron with backpropagation
>
> Modules:
> * mlp_classification
> * mlp_regression
>
> Interface
> {code}
> source_table VARCHAR
> output_table VARCHAR
> independent_varname VARCHAR -- Column name for input features, should be a Real Valued array
> dependent_varname VARCHAR, -- Column name for target values, should be Real Valued array of size 1 or greater
> hidden_layer_sizes INTEGER[], -- Number of units per hidden layer (can be empty or null, in which case, no hidden layers)
> optimizer_params VARCHAR, -- Specified below
> weights VARCHAR, -- Column name for weights. Weights the loss for each input vector. Column should contain positive real value
> activation_function VARCHAR, -- One of 'sigmoid' (default), 'tanh', 'relu', or any prefix (eg. 't', 's')
> grouping_cols
> )
> {code}
> where
> {code}
> optimizer_params: -- eg "step_size=0.5, n_tries=5"
> {
> step_size DOUBLE PRECISION, -- Learning rate
> n_iterations INTEGER, -- Number of iterations per try
> n_tries INTEGER, -- Total number of training cycles, with random initializations to avoid local minima.
> tolerance DOUBLE PRECISION, -- Maximum distance between weights before training stops (or until it reaches n_iterations)
> }
> {code}
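The interface above passes the optimizer settings as one string ("step_size=0.5, n_tries=5") and accepts any prefix of an activation function name. A rough Python sketch of how such inputs could be interpreted follows. It is not taken from the MADlib source: the helper names are hypothetical, and the handling of ambiguous or unknown values is an assumption.

```python
ACTIVATIONS = ("sigmoid", "tanh", "relu")

# Parameter names and types as listed in the optimizer_params description.
PARAM_TYPES = {
    "step_size": float,
    "n_iterations": int,
    "n_tries": int,
    "tolerance": float,
}


def resolve_activation(name=None):
    """Map a full name or prefix ('t', 's', ...) to an activation function."""
    if not name:
        return "sigmoid"  # documented default
    prefix = name.strip().lower()
    matches = [a for a in ACTIVATIONS if a.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous activation: {name!r}")
    return matches[0]


def parse_optimizer_params(text):
    """Parse 'key=value, key=value' into a dict of typed values."""
    params = {}
    for item in (text or "").split(","):
        if not item.strip():
            continue
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or key not in PARAM_TYPES:
            raise ValueError(f"unrecognized optimizer parameter: {item.strip()!r}")
        params[key] = PARAM_TYPES[key](value.strip())
    return params
```

For example, parse_optimizer_params("step_size=0.5, n_tries=5") yields a dict with a float step_size and an integer n_tries, and resolve_activation("t") resolves to 'tanh'.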
[jira] [Closed] (MADLIB-1134) Neural Networks - MLP - Phase 2
[ https://issues.apache.org/jira/browse/MADLIB-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan closed MADLIB-1134.

> Neural Networks - MLP - Phase 2
> ---
>
> Key: MADLIB-1134
> URL: https://issues.apache.org/jira/browse/MADLIB-1134
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Neural Networks
> Reporter: Frank McQuillan
> Assignee: Cooper Sloan
> Fix For: v1.12
>
> Follow on from https://issues.apache.org/jira/browse/MADLIB-413
>
> Story
> As a MADlib developer, I want to get 2nd phase implementation of NN going
> with training and prediction functions, so that I can use this to build to an
> MVP version for GA.
>
> Features to add:
> * weights for inputs
> * logic for n_tries
> * normalize inputs
> * L2 regularization
> * learning rate policy
> * warm start
[jira] [Reopened] (MADLIB-1073) Graph - Phase 1 measures
[ https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan reopened MADLIB-1073:
[jira] [Commented] (MADLIB-1073) Graph - Phase 1 measures
[ https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133549#comment-16133549 ]

Frank McQuillan commented on MADLIB-1073:

Regarding https://github.com/apache/incubator-madlib/pull/152

Closeness, diameter and average path length look good.

In-out degree is dropping the last vertex. E.g., from the user docs:
{code}
DROP TABLE IF EXISTS degrees;
SELECT madlib.graph_vertex_degrees(
    'vertex',      -- Vertex table
    'id',          -- Vertex id column (NULL means use default naming)
    'edge',        -- Edge table
    'src=src_id, dest=dest_id, weight=edge_weight',
    'degrees');    -- Output table of vertex degrees
SELECT * FROM degrees ORDER BY id;
{code}
produces
{code}
 id | indegree | outdegree
----+----------+-----------
  0 |        2 |         3
  1 |        1 |         2
  2 |        2 |         3
  3 |        2 |         1
  4 |        1 |         1
  5 |        1 |         1
  6 |        2 |         1
(7 rows)
{code}
Vertex 7, which has indegree=1 and outdegree=0, is missing from the output.

Same issue with and without grouping.
[jira] [Commented] (MADLIB-1146) Elastic Net fails when used without normalization with grouping
[ https://issues.apache.org/jira/browse/MADLIB-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133294#comment-16133294 ]

ASF GitHub Bot commented on MADLIB-1146:

GitHub user njayaram2 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/172

    Elastic_net: Fix grouping without normalization bug

    JIRA: MADLIB-1146

    Selecting grouping columns into the output table was not working when
    data was NOT scaled, but grouping was used. This commit fixes it.

    Closes #172

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/njayaram2/incubator-madlib bugfix/MADlib_1146

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/172.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #172

commit 54c09a64b98a070cd4f18c85bf47bf77a2546264
Author: Nandish Jayaram
Date:   2017-08-18T17:10:58Z

    Elastic_net: Fix grouping without normalization bug

    JIRA: MADLIB-1146

    Selecting grouping columns into the output table was not working when
    data was NOT scaled, but grouping was used. This commit fixes it.

    Closes #172
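For readers unfamiliar with how a KeyError such as 'select_grp' can surface from query-building code, here is a generic Python illustration. It is a guess at the general shape of this class of bug, not the actual elastic net source: the function names, the template and the table name are hypothetical. The idea is that a template placeholder is filled only on one branch (here, when normalization is on), so formatting fails on the other branch.

```python
TEMPLATE = "SELECT {select_grp}coef FROM {tbl}"


def build_query_buggy(grouping_col, normalization):
    """Hypothetical: the grouping fragment is set only when normalizing."""
    args = {"tbl": "house_en"}
    if normalization:
        args["select_grp"] = f"{grouping_col}, " if grouping_col else ""
    # With normalization=False the placeholder has no value: KeyError.
    return TEMPLATE.format(**args)


def build_query_fixed(grouping_col, normalization):
    """Hypothetical fix: populate the fragment regardless of normalization."""
    args = {
        "tbl": "house_en",
        "select_grp": f"{grouping_col}, " if grouping_col else "",
    }
    return TEMPLATE.format(**args)


try:
    build_query_buggy("grp_by_col", normalization=False)
except KeyError as exc:
    print("KeyError:", exc)

print(build_query_fixed("grp_by_col", normalization=False))
```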
[jira] [Resolved] (MADLIB-1119) Train-test split
[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan resolved MADLIB-1119.

    Resolution: Fixed