[jira] [Commented] (MADLIB-1097) Random Forest does not allow NULL values in features
[ https://issues.apache.org/jira/browse/MADLIB-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005862#comment-16005862 ]

ASF GitHub Bot commented on MADLIB-1097:
----------------------------------------

Github user iyerr3 closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/131

> Random Forest does not allow NULL values in features
> ----------------------------------------------------
>
>                 Key: MADLIB-1097
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1097
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Random Forest
>            Reporter: Nandish Jayaram
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>
> Running forest_train() with features that have NULL values results in the following error:
> {code}
> psql:/tmp/madlib.LkFR_5/recursive_partitioning/test/random_forest.sql_in.tmp:79: ERROR: spiexceptions.InvalidParameterValue: Function "_rf_cat_imp_score(bytea8,integer[],double precision[],integer[],integer,double precision,boolean,double precision[])": Invalid type conversion. Null where not expected.
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 605, in forest_train
>   PL/Python function "forest_train", line 1052, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {code}
> The following are the input table and parameters used:
> {code:sql}
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     windy boolean,
>     class text
> ) ;
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, false, 'Don''t Play'),
> (2, 'sunny', 80, 90, true, 'Don''t Play'),
> (3, 'overcast', 83, 78, false, 'Play'),
> (4, 'rain', NULL, 96, false, 'Play'),
> (5, 'rain', 68, 80, NULL, 'Play'),
> (6, 'rain', 65, 70, true, 'Don''t Play'),
> (7, 'overcast', 64, 65, true, 'Play'),
> (8, 'sunny', 72, 95, false, 'Don''t Play'),
> (9, 'sunny', 69, 70, false, 'Play'),
> (10, 'rain', 75, 80, false, 'Play'),
> (11, 'sunny', 75, 70, true, 'Play'),
> (12, 'overcast', 72, 90, true, 'Play'),
> (13, 'overcast', 81, 75, false, 'Play'),
> (14, 'rain', 71, 80, true, 'Don''t Play');
> SELECT forest_train(
>     'dt_golf'::TEXT,             -- source table
>     'train_output'::TEXT,        -- output model table
>     'id'::TEXT,                  -- id column
>     'class'::TEXT,               -- response
>     'windy, temperature'::TEXT,  -- features
>     NULL::TEXT,                  -- exclude columns
>     NULL::TEXT,                  -- no grouping
>     5,                           -- num of trees
>     1,                           -- num of random features
>     TRUE::BOOLEAN,               -- importance
>     1::INTEGER,                  -- num_permutations
>     10::INTEGER,                 -- max depth
>     1::INTEGER,                  -- min split
>     1::INTEGER,                  -- min bucket
>     8::INTEGER,                  -- number of bins per continuous variable
>     'max_surrogates=0',
>     FALSE
> );
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (MADLIB-965) RF and DT should accept array input for feature vector
[ https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005863#comment-16005863 ]

ASF GitHub Bot commented on MADLIB-965:
---------------------------------------

Github user iyerr3 closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/132

> RF and DT should accept array input for feature vector
> ------------------------------------------------------
>
>                 Key: MADLIB-965
>                 URL: https://issues.apache.org/jira/browse/MADLIB-965
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Decision Tree, Module: Random Forest
>            Reporter: Rashmi Raghu
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>         Attachments: DT and RF work1.ipynb
>
>
> We were trying to test whether the RF module could handle a column containing array of features as input (instead of each feature in a separate column). The result was an error message but that message is unclear as to source of error (i.e. is it because of the array feature input column or something else). Example table, query and error can be found below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     windy text,
>     class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as
> select id, array[temperature, humidity] as input_array, class
> from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('dt_golf_array',  -- source table
>                            'train_output',   -- output model table
>                            'id',             -- id column
>                            'class',          -- response
>                            'input_array',    -- features
>                            NULL,             -- exclude columns
>                            NULL,             -- grouping columns
>                            20::integer,      -- number of trees
>                            1::integer,       -- number of random features
>                            TRUE::boolean,    -- variable importance
>                            1::integer,       -- num_permutations
>                            8::integer,       -- max depth
>                            3::integer,       -- min split
>                            1::integer,       -- min bucket
>                            10::integer       -- number of splits per continuous variable
>                            );
> NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
> HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ** Error **
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}
[jira] [Commented] (MADLIB-965) RF and DT should accept array input for feature vector
[ https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005650#comment-16005650 ]

ASF GitHub Bot commented on MADLIB-965:
---------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/132

    DT/RF: Allow array input for features

    JIRA: MADLIB-965

    Currently array columns are not allowed as features in the decision tree
    and random forest train functions. This commit adds support for a mixed
    list of features: arrays and individual columns of multiple types can be
    combined into a single list. Each array is expanded to treat each
    element of the array as a feature.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/dt_array_feature_support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #132

commit 2f1ddee5ab957684988dac575627760a1dfd67bb
Author: Rahul Iyer
Date:   2017-05-09T21:50:52Z

    DT/RF: Allow array input for features

    JIRA: MADLIB-965

    Currently array columns are not allowed as features in the decision tree
    and random forest train functions. This commit adds support for a mixed
    list of features: arrays and individual columns of multiple types can be
    combined into a single list. Each array is expanded to treat each
    element of the array as a feature.
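
The array expansion described in pull request #132 can be pictured with a small Python sketch. This is not the MADlib implementation: `expand_features` and its `array_lengths` argument are hypothetical names introduced only for illustration, and the sketch assumes the length of each array column is already known. It relies only on the fact that PostgreSQL arrays are 1-based by default.

```python
def expand_features(features, array_lengths):
    """Expand array-valued columns into one SQL expression per element.

    Hypothetical helper, not part of MADlib.
    features      -- ordered list of feature column names
    array_lengths -- dict mapping each array column name to its element count;
                     columns missing from the dict are treated as scalars
    """
    expanded = []
    for col in features:
        n = array_lengths.get(col)
        if n is None:
            # Scalar column: keep it as a single feature.
            expanded.append(col)
        else:
            # Array column: one feature per element, using 1-based subscripts.
            expanded.extend("{0}[{1}]".format(col, i) for i in range(1, n + 1))
    return expanded


print(expand_features(["windy", "input_array"], {"input_array": 2}))
# prints ['windy', 'input_array[1]', 'input_array[2]']
```

With the `dt_golf_array` table from the issue, `input_array` holds `temperature` and `humidity`, so a mixed list of one scalar column and that array would yield three features.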
[jira] [Commented] (MADLIB-1097) Random Forest does not allow NULL values in features
[ https://issues.apache.org/jira/browse/MADLIB-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005621#comment-16005621 ]

ASF GitHub Bot commented on MADLIB-1097:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/131

    RF: Filter NULL dependent values in OOB

    JIRA: MADLIB-1097

    Added `filter_null` string obtained from decision_tree.py into the OOB
    view to exclude rows that have NULL dependent values.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib bugfix/rf_null_dep_values

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #131

commit 9b45ecaaadb9e0d4999dc49e72df8a97cb7692d2
Author: Rahul Iyer
Date:   2017-05-04T00:07:55Z

    RF: Filter NULL dependent values in OOB

    JIRA: MADLIB-1097

    Added `filter_null` string obtained from decision_tree.py into the OOB
    view to exclude rows that have NULL dependent values.
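
The idea in pull request #131, adding a NULL filter to the out-of-bag (OOB) view, can be sketched in Python as plain string building. This is an illustration under stated assumptions, not the code from decision_tree.py or random_forest.py: the function `oob_view_sql`, its parameters, the view name, and the exact form of the `filter_null` expression are all hypothetical, and the real OOB view selects more than `*` from the source table.

```python
def oob_view_sql(view_name, source_table, dependent_variable):
    """Build SQL for a view that skips rows whose dependent value is NULL.

    Hypothetical helper, not part of MADlib; it only illustrates the filter.
    """
    # Assumed shape of the filter string; the real one comes from decision_tree.py.
    filter_null = "({0}) IS NOT NULL".format(dependent_variable)
    return "CREATE VIEW {view} AS SELECT * FROM {src} WHERE {flt}".format(
        view=view_name, src=source_table, flt=filter_null)


print(oob_view_sql("oob_view", "dt_golf", "class"))
# prints CREATE VIEW oob_view AS SELECT * FROM dt_golf WHERE (class) IS NOT NULL
```

Filtering in the view definition means every later query against the OOB data sees only rows with a usable response value, instead of each caller repeating the check.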
[jira] [Resolved] (MADLIB-965) RF and DT should accept array input for feature vector
[ https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan resolved MADLIB-965.
------------------------------------
    Resolution: Fixed
[jira] [Commented] (MADLIB-1087) Random Forest fails if features are INT or NUMERIC only and variable importance is TRUE
[ https://issues.apache.org/jira/browse/MADLIB-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005602#comment-16005602 ]

ASF GitHub Bot commented on MADLIB-1087:
----------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/129

> Random Forest fails if features are INT or NUMERIC only and variable importance is TRUE
> ---------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1087
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1087
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Paul Chang
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>
> If we attempt to train on a dataset where all features are either INT or NUMERIC, and with variable importance TRUE, forest_train() fails with the following error:
> [2017-04-03 13:35:35] [XX000] ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> [2017-04-03 13:35:35] Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> [2017-04-03 13:35:35] Where: Traceback (most recent call last):
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 42, in <module>
> [2017-04-03 13:35:35] sample_ratio
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 591, in forest_train
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 1038, in _calculate_oob_prediction
> [2017-04-03 13:35:35] PL/Python function "forest_train"
> However, if we add a single feature column that is FLOAT, REAL, or DOUBLE PRECISION, the trainer does not fail.