[jira] [Commented] (MADLIB-1097) Random Forest does not allow NULL values in features

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005862#comment-16005862
 ] 

ASF GitHub Bot commented on MADLIB-1097:


GitHub user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/131


> Random Forest does not allow NULL values in features
> 
>
> Key: MADLIB-1097
> URL: https://issues.apache.org/jira/browse/MADLIB-1097
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Random Forest
> Reporter: Nandish Jayaram
> Assignee: Rahul Iyer
> Priority: Minor
> Fix For: v1.12
>
>
> Running forest_train() with features that have NULL values results in the 
> following error:
> {code}
> psql:/tmp/madlib.LkFR_5/recursive_partitioning/test/random_forest.sql_in.tmp:79:
>  ERROR:  spiexceptions.InvalidParameterValue: Function 
> "_rf_cat_imp_score(bytea8,integer[],double 
> precision[],integer[],integer,double precision,boolean,double precision[])": 
> Invalid type conversion. Null where not expected.
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 605, in forest_train
>   PL/Python function "forest_train", line 1052, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {code}
> The following are the input table and parameters used:
> {code:sql}
> CREATE TABLE dt_golf (
> id integer NOT NULL,
> "OUTLOOK" text,
> temperature double precision,
> humidity double precision,
> windy boolean,
> class text
> ) ;
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, false, 'Don''t Play'),
> (2, 'sunny', 80, 90, true, 'Don''t Play'),
> (3, 'overcast', 83, 78, false, 'Play'),
> (4, 'rain', NULL, 96, false, 'Play'),
> (5, 'rain', 68, 80, NULL, 'Play'),
> (6, 'rain', 65, 70, true, 'Don''t Play'),
> (7, 'overcast', 64, 65, true, 'Play'),
> (8, 'sunny', 72, 95, false, 'Don''t Play'),
> (9, 'sunny', 69, 70, false, 'Play'),
> (10, 'rain', 75, 80, false, 'Play'),
> (11, 'sunny', 75, 70, true, 'Play'),
> (12, 'overcast', 72, 90, true, 'Play'),
> (13, 'overcast', 81, 75, false, 'Play'),
> (14, 'rain', 71, 80, true, 'Don''t Play');
> SELECT forest_train(
>   'dt_golf'::TEXT,             -- source table
>   'train_output'::TEXT,        -- output model table
>   'id'::TEXT,                  -- id column
>   'class'::TEXT,               -- response
>   'windy, temperature'::TEXT,  -- features
>   NULL::TEXT,                  -- exclude columns
>   NULL::TEXT,                  -- no grouping
>   5,                           -- num of trees
>   1,                           -- num of random features
>   TRUE::BOOLEAN,               -- importance
>   1::INTEGER,                  -- num_permutations
>   10::INTEGER,                 -- max depth
>   1::INTEGER,                  -- min split
>   1::INTEGER,                  -- min bucket
>   8::INTEGER,                  -- number of bins per continuous variable
>   'max_surrogates=0',          -- surrogate parameters
>   FALSE                        -- verbose
>   );
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-965) RF and DT should accept array input for feature vector

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005863#comment-16005863
 ] 

ASF GitHub Bot commented on MADLIB-965:
---

GitHub user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/132


> RF and DT should accept array input for feature vector
> --
>
> Key: MADLIB-965
> URL: https://issues.apache.org/jira/browse/MADLIB-965
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Decision Tree, Module: Random Forest
> Reporter: Rashmi Raghu
> Assignee: Rahul Iyer
> Priority: Minor
> Fix For: v1.12
>
> Attachments: DT and RF work1.ipynb
>
>
> We were trying to test whether the RF module could handle a column containing 
> an array of features as input (instead of each feature in a separate column). 
> The result was an error message, but the message does not make clear the source 
> of the error (i.e., whether it is caused by the array feature input column or 
> something else). An example table, query, and error can be found below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
> id integer NOT NULL,
> "OUTLOOK" text,
> temperature double precision,
> humidity double precision,
> windy text,
> class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as 
> select id, array[temperature, humidity] as input_array, class
> from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('dt_golf_array',  -- source table
>                            'train_output',   -- output model table
>                            'id',             -- id column
>                            'class',          -- response
>                            'input_array',    -- features
>                            NULL,             -- exclude columns
>                            NULL,             -- grouping columns
>                            20::integer,      -- number of trees
>                            1::integer,       -- number of random features
>                            TRUE::boolean,    -- variable importance
>                            1::integer,       -- num_permutations
>                            8::integer,       -- max depth
>                            3::integer,       -- min split
>                            1::integer,       -- min bucket
>                            10::integer       -- number of splits per continuous variable
>                            );
> NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 
> 'id' as the Greenplum Database data distribution key for this table.
> HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make 
> sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ** Error **
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}





[jira] [Commented] (MADLIB-965) RF and DT should accept array input for feature vector

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005650#comment-16005650
 ] 

ASF GitHub Bot commented on MADLIB-965:
---

GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/132

DT/RF: Allow array input for features

JIRA: MADLIB-965

Currently, array columns are not allowed as features in the decision tree
and random forest train functions. This commit adds support for a mixed
list of features: arrays and individual columns of multiple types can be
combined into a single list. Each array is expanded so that each of its
elements is treated as a separate feature.
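
The expansion described above can be illustrated with a minimal Python
sketch. It is not the code from this pull request: the function name
`expand_array_features` and the `array_lengths` argument are hypothetical,
and the sketch assumes that every array column has a known, fixed length.

```python
def expand_array_features(features, array_lengths):
    """Expand array-valued columns into one SQL expression per element.

    features: feature column names, in the order given by the user.
    array_lengths: hypothetical dict mapping each array column name to
        its (assumed fixed) length; columns missing from the dict are
        treated as ordinary scalar features.
    """
    expanded = []
    for col in features:
        length = array_lengths.get(col)
        if length is None:
            # Scalar column: keep it as a single feature.
            expanded.append(col)
        else:
            # PostgreSQL arrays are 1-indexed by default.
            expanded.extend(
                "{0}[{1}]".format(col, i) for i in range(1, length + 1)
            )
    return expanded


# A mixed list: one scalar column and one two-element array column.
print(expand_array_features(["windy", "input_array"], {"input_array": 2}))
```

With the `dt_golf_array` table from the issue, `input_array` would be
treated as two features, `input_array[1]` and `input_array[2]`.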

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/dt_array_feature_support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #132


commit 2f1ddee5ab957684988dac575627760a1dfd67bb
Author: Rahul Iyer 
Date:   2017-05-09T21:50:52Z

DT/RF: Allow array input for features

JIRA: MADLIB-965

Currently, array columns are not allowed as features in the decision tree
and random forest train functions. This commit adds support for a mixed
list of features: arrays and individual columns of multiple types can be
combined into a single list. Each array is expanded so that each of its
elements is treated as a separate feature.





[jira] [Commented] (MADLIB-1097) Random Forest does not allow NULL values in features

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005621#comment-16005621
 ] 

ASF GitHub Bot commented on MADLIB-1097:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/131

RF: Filter NULL dependent values in OOB

JIRA: MADLIB-1097

Added the `filter_null` string, obtained from decision_tree.py, to the OOB
view to exclude rows that have NULL dependent values.
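
As a rough illustration of the idea, the Python sketch below builds a NULL
filter and appends it to a view definition. It is not the code from this
pull request: the helper names `null_filter` and `oob_view_sql` and the
view name are hypothetical, and the real OOB view selects more than the
raw source rows.

```python
def null_filter(dependent_expr):
    """Return a SQL predicate that is true only for non-NULL responses."""
    return "({0}) IS NOT NULL".format(dependent_expr)


def oob_view_sql(view_name, source_table, dependent_expr):
    """Build a view definition that skips rows with a NULL response.

    All names are interpolated verbatim for brevity; production code
    should quote identifiers before building the statement.
    """
    return "CREATE VIEW {view} AS SELECT * FROM {src} WHERE {flt}".format(
        view=view_name,
        src=source_table,
        flt=null_filter(dependent_expr),
    )


print(oob_view_sql("oob_view", "dt_golf", "class"))
```

Rows whose response is NULL never reach the out-of-bag prediction step,
because the predicate removes them when the view is evaluated.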

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/rf_null_dep_values

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #131


commit 9b45ecaaadb9e0d4999dc49e72df8a97cb7692d2
Author: Rahul Iyer 
Date:   2017-05-04T00:07:55Z

RF: Filter NULL dependent values in OOB

JIRA: MADLIB-1097

Added the `filter_null` string, obtained from decision_tree.py, to the OOB
view to exclude rows that have NULL dependent values.









[jira] [Resolved] (MADLIB-965) RF and DT should accept array input for feature vector

2017-05-10 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-965.

Resolution: Fixed






[jira] [Commented] (MADLIB-1087) Random Forest fails if features are INT or NUMERIC only and variable importance is TRUE

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005602#comment-16005602
 ] 

ASF GitHub Bot commented on MADLIB-1087:


GitHub user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/129


> Random Forest fails if features are INT or NUMERIC only and variable 
> importance is TRUE
> ---
>
> Key: MADLIB-1087
> URL: https://issues.apache.org/jira/browse/MADLIB-1087
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Random Forest
> Reporter: Paul Chang
> Assignee: Rahul Iyer
> Priority: Minor
> Fix For: v1.12
>
>
> If we attempt to train on a dataset where all features are either INT or 
> NUMERIC, and with variable importance TRUE, forest_train() fails with the 
> following error:
> [2017-04-03 13:35:35] [XX000] ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> [2017-04-03 13:35:35] Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> [2017-04-03 13:35:35] Where: Traceback (most recent call last):
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 42, in <module>
> [2017-04-03 13:35:35] sample_ratio
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 591, in forest_train
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 1038, in _calculate_oob_prediction
> [2017-04-03 13:35:35] PL/Python function "forest_train"
> However, if we add a single feature column that is FLOAT, REAL, or DOUBLE 
> PRECISION, the trainer does not fail.


