[jira] [Reopened] (MADLIB-1254) RF/DT: Grouping might give incorrect results if 1 group eliminates a categorical variable

Frank McQuillan (JIRA) Tue, 24 Jul 2018 13:08:18 -0700


     [ 
https://issues.apache.org/jira/browse/MADLIB-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Frank McQuillan reopened MADLIB-1254:
-------------------------------------

Still seeing this issue:

{code}
madlib=# SELECT * from train_output_group;
-[ RECORD 1 
]-----------+--------------------------------------------------------------------------------------------------------
gid                     | 1
class                   | Don't Play
success                 | t
cat_n_levels            | {2,2}
cat_levels_in_text      | {c,a,True,False}
oob_error               | 48.6388888888889
oob_var_importance      | {0,0,0,10.92,0}
impurity_var_importance | 
{26.6666661315212,0,0,33.3333327455909,33.3333327455909,6.66666661406969}
-[ RECORD 2 
]-----------+--------------------------------------------------------------------------------------------------------
gid                     | 2
class                   | Play
success                 | t
cat_n_levels            | {2,2,2}
cat_levels_in_text      | {c,a,b,d,False,True}
oob_error               | 31.6581790123457
oob_var_importance      | {0,1.128,1.128,1.128,37.78425,7.048}
impurity_var_importance | 
{5.74406918728342,8.34440322933007,8.34440322933007,5.69460300469225,46.0230475868173,25.8494712234864}
{code}

{code}
madlib=# SELECT * from train_output_summary;
-[ RECORD 1 
]---------+---------------------------------------------------------------------------------------------
method                | forest_train
is_classification     | f
source_table          | dt_golf
model_table           | train_output
id_col_name           | id
dependent_varname     | temperature::double precision
independent_varnames  | 
(cat_features)[1],(cat_features)[2],windy,humidity,("Cont_features")[1],("Cont_features")[2]
cat_features          | (cat_features)[1],(cat_features)[2],windy
con_features          | humidity,("Cont_features")[1],("Cont_features")[2]
grouping_cols         | class
num_trees             | 5
num_random_features   | 2
max_tree_depth        | 10
min_split             | 1
min_bucket            | 1
num_splits            | 3
verbose               | f
importance            | t
num_permutations      | 20
num_all_groups        | 2
num_failed_groups     | 0
total_rows_processed  | 14
total_rows_skipped    | 2
dependent_var_levels  | 
dependent_var_type    | double precision
independent_var_types | text, text, boolean, double precision, double 
precision, double precision
null_proxy            | None
{code}


> RF/DT: Grouping might give incorrect results if 1 group eliminates a 
> categorical variable
> -----------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1254
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1254
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Rahul Iyer
>            Priority: Major
>             Fix For: v1.15
>
>
> If {{forest_train}} is run with grouping enabled and if one of the groups has 
> a categorical feature with just single level, then the categorical feature is 
> eliminated for that group. If other groups retain that feature, then the 
> output of impurity_var_importance is incorrect for the group in question. 
> There could be other ramifications related to this as well. 
> {code:java}
> DROP TABLE IF EXISTS dt_golf CASCADE;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     "Cont_features" double precision[],
>     cat_features text[],
>     windy boolean,
>     class text
> ) ;
> INSERT INTO dt_golf 
> (id,"OUTLOOK",temperature,humidity,"Cont_features",cat_features, windy,class) 
> VALUES
> (1, 'sunny', 85, 85,ARRAY[85, 85], ARRAY['a', 'b'], false, 'Don''t Play'),
> (2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['a', 'b'], true, 'Don''t Play'),
> (3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['a', 'b'], false, 'Play'),
> (4, 'rain', 70, NULL, ARRAY[70, 96], ARRAY['a', 'b'], false, 'Play'),
> (5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['a', 'b'], false, 'Play'),
> (6, 'rain', NULL, 70, ARRAY[65, 70], ARRAY['a', 'b'], true, 'Don''t Play'),
> (7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['c', 'b'], NULL , 'Play'),
> (8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['a', 'b'], false, 'Don''t Play'),
> (9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['a', 'b'], false, 'Play'),
> (10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['a', 'b'], false, 'Play'),
> (11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['a', 'd'], true, 'Play'),
> (12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['c', 'b'], NULL, 'Play'),
> (13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
> (15, NULL, 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
> (16, 'overcast', NULL, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play'),
> (14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['c', 'b'], true, 'Don''t Play');
> DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group, 
> train_output_poisson_count;
> SELECT forest_train(
>                   'dt_golf',         -- source table
>                   'train_output',    -- output model table
>                   'id',              -- id column
>                   'temperature::double precision',           -- response
>                   'humidity, cat_features, windy, "Cont_features"',   -- 
> features
>                   NULL,        -- exclude columns
>                   'class',          -- grouping
>                   5,                -- num of trees
>                   NULL,                 -- num of random features
>                   TRUE,     -- importance
>                   20,         -- num_permutations
>                   10,       -- max depth
>                   1,        -- min split
>                   1,        -- min bucket
>                   3,        -- number of bins per continuous variable
>                   'max_surrogates = 2 ',
>                   FALSE
>                   );
> \x on
> SELECT * from train_output_summary;
> SELECT * from train_output_group;
> {code}
> Results:
> {code:java}
> SELECT * from train_output_group;
> -[ RECORD 1 
> ]-----------+-----------------------------------------------------------------------------
> gid                     | 1
> class                   | Don't Play
> success                 | t
> cat_n_levels            | {2,2,2}
> cat_levels_in_text      | {c,a,True,False,c,a}
> oob_error               | 92.5335905349795
> oob_var_importance      | {10.725,10.725,10.725,7.605,10.725,0}
> impurity_var_importance | 
> {8.33148348160485,0,0,19.9999998625892,19.9999998625892,11.6685163809844}
> -[ RECORD 2 
> ]-----------+-----------------------------------------------------------------------------
> gid                     | 2
> class                   | Play
> success                 | t
> cat_n_levels            | {2,2}
> cat_levels_in_text      | {b,d,False,True}
> oob_error               | 43.0244073645405
> oob_var_importance      | 
> {1.06581410364015e-15,1.06581410364015e-15,2.1326171875,16.019375,10.570875}
> impurity_var_importance | 
> {0,0,0,37.8304000437732,38.4881698525677,23.6814277291654}
> {code}
> Note that the {{impurity_var_importance}} for {{gid=2}} has length 6 while 
> the {{oob_var_importance}} correctly has 5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Reopened] (MADLIB-1254) RF/DT: Grouping might give incorrect results if 1 group eliminates a categorical variable

Reply via email to