[jira] [Commented] (MADLIB-1249) Add gini importance to RF

Jingyi Mei (JIRA) Tue, 03 Jul 2018 17:47:36 -0700


    [ 
https://issues.apache.org/jira/browse/MADLIB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532098#comment-16532098
 ]


Jingyi Mei commented on MADLIB-1249:
------------------------------------

Current working branch: 
[https://github.com/madlib/madlib/tree/rf_gini_importance]

Running dev-check against the current code on this branch produces the following errors:

pg96 and GPDB4:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group;
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39:
 NOTICE: table "train_output" does not exist, skipping
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39:
 NOTICE: table "train_output_summary" does not exist, skipping
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39:
 NOTICE: table "train_output_group" does not exist, skipping
DROP TABLE
SELECT forest_train(
 'dt_golf', -- source table
 'train_output', -- output model table
 'id' , -- id column
 'class', -- response
 'windy, "Cont_features"[1]', -- features
 NULL, -- exclude columns
 NULL, -- no grouping
 5, -- num of trees
 NULL, -- num of random features
 TRUE, -- importance
 1, -- num_permutations
 10, -- max depth
 1, -- min split
 1, -- min bucket
 8, -- number of bins per continuous variable
 'max_surrogates=0',
 FALSE
 );
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:58:
 ERROR: spiexceptions.InvalidTextRepresentation: invalid input syntax for type 
double precision: "[24.55643307062732, 75.44334141110978]"
CONTEXT: Traceback (most recent call last):
 PL/Python function "forest_train", line 39, in <module>
 sample_ratio
 PL/Python function "forest_train", line 613, in forest_train
 PL/Python function "forest_train", line 1418, in _insert_into_result_table
PL/Python function "forest_train"
{code}
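
The "invalid input syntax" message suggests that a Python list of importance values is being rendered into a slot where SQL expects a single double precision value. This is only a reading of the error text; the code at line 1418 of _insert_into_result_table is not shown here. A minimal sketch of that failure mode, using hypothetical names (importance, to_pg_array_literal) that are not taken from the MADlib source:

```python
# Hypothetical illustration of the failure mode; not the MADlib source.
importance = [24.55643307062732, 75.44334141110978]

# Interpolating the list directly yields its Python repr, "[a, b]", which
# PostgreSQL cannot parse as a single double precision value.
bad_sql = "SELECT '{0}'::double precision".format(importance)


def to_pg_array_literal(values):
    """Render a list of floats as a PostgreSQL array literal such as {1.0,2.0}."""
    return "{" + ",".join(repr(float(v)) for v in values) + "}"


# Casting an array literal to double precision[] is one way to keep the
# rendered value and the target type in agreement.
good_sql = "SELECT '{0}'::double precision[]".format(
    to_pg_array_literal(importance)
)
```

If this reading is right, either the target column type or the value formatting would need to change so that both sides agree on scalar versus array.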

GPDB5:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group;
DROP TABLE
SELECT forest_train(
 'dt_golf', -- source table
 'train_output', -- output model table
 'id', -- id column
 'class', -- response
 'humidity, temperature', -- features
 NULL, -- exclude columns
 NULL, -- no grouping
 5, -- num of trees
 1, -- num of random features
 FALSE, -- importance
 1, -- num_permutations
 10, -- max depth
 1, -- min split
 1, -- min bucket
 8, -- number of bins per continuous variable
 'max_surrogates=0',
 FALSE
 );
-[ RECORD 1 ]+-
forest_train |

DROP TABLE IF EXISTS predict_output;
DROP TABLE
SELECT forest_predict(
 'train_output',
 'dt_golf',
 'predict_output',
 'prob'
);
psql:/tmp/madlib._Ts_aE/recursive_partitioning/test/random_forest.sql_in.tmp:221:
 ERROR: plpy.Error: Random forest error: Input table 'train_output' is empty! 
(plpython.c:5038)
CONTEXT: Traceback (most recent call last):
 PL/Python function "forest_predict", line 19, in <module>
 return random_forest.forest_predict(**globals())
 PL/Python function "forest_predict", line 696, in forest_predict
 PL/Python function "forest_predict", line 1469, in _validate_predict
 PL/Python function "forest_predict", line 635, in input_tbl_valid
PL/Python function "forest_predict"
{code}

> Add gini importance to RF
> -------------------------
>
>                 Key: MADLIB-1249
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1249
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Random Forest
>            Reporter: Jingyi Mei
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.15
>
>
> As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1205:
> As a data scientist
> I want a measure of variable importance in RF
> so that
> I can understand which predictors are most useful for predicting the response 
> variable.
> We can add a similar measure to our RF code and distinguish it from our 
> permuted importance metric by naming the current metric 
> {{oob_variable_importance}} and the new metric 
> {{impurity_variable_importance}}.
> Interface
> Details of interface TBD, but involves:
>  * RF can use the {{importance}} param to generate both OOB and Gini 
> importance
>  * Write out to the {{<output_table_name>_group}} table
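
For reference, the impurity-based measure described above can be sketched as follows. This assumes the textbook definition (Gini impurity decrease at each split, weighted by the fraction of rows reaching the node and summed per feature); the function names are hypothetical and the MADlib implementation may normalize or aggregate differently.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Gini decrease from splitting parent into left and right children."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))


def impurity_importance(splits, n_total):
    """Sum weighted impurity decreases per feature for one tree.

    splits: iterable of (feature, parent_labels, left_labels, right_labels),
    one entry per internal node. n_total: number of training rows.
    """
    totals = {}
    for feature, parent, left, right in splits:
        weight = len(parent) / n_total
        gain = weight * impurity_decrease(parent, left, right)
        totals[feature] = totals.get(feature, 0.0) + gain
    return totals
```

For a forest, the per-tree totals would typically be averaged across trees; whether and how to rescale them is a design choice left open here.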



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
