[ https://issues.apache.org/jira/browse/MADLIB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532098#comment-16532098 ]
Jingyi Mei commented on MADLIB-1249:
------------------------------------
Current working branch:
[https://github.com/madlib/madlib/tree/rf_gini_importance]
Running dev-check against the current code produces the following errors.
PostgreSQL 9.6 and GPDB4:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group;
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39:
NOTICE: table "train_output" does not exist, skipping
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39:
NOTICE: table "train_output_summary" does not exist, skipping
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39:
NOTICE: table "train_output_group" does not exist, skipping
DROP TABLE
SELECT forest_train(
'dt_golf', -- source table
'train_output', -- output model table
'id' , -- id column
'class', -- response
'windy, "Cont_features"[1]', -- features
NULL, -- exclude columns
NULL, -- no grouping
5, -- num of trees
NULL, -- num of random features
TRUE, -- importance
1, -- num_permutations
10, -- max depth
1, -- min split
1, -- min bucket
8, -- number of bins per continuous variable
'max_surrogates=0',
FALSE
);
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:58:
ERROR: spiexceptions.InvalidTextRepresentation: invalid input syntax for type
double precision: "[24.55643307062732, 75.44334141110978]"
CONTEXT: Traceback (most recent call last):
PL/Python function "forest_train", line 39, in <module>
sample_ratio
PL/Python function "forest_train", line 613, in forest_train
PL/Python function "forest_train", line 1418, in _insert_into_result_table
PL/Python function "forest_train"
{code}
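One plausible reading of the error above, not confirmed in this thread, is that a Python list of importance values was converted with str() and sent to a column or cast declared as double precision. The sketch below only illustrates that formatting difference. The helper to_pg_array_literal is hypothetical and is not part of MADlib; the two values are copied from the error message.

```python
# Hypothetical illustration; this is not MADlib code.

def to_pg_array_literal(values):
    """Format a sequence of numbers as a PostgreSQL array literal, e.g. '{1.5,2.0}'."""
    return "{" + ",".join(repr(float(v)) for v in values) + "}"


# Values copied from the error message in the log above.
importance = [24.55643307062732, 75.44334141110978]

# str() on a Python list yields a bracketed string. PostgreSQL cannot parse
# this as a scalar double precision value, which matches the reported error.
as_python_str = str(importance)

# A brace-delimited string is the text form PostgreSQL accepts for an array,
# so it would only be valid if the target type is double precision[].
as_pg_literal = to_pg_array_literal(importance)

print(as_python_str)
print(as_pg_literal)
```

If this reading is right, the fix would be either to declare the target as double precision[] and pass a brace-delimited literal, or to insert one scalar per feature.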
GPDB5:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group;
DROP TABLE
SELECT forest_train(
'dt_golf', -- source table
'train_output', -- output model table
'id', -- id column
'class', -- response
'humidity, temperature', -- features
NULL, -- exclude columns
NULL, -- no grouping
5, -- num of trees
1, -- num of random features
FALSE, -- importance
1, -- num_permutations
10, -- max depth
1, -- min split
1, -- min bucket
8, -- number of bins per continuous variable
'max_surrogates=0',
FALSE
);
-[ RECORD 1 ]+-
forest_train |
DROP TABLE IF EXISTS predict_output;
DROP TABLE
SELECT forest_predict(
'train_output',
'dt_golf',
'predict_output',
'prob'
);
psql:/tmp/madlib._Ts_aE/recursive_partitioning/test/random_forest.sql_in.tmp:221:
ERROR: plpy.Error: Random forest error: Input table 'train_output' is empty!
(plpython.c:5038)
CONTEXT: Traceback (most recent call last):
PL/Python function "forest_predict", line 19, in <module>
return random_forest.forest_predict(**globals())
PL/Python function "forest_predict", line 696, in forest_predict
PL/Python function "forest_predict", line 1469, in _validate_predict
PL/Python function "forest_predict", line 635, in input_tbl_valid
PL/Python function "forest_predict"
{code}
> Add gini importance to RF
> -------------------------
>
> Key: MADLIB-1249
> URL: https://issues.apache.org/jira/browse/MADLIB-1249
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Random Forest
> Reporter: Jingyi Mei
> Assignee: Jingyi Mei
> Priority: Major
> Fix For: v1.15
>
>
> Follow-up to https://issues.apache.org/jira/browse/MADLIB-1205
> As a data scientist, I want a measure of variable importance in RF so that
> I can understand which predictors are most useful for predicting the
> response variable.
> We can add a similar measure in our RF code and distinguish it from our
> permuted importance metric by naming the current metric
> {{oob_variable_importance}} and the new metric
> {{impurity_variable_importance}}.
> Interface
> Details of interface TBD, but involves:
> * RF can use the {{importance}} param to generate both OOB and Gini
> importance
> * Write out to the {{<output_table_name>_group}} table
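
For readers unfamiliar with the metric named in the issue, the following is a generic sketch of Gini (impurity-based) importance for a single tree: each split contributes its sample-weighted decrease in Gini impurity to the feature it splits on. It is not MADlib's implementation. The function names, the split tuple layout, and the normalization to a total of 100 are all assumptions made for illustration.

```python
# Generic sketch of impurity-based importance; not MADlib code.

def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = float(sum(counts))
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_importance(splits, n_features):
    """Sum the weighted impurity decrease per feature, scaled to total 100.

    Each split is (feature_index, parent_counts, left_counts, right_counts).
    Assumes the first split in the list is the root of the tree.
    """
    n_root = float(sum(splits[0][1]))
    raw = [0.0] * n_features
    for feat, parent, left, right in splits:
        n_p, n_l, n_r = sum(parent), sum(left), sum(right)
        decrease = (n_p * gini(parent) - n_l * gini(left) - n_r * gini(right)) / n_root
        raw[feat] += decrease
    total = sum(raw)
    return [100.0 * r / total if total > 0 else 0.0 for r in raw]


# Toy tree: the root splits on feature 0, its left child splits on feature 1.
toy_splits = [
    (0, [5, 5], [5, 1], [0, 4]),
    (1, [5, 1], [5, 0], [0, 1]),
]
toy_importance = impurity_importance(toy_splits, n_features=2)
print(toy_importance)
```

For a forest, one common choice is to average these per-tree values across trees. Unlike the permutation-based OOB metric, this needs no extra passes over the data, because the impurity decreases are already available at training time.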