[ https://issues.apache.org/jira/browse/MADLIB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532098#comment-16532098 ]
Jingyi Mei commented on MADLIB-1249:
------------------------------------

Current working branch: [https://github.com/madlib/madlib/tree/rf_gini_importance]

Testing the current code with dev-check produces the following errors.

pg96 and GPDB4:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group;
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39: NOTICE: table "train_output" does not exist, skipping
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39: NOTICE: table "train_output_summary" does not exist, skipping
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:39: NOTICE: table "train_output_group" does not exist, skipping
DROP TABLE
SELECT forest_train(
    'dt_golf',                      -- source table
    'train_output',                 -- output model table
    'id' ,                          -- id column
    'class',                        -- response
    'windy, "Cont_features"[1]',    -- features
    NULL,                           -- exclude columns
    NULL,                           -- no grouping
    5,                              -- num of trees
    NULL,                           -- num of random features
    TRUE,                           -- importance
    1,                              -- num_permutations
    10,                             -- max depth
    1,                              -- min split
    1,                              -- min bucket
    8,                              -- number of bins per continuous variable
    'max_surrogates=0',
    FALSE
);
psql:/tmp/madlib.mYqFM5/recursive_partitioning/test/random_forest.sql_in.tmp:58: ERROR: spiexceptions.InvalidTextRepresentation: invalid input syntax for type double precision: "[24.55643307062732, 75.44334141110978]"
CONTEXT: Traceback (most recent call last):
  PL/Python function "forest_train", line 39, in <module>
    sample_ratio
  PL/Python function "forest_train", line 613, in forest_train
  PL/Python function "forest_train", line 1418, in _insert_into_result_table
PL/Python function "forest_train"
{code}

GPDB5:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group;
DROP TABLE
SELECT forest_train(
    'dt_golf',                      -- source table
    'train_output',                 -- output model table
    'id',                           -- id column
    'class',                        -- response
    'humidity, temperature',        -- features
    NULL,                           -- exclude columns
    NULL,                           -- no grouping
    5,                              -- num of trees
    1,                              -- num of random features
    FALSE,                          -- importance
    1,                              -- num_permutations
    10,                             -- max depth
    1,                              -- min split
    1,                              -- min bucket
    8,                              -- number of bins per continuous variable
    'max_surrogates=0',
    FALSE
);
-[ RECORD 1 ]+-
forest_train |

DROP TABLE IF EXISTS predict_output;
DROP TABLE
SELECT forest_predict(
    'train_output',
    'dt_golf',
    'predict_output',
    'prob'
);
psql:/tmp/madlib._Ts_aE/recursive_partitioning/test/random_forest.sql_in.tmp:221: ERROR: plpy.Error: Random forest error: Input table 'train_output' is empty! (plpython.c:5038)
CONTEXT: Traceback (most recent call last):
  PL/Python function "forest_predict", line 19, in <module>
    return random_forest.forest_predict(**globals())
  PL/Python function "forest_predict", line 696, in forest_predict
  PL/Python function "forest_predict", line 1469, in _validate_predict
  PL/Python function "forest_predict", line 635, in input_tbl_valid
PL/Python function "forest_predict"
{code}

> Add gini importance to RF
> -------------------------
>
>                 Key: MADLIB-1249
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1249
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Random Forest
>            Reporter: Jingyi Mei
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.15
>
>
> As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1205
> As a data scientist
> I want a measure of variable importance in RF
> so that
> I can understand which predictors are most useful for predicting the response
> variable.
> We can add a similar measure in our RF code and distinguish it from our
> permuted importance metric by calling the current metric
> {{oob_variable_importance}} and the new metric
> {{impurity_variable_importance}}.
> Interface
> Details of interface TBD, but involves:
> * RF can use the {{importance}} param to generate both oob and gini
> importance
> * Write out to the {{<output_table_name>_group}} table

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
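
For reference, the impurity-based measure requested in the issue description can be sketched for a single tree as follows. This is a minimal illustration of the general technique (summing the weighted Gini decrease of each split per feature), not MADlib's implementation: the `Split` record, the `gini` and `impurity_importance` helpers, and the class counts are all hypothetical. Only the feature names are borrowed from the GPDB5 test call.

```python
from collections import namedtuple

# Hypothetical record for one internal node of a classification tree:
# the feature it splits on, plus class counts in the node and its children.
Split = namedtuple("Split", ["feature", "parent", "left", "right"])


def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = float(sum(counts))
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_importance(splits, n_rows):
    """Sum the row-weighted Gini decrease of every split, grouped by feature."""
    scores = {}
    for s in splits:
        n_parent, n_left, n_right = sum(s.parent), sum(s.left), sum(s.right)
        decrease = (n_parent * gini(s.parent)
                    - n_left * gini(s.left)
                    - n_right * gini(s.right)) / float(n_rows)
        scores[s.feature] = scores.get(s.feature, 0.0) + decrease
    return scores


# Toy tree on 14 rows; the counts are made up for illustration.
splits = [
    Split("humidity", parent=[9, 5], left=[6, 1], right=[3, 4]),
    Split("temperature", parent=[3, 4], left=[3, 0], right=[0, 4]),
]
scores = impurity_importance(splits, n_rows=14)
```

In a forest, per-tree scores like these would typically be averaged over all trees. Unlike the permutation-based OOB metric, this measure needs no extra passes over the data, since the impurity decreases are already available at training time.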