[
https://issues.apache.org/jira/browse/MADLIB-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267898#comment-16267898
]
Frank McQuillan commented on MADLIB-1173:
-----------------------------------------
This looks like it is working now. Here’s a dummy case to test:
{code}
DROP TABLE IF EXISTS dt_golf;
CREATE TABLE dt_golf (
id integer NOT NULL,
outlook text,
temperature double precision,
humidity double precision,
windy text,
class text,
array_vals double precision[]
) ;
INSERT INTO dt_golf (id,outlook,temperature,humidity,windy,class, array_vals)
VALUES
(1, 'sunny', 85, 85, 'false', 'Don''t Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(2, 'sunny', 80, 90, 'true', 'Don''t Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(3, 'overcast', 83, 78, 'false', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(4, 'rain', 70, 96, 'false', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(5, 'rain', 68, 80, 'false', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(6, 'rain', 65, 70, 'true', 'Don''t Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(7, 'overcast', 64, 65, 'true', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(8, 'sunny', 72, 95, 'false', 'Don''t Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(9, 'sunny', 69, 70, 'false', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(10, 'rain', 75, 80, 'false', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(11, 'sunny', 75, 70, 'true', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(12, 'overcast', 72, 90, 'true', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(13, 'overcast', 81, 75, 'false', 'Play', ARRAY(SELECT * FROM
generate_series(1,3000))),
(14, 'rain', 71, 80, 'true', 'Don''t Play', ARRAY(SELECT * FROM
generate_series(1,3000)));
{code}
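Optionally, as a sanity check that the test data really exceeds the 1600-column limit mentioned in the issue description, the array length per row can be inspected. This is only a sketch and assumes the dt_golf table was created exactly as above; array_upper is the standard PostgreSQL function, and the n_features alias is just an illustrative name.
{code}
-- Each row should report 3000 elements, well above the 1600 table column limit.
SELECT id, array_upper(array_vals, 1) AS n_features
FROM dt_golf
ORDER BY id;
{code}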
Train:
{code}
DROP TABLE IF EXISTS train_output, train_output_summary;
SELECT madlib.tree_train('dt_golf', -- source table
'train_output', -- output model table
'id', -- id column
'class', -- response
'outlook, temperature, windy, array_vals', -- features
NULL::text, -- exclude columns
'gini', -- split criterion
NULL::text, -- no grouping
NULL::text, -- no weights
3, -- max depth
3, -- min split
1, -- min bucket
3 -- number of bins per continuous variable
);
{code}
Predict:
{code}
DROP TABLE IF EXISTS prediction_results;
SELECT madlib.tree_predict('train_output', -- tree model
'dt_golf', -- new data table
'prediction_results', -- output table
'response'); -- show prediction
SELECT g.id, class, estimated_class FROM prediction_results p, dt_golf g where
p.id = g.id ORDER BY g.id;
{code}
produces:
{code}
id | class | estimated_class
----+------------+-----------------
1 | Don't Play | Play
2 | Don't Play | Play
3 | Play | Play
4 | Play | Play
5 | Play | Play
6 | Don't Play | Play
7 | Play | Play
8 | Don't Play | Play
9 | Play | Play
10 | Play | Play
11 | Play | Play
12 | Play | Play
13 | Play | Play
14 | Don't Play | Play
(14 rows)
{code}
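To summarize the comparison rather than read it row by row, a tally of actual vs. predicted classes could be run. This is a sketch that assumes the prediction_results and dt_golf tables from the steps above; the column alias n is illustrative.
{code}
-- Count how many rows fall into each (actual, predicted) combination.
SELECT g.class, p.estimated_class, count(*) AS n
FROM prediction_results p
JOIN dt_golf g ON p.id = g.id
GROUP BY g.class, p.estimated_class
ORDER BY g.class, p.estimated_class;
{code}
For the 14 rows shown above, this corresponds to 9 'Play' rows predicted as 'Play' and 5 'Don''t Play' rows predicted as 'Play'. The point of this dummy case is that tree_predict completes without error, not that the model is accurate.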
Before this fix, the error would have been:
{code}
InternalError: (psycopg2.InternalError) plpy.Error: Decision tree error:
Missing columns in predict data table (dt_golf) that were used during training
CONTEXT: Traceback (most recent call last):
PL/Python function "tree_predict", line 19, in <module>
return decision_tree.tree_predict(**globals())
PL/Python function "tree_predict", line 1752, in tree_predict
PL/Python function "tree_predict", line 75, in _assert
PL/Python function "tree_predict"
[SQL: "SELECT madlib.tree_predict('train_output', -- tree model\n
'dt_golf', -- new data table\n
'prediction_results', -- output table\n
'response'); -- show prediction"]
{code}
> DT predict fails with large feature vector arrays
> -------------------------------------------------
>
> Key: MADLIB-1173
> URL: https://issues.apache.org/jira/browse/MADLIB-1173
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Decision Tree
> Reporter: Rahul Iyer
> Fix For: v1.13
>
>
> Since decision trees can now take in arrays, it’s possible to train a model
> with more than 1600 (table column limit) features. For example, we can
> assemble a feature_array field with 2000 elements, and pass that into
> tree_train as the independent variable. The model trains successfully and
> produces the expected model and model_summary tables.
>
> But then when trying to use that model table to make predictions using
> tree_predict, it gets the following error:
>
> {code}
> ERROR: plpy.Error: Decision tree error: Missing columns in predict data
> table (tbl_test_1_data_final) that were used during training (plpython.c:4656)
> CONTEXT: Traceback (most recent call last):
> PL/Python function "tree_predict", line 19, in <module>
> return decision_tree.tree_predict(**globals())
> PL/Python function "tree_predict", line 1752, in tree_predict
> PL/Python function "tree_predict", line 75, in _assert
> PL/Python function "tree_predict"
> {code}
> We think what’s happening (this is only a guess) is that the tree_train
> function correctly makes use of the passed features, even if they number in
> excess of 1600, but tree_predict is still for some reason limiting the
> feature count to 1600. So when it tries to run the predictions, it sees that
> there are 2000 expected features according to the model table, but
> tree_predict has limited the new_data_table to only 1600 features. It then
> sees a discrepancy between the 2000 expected by the model and the 1600 that
> it has perceived on the new_data_table, and gets the above error.
>
> We’ve observed the error only happens when over 1600 features are used. If,
> for example, we trained a model with 900 features, it would be able to predict
> successfully. If we train a model with 2000 features, it gets the above
> error.
>