Rahul Iyer created MADLIB-1173:
----------------------------------
Summary: Using feature vectors with DT
Key: MADLIB-1173
URL: https://issues.apache.org/jira/browse/MADLIB-1173
Project: Apache MADlib
Issue Type: Bug
Components: Module: Decision Tree
Reporter: Rahul Iyer
Fix For: v2.0
Since decision trees can now take in arrays, it’s possible to train a model
with more than 1600 (table column limit) features. For example, we can
assemble a feature_array field with 2000 elements, and pass that into
tree_train as the independent variable. The model trains successfully and
produces the expected model and model_summary tables.
But when that model table is then used to make predictions with
tree_predict, the call fails with the following error:
{code}
ERROR: plpy.Error: Decision tree error: Missing columns in predict data table
(tbl_test_1_data_final) that were used during training (plpython.c:4656)
CONTEXT: Traceback (most recent call last):
PL/Python function "tree_predict", line 19, in <module>
return decision_tree.tree_predict(**globals())
PL/Python function "tree_predict", line 1752, in tree_predict
PL/Python function "tree_predict", line 75, in _assert
PL/Python function "tree_predict"
{code}
We suspect (this is only a guess) that tree_train correctly uses all of the
passed features, even when they number more than 1600, but tree_predict for
some reason still limits the feature count to 1600. When it runs the
predictions, the model table says 2000 features are expected, while
tree_predict has limited new_data_table to only 1600. It then sees a
discrepancy between the 2000 features expected by the model and the 1600 it
perceives in new_data_table, and raises the above error.
We’ve observed that the error only happens when more than 1600 features are
used. If, for example, we trained a model with 900 features, it would be able
to predict successfully. If we train a model with 2000 features, it gets the
above error.
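To make the guess above concrete, here is a minimal Python sketch of the suspected behaviour. It is not MADlib code: the function names, the feature naming scheme, and the idea that the predict side truncates at exactly 1600 are all assumptions made for illustration. It only shows how a silent cap on the predict side would reproduce the pattern we observed (900 features pass, 2000 features fail with a "missing columns" error).

```python
# Hypothetical model of the suspected failure; none of these names come from MADlib.
COLUMN_LIMIT = 1600  # table column limit mentioned in this report


def expected_features(n_features):
    """Feature names the trained model would record (assumed naming scheme)."""
    return ["feature_array[%d]" % i for i in range(1, n_features + 1)]


def features_seen_by_predict(n_features, limit=COLUMN_LIMIT):
    """Features the predict step would see if it silently truncated at `limit`."""
    return expected_features(n_features)[:limit]


def check_predict_columns(n_features):
    """Raise if any feature used in training is absent on the predict side."""
    expected = expected_features(n_features)
    seen = set(features_seen_by_predict(n_features))
    missing = [name for name in expected if name not in seen]
    if missing:
        raise ValueError(
            "Missing columns in predict data table: %d of %d features"
            % (len(missing), len(expected))
        )
    return True
```

Under these assumptions, check_predict_columns(900) returns True, while check_predict_columns(2000) raises because 400 of the expected features are never seen. If the real cause is different, the actual check in tree_predict will not look like this.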