Rahul Iyer created MADLIB-1173:
----------------------------------

             Summary: Using feature vectors with DT
                 Key: MADLIB-1173
                 URL: https://issues.apache.org/jira/browse/MADLIB-1173
             Project: Apache MADlib
          Issue Type: Bug
          Components: Module: Decision Tree
            Reporter: Rahul Iyer
             Fix For: v2.0


Since decision trees can now take arrays as input, it’s possible to train a 
model with more than 1600 features (1600 being the table column limit).  For 
example, we can assemble a feature_array field with 2000 elements and pass it 
to tree_train as the independent variable.  The model trains successfully and 
produces the expected model and model_summary tables.
 
However, using that model table to make predictions with tree_predict fails 
with the following error:
 
{code}
ERROR:  plpy.Error: Decision tree error: Missing columns in predict data table 
(tbl_test_1_data_final) that were used during training (plpython.c:4656)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "tree_predict", line 19, in <module>
   return decision_tree.tree_predict(**globals())
  PL/Python function "tree_predict", line 1752, in tree_predict
  PL/Python function "tree_predict", line 75, in _assert
PL/Python function "tree_predict"
 {code}
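
The steps described above could be reproduced roughly as follows. This is only 
a sketch: the table names (dt_wide_train, dt_wide_model, dt_wide_pred), the row 
count, and the random test data are hypothetical and not taken from the report. 
It assumes a PostgreSQL database with MADlib installed in the madlib schema and 
the documented argument order of tree_train and tree_predict with optional 
arguments omitted. The script only prints the SQL, which would then be run in 
psql.

```python
# Hypothetical names throughout: dt_wide_train, dt_wide_model, dt_wide_pred.
N_FEATURES = 2000  # above the 1600-column table limit mentioned in the report


def repro_statements(n_features=N_FEATURES, n_rows=100):
    """Return SQL statements that mirror the steps in the report."""
    # Build one array column of n_features random values per row.
    # "+ 0 * id" ties the subquery to the row so it is re-evaluated per row.
    create = (
        "CREATE TABLE dt_wide_train AS "
        "SELECT id, (id % 2) AS label, "
        "(SELECT array_agg(random() + 0 * id) "
        f"FROM generate_series(1, {n_features})) AS feature_array "
        f"FROM generate_series(1, {n_rows}) AS id;"
    )
    # Train on the array column as the independent variable.
    train = (
        "SELECT madlib.tree_train("
        "'dt_wide_train', 'dt_wide_model', 'id', 'label', 'feature_array');"
    )
    # Predict on the same table; the report says this step raises the error.
    predict = (
        "SELECT madlib.tree_predict("
        "'dt_wide_model', 'dt_wide_train', 'dt_wide_pred', 'response');"
    )
    return [create, train, predict]


if __name__ == "__main__":
    for stmt in repro_statements():
        print(stmt)
```

Calling repro_statements(n_features=900) yields the same statements for the 
case that is expected to predict successfully.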

We think (this is only a guess) that tree_train correctly uses all of the 
passed features, even when they number more than 1600, while tree_predict for 
some reason still limits the feature count to 1600.  When it runs the 
predictions, the model table says 2000 features are expected, but tree_predict 
has limited new_data_table to only 1600.  It then sees a discrepancy between 
the 2000 features expected by the model and the 1600 it perceives in 
new_data_table, and raises the above error.
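
To make the guess concrete, here is a toy model of the suspected mismatch. It 
is not MADlib code: missing_features, check_predict_input, and the per-element 
feature names are hypothetical, and the truncation to 1600 is simply the 
behaviour we suspect, not something we have confirmed in the source.

```python
COLUMN_LIMIT = 1600  # table column limit cited in the report


def missing_features(expected, visible):
    """Return the features the model expects that the predict side cannot see."""
    visible_set = set(visible)
    return [name for name in expected if name not in visible_set]


def check_predict_input(expected, visible):
    """Raise if any training feature is absent, echoing the reported error text."""
    missing = missing_features(expected, visible)
    if missing:
        raise ValueError(
            "Missing columns in predict data table that were used during "
            f"training ({len(missing)} of {len(expected)} not found)"
        )


# Hypothetical feature names: one per element of feature_array.
expected = [f"feature_array[{i}]" for i in range(1, 2001)]
# The guess: the predict side only perceives the first 1600 of them.
visible = expected[:COLUMN_LIMIT]

print(len(missing_features(expected, visible)))  # prints 400
```

With 900 features every expected name falls within the first 1600, so the 
check passes, which matches what we observe.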
 
We’ve observed that the error happens only when more than 1600 features are 
used.  If, for example, we trained a model with 900 features, it would be able 
to predict successfully.  If we train a model with 2000 features, it gets the 
above error.
 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
