[ https://issues.apache.org/jira/browse/MADLIB-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267857#comment-16267857 ]
Frank McQuillan commented on MADLIB-1173:
-----------------------------------------
RF (random forest) predict does not fail on large feature vectors because its
input check is not done in the same way as in DT (decision tree). We can leave
RF as is for now; it just means that the error message will not be as clear as
it could be.
> DT predict fails with large feature vector arrays
> -------------------------------------------------
>
> Key: MADLIB-1173
> URL: https://issues.apache.org/jira/browse/MADLIB-1173
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Decision Tree
> Reporter: Rahul Iyer
> Fix For: v1.13
>
>
> Since decision trees can now take in arrays, it’s possible to train a model
> with more than 1600 (table column limit) features. For example, we can
> assemble a feature_array field with 2000 elements, and pass that into
> tree_train as the independent variable. The model trains successfully and
> produces the expected model and model_summary tables.
>
> But when we then use that model table to make predictions with
> tree_predict, it fails with the following error:
>
> {code}
> ERROR: plpy.Error: Decision tree error: Missing columns in predict data
> table (tbl_test_1_data_final) that were used during training (plpython.c:4656)
> CONTEXT: Traceback (most recent call last):
> PL/Python function "tree_predict", line 19, in <module>
> return decision_tree.tree_predict(**globals())
> PL/Python function "tree_predict", line 1752, in tree_predict
> PL/Python function "tree_predict", line 75, in _assert
> PL/Python function "tree_predict"
> {code}
> Our guess (unconfirmed) is that tree_train correctly uses all of the passed
> features, even when they number more than 1600, but tree_predict for some
> reason still limits the feature count to 1600. The model table then expects
> 2000 features while tree_predict perceives only 1600 in the new_data_table,
> and that discrepancy triggers the above error.
>
> We’ve observed that the error happens only when more than 1600 features are
> used. For example, a model trained with 900 features is able to predict
> successfully, while a model trained with 2000 features gets the above
> error.
>
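The mechanism guessed at in the quoted description can be sketched in a few lines of Python. This is only an illustration of the reporter's unconfirmed hypothesis, not MADlib source code: the names `MAX_TABLE_COLUMNS`, `perceived_feature_count`, and `check_predict_features` are hypothetical, and the idea that predict caps the feature count at the column limit is an assumption taken from the report.

```python
# Hypothetical illustration of the reporter's guess; not MADlib source code.
MAX_TABLE_COLUMNS = 1600  # table column limit cited in the report


def perceived_feature_count(actual_features, cap=MAX_TABLE_COLUMNS):
    """Feature count predict would see if it capped at the column limit (assumed)."""
    return min(actual_features, cap)


def check_predict_features(trained_features, test_features):
    """Mimic a consistency check between the model and the predict table."""
    seen = perceived_feature_count(test_features)
    if seen != trained_features:
        raise ValueError(
            "Missing columns in predict data table that were used during "
            f"training (expected {trained_features}, perceived {seen})"
        )
    return seen


# 900 features stay under the assumed cap, so the check passes.
check_predict_features(900, 900)

# 2000 features exceed the assumed cap, so the check fails as in the report.
try:
    check_predict_features(2000, 2000)
except ValueError as err:
    print(err)
```

Under this assumption, any feature count up to 1600 passes and anything larger fails, which matches the 900 versus 2000 observation in the report.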