[ 
https://issues.apache.org/jira/browse/MADLIB-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267857#comment-16267857
 ] 

Frank McQuillan commented on MADLIB-1173:
-----------------------------------------

Random forest (RF) predict does not fail on large feature vectors because its 
input check is not done in the same way as in decision tree (DT).  We can leave 
RF as is for now; it just means that its error message will not be as clear as 
it could be.

> DT predict fails with large feature vector arrays
> -------------------------------------------------
>
>                 Key: MADLIB-1173
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1173
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Rahul Iyer
>             Fix For: v1.13
>
>
> Since decision trees can now take in arrays, it’s possible to train a model 
> with more than 1600 (table column limit) features.  For example, we can 
> assemble a feature_array field with 2000 elements, and pass that into 
> tree_train as the independent variable.  The model trains successfully and 
> produces the expected model and model_summary tables.
>  
> But when we then use that model table to make predictions with 
> tree_predict, we get the following error:
>  
> {code}
> ERROR:  plpy.Error: Decision tree error: Missing columns in predict data 
> table (tbl_test_1_data_final) that were used during training (plpython.c:4656)
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "tree_predict", line 19, in <module>
>    return decision_tree.tree_predict(**globals())
>   PL/Python function "tree_predict", line 1752, in tree_predict
>   PL/Python function "tree_predict", line 75, in _assert
> PL/Python function "tree_predict"
> {code}
> We think what’s happening (this is only a guess) is that the tree_train 
> function correctly makes use of the passed features, even if they number in 
> excess of 1600, but tree_predict is still for some reason limiting the 
> feature count to 1600.  So when it tries to run the predictions, it sees that 
> there are 2000 expected features according to the model table, but 
> tree_predict has limited the new_data_table to only 1600 features.  It then 
> sees a discrepancy between the 2000 expected by the model and the 1600 that 
> it has perceived on the new_data_table, and gets the above error.
>  
> We’ve observed the error only happens when over 1600 features are used.  If, 
> for example, we trained a model with 900 features, it would be able to predict 
> successfully.  If we train a model with 2000 features, it gets the above 
> error.
>  
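
The mismatch guessed at in the description can be sketched roughly as below. 
This is only an illustration of the reporter's hypothesis, not MADlib's actual 
code: the names visible_features, check_predict_features and 
predict_would_fail are made up for this sketch, and the assumption that 
predict only "sees" the first 1600 features is the unconfirmed guess from the 
report.

```python
# Illustrative only: models the reporter's guess, not MADlib's real logic.
TABLE_COLUMN_LIMIT = 1600  # table column limit cited in the report


def visible_features(features, limit=TABLE_COLUMN_LIMIT):
    """Hypothetical: predict only 'sees' the first `limit` features."""
    return features[:limit]


def check_predict_features(expected, available):
    """Raise if any feature used in training is missing at predict time."""
    missing = set(expected) - set(available)
    if missing:
        raise ValueError(
            "Missing columns in predict data table that were used during "
            "training ({} missing)".format(len(missing)))


def predict_would_fail(n_features):
    """Return True if the hypothetical check rejects n_features features."""
    trained = ["f{}".format(i) for i in range(n_features)]
    try:
        check_predict_features(trained, visible_features(trained))
    except ValueError:
        return True
    return False
```

Under that assumption, predict_would_fail(900) is False and 
predict_would_fail(2000) is True, which matches the behaviour observed above.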



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
