[ 
https://issues.apache.org/jira/browse/MADLIB-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252726#comment-16252726
 ] 

ASF GitHub Bot commented on MADLIB-1173:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/201

    Allow array feature with more than 1664 entries

    JIRA: MADLIB-1173
    
    The tree_predict function concatenates cat_feature_str and
    con_feature_str in summary table to obtain the feature string. This
    contains individual elements of any array feature. The concatenated
    string is used in a SELECT operation for validation, which limits the
    number of target entries to 1664. To allow arrays with more than 1664
    entries, the validation has been updated to use the original feature
    string, which contains the array feature name instead of its indexed
    elements.
    
    Closes #201

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib bugfix/dt_array_features

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/201.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #201
    
----
commit fc374cc854d957923a713901900aacd306527a77
Author: Rahul Iyer <[email protected]>
Date:   2017-11-13T18:49:29Z

    DT: Consolidate tree_rmse and tree_misclassified

commit feb9781aa1c6fee52eb94e6863998f8aadd77217
Author: Rahul Iyer <[email protected]>
Date:   2017-11-13T21:42:10Z

    DT: Validate original feature string in tree_predict
    
    JIRA: MADLIB-1173
    
    The tree_predict function concatenates cat_feature_str and
    con_feature_str in summary table to obtain the feature string. This
    contains individual elements of any array feature. The concatenated
    string is used in a SELECT operation for validation, which limits the
    number of target entries to 1664. To allow arrays with more than 1664
    entries, the validation has been updated to use the original feature
    string, which contains the array feature name instead of its indexed
    elements.
    
    Closes #201

----
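
As a rough illustration of the limit described in the commit message above,
the following Python sketch compares how many SELECT target entries the
expanded feature string produces with how many the original feature string
produces. The helper names, the table name, and the shape of the validation
query are assumptions made for illustration and are not taken from the MADlib
source; 1664 is the target-entry limit quoted above.

```python
# Limit on the number of target entries in a SELECT, as quoted in the
# commit message above.
MAX_TARGET_ENTRIES = 1664


def expanded_feature_list(array_name, n_elements):
    """Hypothetical helper: list the indexed elements of an array feature.

    Uses 1-based indexing, as SQL arrays do.
    """
    return ["{0}[{1}]".format(array_name, i) for i in range(1, n_elements + 1)]


def validation_query(table, features):
    """Hypothetical validation query: select the features without fetching rows."""
    return "SELECT {0} FROM {1} LIMIT 0".format(", ".join(features), table)


# Feature list as rebuilt from the indexed elements (assumed behavior
# before the fix): one target entry per array element.
expanded = expanded_feature_list("feature_array", 2000)

# Original feature string (behavior after the fix): the array is one entry.
original = ["feature_array"]

expanded_exceeds = len(expanded) > MAX_TARGET_ENTRIES
original_exceeds = len(original) > MAX_TARGET_ENTRIES

print("expanded entries:", len(expanded), "exceeds limit:", expanded_exceeds)
print("original entries:", len(original), "exceeds limit:", original_exceeds)
print(validation_query("tbl_test", original))
```

With a 2000-element array the expanded list is over the limit, while the
original feature string contributes a single target entry regardless of the
array length.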


> Using feature vectors with DT
> -----------------------------
>
>                 Key: MADLIB-1173
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1173
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Rahul Iyer
>             Fix For: v1.13
>
>
> Since decision trees can now take in arrays, it’s possible to train a model 
> with more than 1600 (table column limit) features.  For example, we can 
> assemble a feature_array field with 2000 elements, and pass that into 
> tree_train as the independent variable.  The model trains successfully and 
> produces the expected model and model_summary tables.
>  
> But then when trying to use that model table to make predictions using 
> tree_predict, it gets the following error:
>  
> {code}
> ERROR:  plpy.Error: Decision tree error: Missing columns in predict data 
> table (tbl_test_1_data_final) that were used during training (plpython.c:4656)
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "tree_predict", line 19, in <module>
>    return decision_tree.tree_predict(**globals())
>   PL/Python function "tree_predict", line 1752, in tree_predict
>   PL/Python function "tree_predict", line 75, in _assert
> PL/Python function "tree_predict"
>  {code}
> We think what’s happening (this is only a guess) is that the tree_train 
> function correctly makes use of the passed features, even if they number in 
> excess of 1600, but tree_predict is still for some reason limiting the 
> feature count to 1600.  So when it tries to run the predictions, it sees that 
> there are 2000 expected features according to the model table, but 
> tree_predict has limited the new_data_table to only 1600 features.  It then 
> sees a discrepancy between the 2000 expected by the model and the 1600 that 
> it has perceived on the new_data_table, and gets the above error.
>  
> We’ve observed the error only happens when over 1600 features are used.  If, 
> for example, we trained a model with 900 features, it would be able to predict 
> successfully.  If we train a model with 2000 features, it gets the above 
> error.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
