[ 
https://issues.apache.org/jira/browse/MADLIB-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Orhan Kislal updated MADLIB-1267:
---------------------------------
    Fix Version/s: v3.0
                       (was: v2.1)

> General predict function for PL/Python
> --------------------------------------
>
>                 Key: MADLIB-1267
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1267
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v3.0
>
>
> Context
> Follow on from https://www.pivotaltracker.com/story/show/158990284
> Story
> `As a data scientist`
> I want to call a generic PL/Python UDF from SQL to predict
> `so that`
> I can use any code I write, or any Python library, for prediction.
> Interface
> {code}
> predict(
>               model_table,                -- model output table
>               data_table,                 -- data table to predict
>               list_of_columns,            -- columns you want in GD, could be '*'   needed???
>               list_of_columns_to_exclude, -- columns to explicitly exclude   needed???
>               predict_udf,                -- plpython UDF to predict
>               predict_udf_parameters,     -- parameters for UDF, if any
>               grouping_cols               -- groups to build separate models for (source table distributed by this grouping)   needed???
>       );
> {code}
> Arguments
> {code}
> source_table
> TEXT. Name of the table containing the data to load.
> model_table
> TEXT. Name of the table containing the model(s), with one row per group.
> list_of_columns
> TEXT. Comma-separated string of column names or expressions to load.
> Can also be '*', implying all columns are to be loaded (except for the ones
> listed in the next argument, which lists exclusions). The types of the columns
> can be mixed.
> Array columns can also be included in the list and will be loaded as is
> (i.e., not flattened). (???)
> list_of_columns_to_exclude
> TEXT. Comma-separated string of column names to exclude from load. 
> Typically used when 'list_of_columns' is set to '*'.
> predict_udf
> TEXT.  plpython UDF to predict.
> predict_udf_parameters (optional)
> TEXT.  parameters for UDF, if any
> grouping_cols (optional)
> TEXT, default: NULL. Comma-separated list of column names to group the data by.
> This will produce multiple models, one for each group.
> {code}
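
A minimal sketch of how `list_of_columns` and `list_of_columns_to_exclude`
could interact, assuming the semantics described in the arguments above. The
helper name `resolve_columns` and the `table_columns` parameter are
hypothetical and not part of the ticket. The naive comma split is also an
assumption: it would not handle expressions that themselves contain commas.

```python
def resolve_columns(table_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Return the column names to load, preserving order.

    table_columns: all column names of the data table (hypothetical input).
    list_of_columns: comma-separated names, or '*' for every column.
    list_of_columns_to_exclude: comma-separated names to drop, or None.
    """
    excluded = set()
    if list_of_columns_to_exclude:
        excluded = {c.strip() for c in list_of_columns_to_exclude.split(",") if c.strip()}

    if list_of_columns.strip() == "*":
        # '*' means every column of the table, minus the exclusions.
        requested = list(table_columns)
    else:
        # Naive split: assumes no commas inside individual expressions.
        requested = [c.strip() for c in list_of_columns.split(",") if c.strip()]

    return [c for c in requested if c not in excluded]


if __name__ == "__main__":
    print(resolve_columns(["id", "x1", "x2", "y"], "*", "id, y"))
```

If open question 2 is resolved by reusing the training table's columns, a
helper like this would not be needed at all.
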
> Open questions
> 1) Do we need separate predict functions for R and Python, or can we
> autodetect?
> If we need separate ones, we could call this module `predict_plpythonu` and
> the R one `predict_plr`.
> 2) Do we need `list_of_columns` and `list_of_columns_to_exclude` 
> or assume it is the same as the training table?
> 3) Scoring should be embarrassingly parallel, so do we need `grouping_cols` 
> in the predict function?
> Notes
> 1) scikit-learn uses the term `predict` and Keras uses `evaluate`, but I think
> `predict` is better.
> Acceptance
> 1) Generate a model table for sample data set with multiple groups using a 
> scikit-learn model.  Use this predict function to score some test data.
> 2) Repeat for Keras/TF.
> 3) Repeat for XGBoost.
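
A rough sketch of the per-group scoring flow implied by the acceptance
criteria, written in plain Python so it runs outside the database. Everything
here is an assumption for illustration: the ticket does not say how models are
stored, so the sketch pretends each group's model is a pickled dict of linear
weights, and the names `predict_by_group` and `linear_predict_udf` are
hypothetical. A real model table would more likely hold a serialized
scikit-learn, Keras, or XGBoost model.

```python
import pickle
from collections import defaultdict


def linear_predict_udf(model, rows):
    """Toy stand-in for a user-supplied predict UDF (hypothetical)."""
    weights, bias = model["weights"], model["bias"]
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in rows]


def predict_by_group(model_table, data_table, predict_udf):
    """Score each group's rows with that group's model.

    model_table: dict mapping group key -> pickled model (one entry per group).
    data_table: iterable of (group key, feature list) pairs.
    predict_udf: callable taking (model, rows) and returning predictions.
    """
    # Bucket rows by group so each model is deserialized only once.
    grouped = defaultdict(list)
    for key, features in data_table:
        grouped[key].append(features)

    results = {}
    for key, rows in grouped.items():
        model = pickle.loads(model_table[key])
        results[key] = predict_udf(model, rows)
    return results


if __name__ == "__main__":
    models = {
        "a": pickle.dumps({"weights": [1.0, 2.0], "bias": 0.5}),
        "b": pickle.dumps({"weights": [0.0, 1.0], "bias": 0.0}),
    }
    data = [("a", [1.0, 1.0]), ("b", [3.0, 4.0]), ("a", [2.0, 0.0])]
    print(predict_by_group(models, data, linear_predict_udf))
```

Groups are scored independently of one another here, which is consistent with
the observation in open question 3 that scoring should be embarrassingly
parallel.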



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
