[
https://issues.apache.org/jira/browse/MADLIB-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Orhan Kislal updated MADLIB-1267:
---------------------------------
Fix Version/s: v3.0
(was: v2.1)
> General predict function for PL/Python
> --------------------------------------
>
> Key: MADLIB-1267
> URL: https://issues.apache.org/jira/browse/MADLIB-1267
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v3.0
>
>
> Context
> Follow on from https://www.pivotaltracker.com/story/show/158990284
> Story
> `As a data scientist`
> I want to call a generic PL/Python UDF from SQL to predict
> `so that`
> I can use any code I write, or any Python library, for prediction.
> Interface
> {code}
> predict(
>     model_table,                 -- model output table
>     data_table,                  -- data table to predict
>     list_of_columns,             -- columns you want in GD, could be '*' needed???
>     list_of_columns_to_exclude,  -- columns to explicitly exclude needed???
>     predict_udf,                 -- plpython UDF to predict
>     predict_udf_parameters,      -- parameters for UDF, if any
>     grouping_cols                -- groups to build separate models for (source table distributed by this grouping) needed???
> );
> {code}
> Arguments
> {code}
> data_table
> TEXT. Name of the table containing the data to predict on (called `source_table` in earlier drafts).
> model_table
> TEXT. Name of the table containing the model(s), with one row per group.
> list_of_columns
> TEXT. Comma-separated string of column names or expressions to load.
> Can also be '*', meaning all columns are loaded except those listed in the
> next argument (the exclusions). The column types can be mixed.
> Array columns can also be included in the list and will be loaded as is
> (i.e., not flattened). (???)
> list_of_columns_to_exclude
> TEXT. Comma-separated string of column names to exclude from load.
> Typically used when 'list_of_columns' is set to '*'.
> predict_udf
> TEXT. Name of the PL/Python UDF that performs the prediction.
> predict_udf_parameters (optional)
> TEXT. Parameters to pass to the UDF, if any.
> grouping_cols (optional)
> TEXT, default: NULL. Comma-separated list of column names to group the data by.
> This will produce multiple models, one for each group.
> {code}
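To make the intended semantics of `list_of_columns` and `list_of_columns_to_exclude` concrete, here is a minimal sketch in plain Python. The helper name `resolve_columns` and the `table_columns` argument are hypothetical and not part of the proposal. It also assumes a naive split on commas, which would break for expressions that themselves contain commas (for example function calls), so a real implementation would need a proper parser.

```python
def resolve_columns(table_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Return the columns to load.

    table_columns: column names of the data table, in table order (assumed input).
    list_of_columns: comma-separated names, or '*' for every column.
    list_of_columns_to_exclude: optional comma-separated names to drop.
    """
    excluded = set()
    if list_of_columns_to_exclude:
        excluded = {c.strip() for c in list_of_columns_to_exclude.split(",") if c.strip()}

    if list_of_columns.strip() == "*":
        # '*' expands to every column of the table, in table order.
        requested = list(table_columns)
    else:
        # Naive split: fine for plain names, not for expressions containing commas.
        requested = [c.strip() for c in list_of_columns.split(",") if c.strip()]

    # Exclusions apply in both cases, although they are mostly useful with '*'.
    return [c for c in requested if c not in excluded]
```

For example, with a table whose columns are `id, x1, x2, y`, passing `'*'` and excluding `'id, y'` would leave `x1` and `x2`.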
> Open questions
> 1) Do we need separate predict functions for R and Python, or can we
> autodetect?
> If we need separate ones, we could call this module `predict_plpythonu` and the
> R one `predict_plr`.
> 2) Do we need `list_of_columns` and `list_of_columns_to_exclude`,
> or can we assume the columns are the same as in the training table?
> 3) Scoring should be embarrassingly parallel, so do we need `grouping_cols`
> in the predict function?
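On open question 1, one possible way to autodetect would be to look up the procedural language of `predict_udf` in the PostgreSQL catalog and dispatch on it. The sketch below is only an illustration under that assumption. The mapping and the helper `choose_predict_module` are hypothetical, the query is shown as a string and is not executed here, and an overloaded UDF name could return more than one row.

```python
# Hypothetical mapping from a UDF's procedural language (pg_language.lanname)
# to the predict module that would handle it. The module names come from
# open question 1; nothing here exists in MADlib yet.
PREDICT_MODULES = {
    "plpythonu": "predict_plpythonu",
    "plr": "predict_plr",
}

# Catalog lookup a driver could run to find the UDF's language.
# pg_proc.prolang references pg_language.oid in PostgreSQL.
LANGUAGE_QUERY = """
SELECT l.lanname
FROM pg_proc p
JOIN pg_language l ON l.oid = p.prolang
WHERE p.proname = %s
"""


def choose_predict_module(lanname):
    """Pick the predict module for a language name returned by LANGUAGE_QUERY."""
    try:
        return PREDICT_MODULES[lanname.lower()]
    except KeyError:
        raise ValueError("no predict module for language: " + lanname)
```

If this kind of lookup turns out to be reliable, a single `predict` entry point could hide the language-specific modules from the user.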
> Notes
> 1) scikit-learn uses the term `predict`. Keras has both `predict` and `evaluate`
> (the latter computes loss and metrics), so I think `predict` is better.
> Acceptance
> 1) Generate a model table for a sample data set with multiple groups using a
> scikit-learn model. Use this predict function to score some test data.
> 2) Repeat for Keras/TF.
> 3) Repeat for XGBoost.
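As a rough picture of what the acceptance flow might look like, here is a small in-memory sketch: one model per group in a "model table", then row-by-row scoring through a user-supplied function. Everything in it is an assumption made for illustration. The tables are plain Python dicts and lists, the "model" is a trivial per-group mean serialized as JSON where a scikit-learn, Keras or XGBoost model would go, and the function signatures are not the proposed SQL interface.

```python
import json


def fit_mean_model(labels):
    """Stand-in for real training: the 'model' is just the mean label, stored as JSON."""
    return json.dumps({"mean": sum(labels) / len(labels)})


def build_model_table(rows, group_col, label_col):
    """Return {group value: serialized model}, i.e. one row per group."""
    labels_by_group = {}
    for row in rows:
        labels_by_group.setdefault(row[group_col], []).append(row[label_col])
    return {g: fit_mean_model(labels) for g, labels in labels_by_group.items()}


def predict(model_table, data_rows, group_col, predict_udf):
    """Score each row with its own group's model.

    Rows are scored independently of one another; the grouping column is
    only used to look up which model applies to a row.
    """
    models = {g: json.loads(blob) for g, blob in model_table.items()}
    return [predict_udf(models[row[group_col]], row) for row in data_rows]


def mean_udf(model, row):
    """Hypothetical user-supplied predict function for the stand-in model."""
    return model["mean"]


train = [
    {"grp": "a", "x": 1.0, "y": 10.0},
    {"grp": "a", "x": 2.0, "y": 20.0},
    {"grp": "b", "x": 1.0, "y": 100.0},
]
test = [{"grp": "a", "x": 5.0}, {"grp": "b", "x": 5.0}]

model_table = build_model_table(train, "grp", "y")
scores = predict(model_table, test, "grp", mean_udf)
```

The sketch also bears on open question 3: since each row only needs its group's model, scoring stays embarrassingly parallel, and the grouping columns act as a join key between the data table and the model table.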