[ https://issues.apache.org/jira/browse/MADLIB-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Orhan Kislal updated MADLIB-1267:
---------------------------------
    Fix Version/s: v3.0
                       (was: v2.1)

> General predict function for PL/Python
> --------------------------------------
>
>                 Key: MADLIB-1267
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1267
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v3.0
>
>
> Context
> Follow on from https://www.pivotaltracker.com/story/show/158990284
>
> Story
> `As a data scientist`
> I want to call a generic PL/Python UDF from SQL to predict
> `so that`
> I can use any code I write, or any Python library, for prediction.
>
> Interface
> {code}
> predict(
>     model_table,                 -- model output table
>     data_table,                  -- data table to predict
>     list_of_columns,             -- columns you want in GD, could be '*'  needed???
>     list_of_columns_to_exclude,  -- columns to explicitly exclude  needed???
>     predict_udf,                 -- plpython UDF to predict
>     predict_udf_parameters,      -- parameters for UDF, if any
>     grouping_cols                -- groups to build separate models for
>                                  -- (source table distributed by this grouping)  needed???
> );
> {code}
>
> Arguments
> {code}
> source_table
> TEXT. Name of the table containing the data to load.
> model_table
> TEXT. Name of the table containing the model(s), with one row per group.
> list_of_columns
> TEXT. Comma-separated string of column names or expressions to load.
> Can also be '*', implying all columns are to be loaded (except for the ones
> included in the next argument that lists exclusions). The types of the
> columns can be mixed. Array columns can also be included in the list and
> will be loaded as is (i.e., not be flattened). (???)
> list_of_columns_to_exclude
> TEXT. Comma-separated string of column names to exclude from load.
> Typically used when 'list_of_columns' is set to '*'.
> predict_udf
> TEXT. plpython UDF to predict.
> predict_udf_parameters (optional)
> TEXT. Parameters for UDF, if any.
> grouping_cols (optional)
> TEXT, default: NULL. Comma-separated list of column names to group the data by.
> This will produce multiple models, one for each group.
> {code}
>
> Open questions
> 1) Do we need separate predict functions for R and Python, or can we
> autodetect? If we need separate ones, we could call this module
> `predict_plpythonu` and the R one would be `predict_plr`.
> 2) Do we need `list_of_columns` and `list_of_columns_to_exclude`,
> or can we assume they are the same as for the training table?
> 3) Scoring should be embarrassingly parallel, so do we need `grouping_cols`
> in the predict function?
>
> Notes
> 1) scikit-learn uses the term `predict` and Keras uses `evaluate`, but I think
> `predict` is better.
>
> Acceptance
> 1) Generate a model table for a sample data set with multiple groups using a
> scikit-learn model. Use this predict function to score some test data.
> 2) Repeat for Keras/TF.
> 3) Repeat for XGBoost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
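
To make the proposed interface easier to discuss, here is a rough pure-Python sketch of how the arguments could fit together. It is not MADlib code, and everything in it is an assumption for illustration only: the model table is stood in for by a dict mapping a group-key tuple to a pickled model, the data table by a list of row dicts, `predict_udf` by a Python callable taking `(model, features, **params)`, and `predict_udf_parameters` by a dict rather than TEXT. The helper `select_columns` and the example `linear_udf` are hypothetical names.

```python
import pickle


def select_columns(row, list_of_columns="*", list_of_columns_to_exclude=""):
    """Pick the feature columns from one row (hypothetical helper).

    '*' means every column, minus anything named in the exclusion list.
    """
    excluded = {c.strip() for c in list_of_columns_to_exclude.split(",") if c.strip()}
    if list_of_columns.strip() == "*":
        wanted = list(row.keys())
    else:
        wanted = [c.strip() for c in list_of_columns.split(",") if c.strip()]
    return {c: row[c] for c in wanted if c not in excluded}


def predict(model_table, data_table, list_of_columns, list_of_columns_to_exclude,
            predict_udf, predict_udf_parameters=None, grouping_cols=None):
    """Score every row with the model that belongs to the row's group."""
    params = predict_udf_parameters or {}
    group_cols = [c.strip() for c in (grouping_cols or "").split(",") if c.strip()]

    # Deserialize each group's model once, not once per row.
    models = {key: pickle.loads(blob) for key, blob in model_table.items()}

    results = []
    for row in data_table:
        key = tuple(row[c] for c in group_cols)  # () when there is no grouping
        features = select_columns(row, list_of_columns, list_of_columns_to_exclude)
        results.append(predict_udf(models[key], features, **params))
    return results


# Toy stand-ins: one "model" per group, stored as pickled coefficients.
model_table = {
    ("a",): pickle.dumps({"slope": 2.0, "intercept": 1.0}),
    ("b",): pickle.dumps({"slope": -1.0, "intercept": 0.0}),
}
data_table = [
    {"id": 1, "grp": "a", "x": 3.0},
    {"id": 2, "grp": "b", "x": 4.0},
]


def linear_udf(model, features, scale=1.0):
    """Example user-supplied predict function (hypothetical)."""
    return scale * (model["slope"] * features["x"] + model["intercept"])


predictions = predict(model_table, data_table, "*", "id, grp",
                      linear_udf, {"scale": 10.0}, "grp")
print(predictions)
```

In this sketch each row is scored independently once its group's model has been looked up, which is the sense in which open question 3 calls scoring embarrassingly parallel: `grouping_cols` is only used to select the model, not to partition the work.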