[
https://issues.apache.org/jira/browse/MADLIB-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Orhan Kislal updated MADLIB-1266:
---------------------------------
Fix Version/s: v3.0
(was: v2.1)
> General fit function for PL/Python
> ----------------------------------
>
> Key: MADLIB-1266
> URL: https://issues.apache.org/jira/browse/MADLIB-1266
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v3.0
>
>
> Story
> `As a data scientist`
> I want to call a generic PL/Python UDF from SQL to fit a model
> `so that`
> I can use any code I write, or any Python library, for model building.
> Interface
> {code}
> fit(
> source_table, -- source table
> model_table, -- model output table
> list_of_columns, -- columns you want in GD, could be '*'
> list_of_columns_to_exclude, -- columns to explicitly exclude
> fit_udf, -- plpython UDF to fit model
> fit_udf_parameters, -- parameters for UDF, if any
> grouping_cols -- groups to build separate models for (source table distributed by this grouping)
> );
> {code}
> Arguments
> {code}
> source_table
> TEXT. Name of the table containing the data to load.
> model_table
> TEXT. Name of the table containing the model(s), with one row per group.
> list_of_columns
> TEXT. Comma-separated string of column names or expressions to load.
> Can also be '*', implying all columns are to be loaded (except for the ones
> included in the next argument, which lists exclusions). The types of the
> columns can be mixed.
> Array columns can also be included in the list and will be loaded as is
> (i.e., not flattened). (???)
> list_of_columns_to_exclude
> TEXT. Comma-separated string of column names to exclude from load.
> Typically used when 'list_of_columns' is set to '*'.
> fit_udf
> TEXT. plpython UDF to fit model.
> fit_udf_parameters (optional)
> TEXT. parameters for UDF, if any
> grouping_cols (optional)
> TEXT, default: NULL. Comma-separated list of column names to group the data by.
> This will produce multiple models, one for each group.
> {code}
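A minimal sketch of how the `list_of_columns` / `list_of_columns_to_exclude` pair described above might be resolved, in plain Python. The helper name `resolve_columns` and the `table_columns` argument are hypothetical; a real implementation would presumably read the column names from the database catalog. Splitting on commas is a simplification that would break for expressions containing commas.

```python
def resolve_columns(list_of_columns, list_of_columns_to_exclude, table_columns):
    """Return the columns to load, preserving their order.

    table_columns is a stand-in (assumption) for the source table's column
    names, which the real module would look up in the database.
    """
    # Exclusions are optional; treat None or '' as "exclude nothing".
    excluded = {
        name.strip()
        for name in (list_of_columns_to_exclude or "").split(",")
        if name.strip()
    }

    if list_of_columns.strip() == "*":
        # '*' means every column of the source table.
        candidates = list(table_columns)
    else:
        # Naive comma split: fine for plain names, not for expressions
        # that themselves contain commas.
        candidates = [c.strip() for c in list_of_columns.split(",") if c.strip()]

    return [c for c in candidates if c not in excluded]


columns = resolve_columns("*", "id", ["id", "x1", "x2", "y"])
```

Here `'*'` combined with an exclusion of `id` leaves the remaining three columns in their original order.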
> Open questions
> 1) Do we need separate fit functions for R and Python, or can we autodetect?
> If we need separate ones, we could call this module `fit_plpythonu` and the R
> one `fit_plr`.
> Notes
> 1) Both Keras and scikit-learn use the term `fit`, which seems better than `train`.
> (We will use the term `predict` for prediction in a separate story.)
> Acceptance
> 1) Generate a model table for a sample data set with multiple groups using a
> scikit-learn model.
> 2) Repeat for Keras/TF.
> 3) Repeat for XGBoost.
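To make the "one row per group" model table concrete, here is a rough sketch of the grouping step in plain Python, outside the database. Everything in it is an assumption for illustration: `fit_per_group` and `mean_model` are hypothetical names, rows are modeled as dicts, and the toy "model" is just a mean so the example runs without scikit-learn, Keras or XGBoost. A real fit UDF would return a fitted estimator (probably serialized) instead.

```python
from collections import defaultdict


def fit_per_group(rows, grouping_cols, fit_fn, fit_params=None):
    """Fit one model per distinct combination of grouping column values.

    Returns a list of dicts, one per group, holding the group key columns
    plus a 'model' entry. This mimics the proposed model table layout.
    """
    groups = defaultdict(list)
    for row in rows:
        # An empty grouping_cols list yields a single group with key ().
        key = tuple(row[col] for col in grouping_cols)
        groups[key].append(row)

    model_rows = []
    for key, group_rows in groups.items():  # first-seen group order
        model = fit_fn(group_rows, **(fit_params or {}))
        model_row = dict(zip(grouping_cols, key))
        model_row["model"] = model
        model_rows.append(model_row)
    return model_rows


def mean_model(rows, target="y"):
    """Toy stand-in for a user-supplied fit function: the target mean."""
    values = [row[target] for row in rows]
    return sum(values) / len(values)


sample = [
    {"g": "a", "y": 1.0},
    {"g": "b", "y": 10.0},
    {"g": "a", "y": 3.0},
]
model_table = fit_per_group(sample, ["g"], mean_model, {"target": "y"})
```

Swapping `mean_model` for a function that fits and returns a scikit-learn estimator would be the natural next step for acceptance item 1, with the same shape of output.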
--
This message was sent by Atlassian Jira
(v8.20.10#820010)