[ 
https://issues.apache.org/jira/browse/MADLIB-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Orhan Kislal updated MADLIB-1266:
---------------------------------
    Fix Version/s: v3.0
                       (was: v2.1)

> General fit function for PL/Python
> ----------------------------------
>
>                 Key: MADLIB-1266
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1266
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v3.0
>
>
> Story
> `As a data scientist`
> I want to call a generic PL/Python UDF from SQL to fit a model
> `so that`
> I can use any code I write, or any Python library, for model building.
> Interface
> {code}
> fit(
>     source_table,                -- source table
>     model_table,                 -- model output table
>     list_of_columns,             -- columns you want in GD, could be '*'
>     list_of_columns_to_exclude,  -- columns to explicitly exclude
>     fit_udf,                     -- plpython UDF to fit model
>     fit_udf_parameters,          -- parameters for UDF, if any
>     grouping_cols                -- groups to build separate models for
>                                  -- (source table distributed by this grouping)
> );
> {code}
> Arguments
> {code}
> source_table
> TEXT. Name of the table containing the data to load.
> model_table
> TEXT. Name of the table containing the model(s), with one row per group.
> list_of_columns
> TEXT. Comma-separated string of column names or expressions to load.
> Can also be '*', meaning all columns are loaded except those listed in the
> next argument (list_of_columns_to_exclude). The column types can be mixed.
> Array columns can also be included in the list and will be loaded as is
> (i.e., not flattened). (???)
> list_of_columns_to_exclude
> TEXT. Comma-separated string of column names to exclude from load. 
> Typically used when 'list_of_columns' is set to '*'.
> fit_udf
> TEXT. PL/Python UDF used to fit the model.
> fit_udf_parameters (optional)
> TEXT. Parameters for the UDF, if any.
> grouping_cols (optional)
> TEXT, default: NULL. Comma-separated list of column names to group the data by.
> This will produce multiple models, one for each group.
> {code}
> Open questions
> 1) Do we need separate fit functions for R and Python, or can we autodetect?
> If we need separate ones, we could call this module `fit_plpythonu` and the R
> one `fit_plr`.
> Notes
> 1) Both Keras and scikit-learn use the term `fit`, which seems better than
> `train`.
> (We will use the term `predict` for prediction in a separate story.)
> Acceptance
> 1) Generate a model table for a sample data set with multiple groups using a
> scikit-learn model.
> 2) Repeat for Keras/TF.
> 3) Repeat for XGBoost.
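
The argument semantics quoted above (column selection with '*' and
exclusions, then one model per group) might be sketched in plain Python as
follows. This is only an illustration under stated assumptions, not MADlib
code: `resolve_columns`, `fit_per_group` and `mean_model` are hypothetical
names, table rows are modelled as dicts, and fit_udf_parameters is taken as a
dict rather than TEXT. Splitting on commas is a simplification that would not
handle expressions containing commas.

```python
from collections import defaultdict


def resolve_columns(available, list_of_columns, list_of_columns_to_exclude=None):
    """Return the columns to load: '*' expands to all, then exclusions apply."""
    excluded = set()
    if list_of_columns_to_exclude:
        excluded = {c.strip() for c in list_of_columns_to_exclude.split(",") if c.strip()}
    if list_of_columns.strip() == "*":
        requested = list(available)
    else:
        # Naive split: assumes no commas inside column expressions.
        requested = [c.strip() for c in list_of_columns.split(",") if c.strip()]
    return [c for c in requested if c not in excluded]


def fit_per_group(rows, fit_fn, grouping_cols=None, fit_params=None):
    """Call fit_fn once per group and return one output row per group."""
    group_keys = [c.strip() for c in grouping_cols.split(",")] if grouping_cols else []
    groups = defaultdict(list)
    for row in rows:
        # With no grouping columns every row maps to the same empty key.
        groups[tuple(row[c] for c in group_keys)].append(row)
    model_rows = []
    for key, members in groups.items():
        out = dict(zip(group_keys, key))
        out["model"] = fit_fn(members, **(fit_params or {}))
        model_rows.append(out)
    return model_rows


def mean_model(rows, target="y"):
    """Stand-in for a fit UDF (hypothetical): the 'model' is the target mean."""
    values = [r[target] for r in rows]
    return sum(values) / len(values)


if __name__ == "__main__":
    data = [
        {"id": 1, "g": "a", "y": 1.0},
        {"id": 2, "g": "a", "y": 3.0},
        {"id": 3, "g": "b", "y": 10.0},
    ]
    print(resolve_columns(["id", "g", "y"], "*", "id"))
    print(fit_per_group(data, mean_model, grouping_cols="g"))
```

In a real implementation the stand-in `mean_model` would be replaced by the
user's PL/Python UDF wrapping scikit-learn, Keras or XGBoost, and the output
rows would be written to model_table.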



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
