[jira] [Updated] (MADLIB-1220) Pre-processing helper function for mini-batching - grouping

Nikhil (JIRA) Fri, 23 Mar 2018 12:06:32 -0700

     [ https://issues.apache.org/jira/browse/MADLIB-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Nikhil updated MADLIB-1220:
---------------------------
    Description: 
Related to https://issues.apache.org/jira/browse/MADLIB-1200

Story

{{As a}}
data scientist
{{I want to}}
add grouping support to the mini-batch preprocessor
{{so that}}
I can preprocess all groups in a single operation.


Interface
{code}
minibatch_preprocessor(
    source_table,         -- Name of the table containing input data
    output_table,         -- Name of the output table for mini-batching
    dependent_varname,    -- Name of the dependent variable column
    independent_varname,  -- Expression list to evaluate for the independent variables
    grouping_cols,        -- Preprocess separately by group
    buffer_size           -- Number of source input rows to pack into batch
)
{code}
where
{code}
source_table
TEXT.  Name of the table containing input data.  Can also be a view.

output_table
TEXT.  Name of the output table from the preprocessor which will be used as 
input to algorithms that support mini-batching.

dependent_varname
TEXT.  Column name or expression to evaluate for the dependent variable. 

independent_varname
TEXT.  Column name or expression list to evaluate for the independent variables.
Values are cast to double precision when packing.

grouping_cols (optional)
TEXT, default: NULL.  An expression list used to group the input dataset into 
discrete groups, running one preprocessing step per group. Similar to the SQL 
GROUP BY clause. When this value is NULL, no grouping is used and a single 
preprocessing step is performed for the whole data set.

buffer_size (optional)
INTEGER, default: ???.  Number of source input rows to pack into each batch.
{code}
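The grouping and packing behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not the MADlib implementation: the function name pack_by_group, the dict-based rows, and the key names are hypothetical, and the ticket does not say whether ids are numbered across groups or whether row order is kept, so the sketch simply numbers buffers globally and preserves input order.

```python
from collections import defaultdict


def pack_by_group(rows, buffer_size, group_key=None, y_key="y", x_key="x"):
    """Pack rows into buffers of at most buffer_size rows, one group at a time.

    rows is a list of dicts (hypothetical layout). When group_key is None,
    the whole data set is treated as a single group, mirroring
    grouping_cols = NULL in the interface above.
    """
    # Bucket rows by group value; insertion order of groups is preserved.
    groups = defaultdict(list)
    for row in rows:
        key = row[group_key] if group_key is not None else None
        groups[key].append(row)

    packed = []
    next_id = 0  # assumption: ids are unique across all groups
    for key, members in groups.items():
        # Slice each group into consecutive chunks of buffer_size rows.
        for start in range(0, len(members), buffer_size):
            chunk = members[start:start + buffer_size]
            packed.append({
                "id": next_id,
                "grouping_cols": key,
                # Values are cast to float, as the interface casts to double.
                "dependent_varname": [float(r[y_key]) for r in chunk],
                "independent_varname": [[float(v) for v in r[x_key]] for r in chunk],
            })
            next_id += 1
    return packed
```

With a buffer size of 2, a group of three rows yields two buffers (two rows, then one), and no buffer ever mixes rows from different groups.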
The output table contains the following columns:
{code}
id                   INTEGER.   Unique id for packed table.
dependent_varname    FLOAT8[].  Packed array of dependent variables.
independent_varname  FLOAT8[].  Packed array of independent variables.
grouping_cols        TEXT.      Name of grouping columns.
{code}
A summary table named <output_table>_summary is created together with the 
output table.  It has the following columns:
{code}
source_table              Source table name.
output_table              Output table name from preprocessor.
dependent_varname         Dependent variable.
independent_varname       Independent variables.
buffer_size               Buffer size used in preprocessing step.
dependent_vartype         “Continuous” or “Categorical”
class_values              Class values of the dependent variable (NULL for continuous vars).
num_rows_processed        The total number of rows that were used in the computation.
num_missing_rows_skipped  The total number of rows that were skipped because of NULL values in them.
grouping_cols             Names of the grouping columns.
{code}
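As a rough illustration of how the row counts and class values in the summary table could be derived, here is a small Python sketch. It rests on assumptions the ticket does not confirm: a row counts as skipped when the dependent value or any independent value is NULL (None here), and the caller says whether the dependent variable is categorical via a hypothetical categorical flag. The function name summarize is also hypothetical.

```python
def summarize(rows, dependent, independents, categorical=False):
    """Count processed and skipped rows and collect class values.

    rows is a list of dicts; dependent is a key name and independents
    is a list of key names (hypothetical layout).
    """
    processed = 0
    skipped = 0
    classes = set()
    for row in rows:
        values = [row[dependent]] + [row[col] for col in independents]
        # Assumption: any NULL in the row causes the whole row to be skipped.
        if any(v is None for v in values):
            skipped += 1
            continue
        processed += 1
        classes.add(row[dependent])
    return {
        "dependent_vartype": "Categorical" if categorical else "Continuous",
        # class_values is NULL (None) for continuous dependent variables.
        "class_values": sorted(classes) if categorical else None,
        "num_rows_processed": processed,
        "num_missing_rows_skipped": skipped,
    }
```

The two counts always add up to the number of source rows, which gives a quick sanity check against the source table.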
A standardization table named <output_table>_standardization is created 
together with the output table.  It has the following columns:
{code}
        grouping_cols                   Group
        mean                            Mean of independent vars by group
        std                             Standard deviation of independent vars by group
{code}
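The per-group mean and standard deviation in the standardization table can be sketched as follows. This is an illustration only: the function name standardization_by_group and the dict-based rows are hypothetical, and the ticket does not say whether the population or sample standard deviation is meant, so the sketch assumes the population form (statistics.pstdev).

```python
from collections import defaultdict
from statistics import fmean, pstdev


def standardization_by_group(rows, group_key, x_key):
    """Return one entry per group with the mean and std of each independent variable."""
    # Collect the independent-variable vectors of each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[x_key])

    table = []
    for key, vectors in groups.items():
        # Transpose so that each tuple holds one variable's values in this group.
        columns = list(zip(*vectors))
        table.append({
            "grouping_cols": key,
            "mean": [fmean(col) for col in columns],
            # Assumption: population standard deviation.
            "std": [pstdev(col) for col in columns],
        })
    return table
```

A group with a single row gets a standard deviation of zero for every variable, so any code that divides by std to standardize would need to guard against that case.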
 
Acceptance

  was:
Related to
 https://issues.apache.org/jira/browse/MADLIB-1200

Story

{{As a}}
data scientist
{{I want to}}
add grouping to mini-batch pre-process
{{so that}}
I can handle groups with a single operation.


Interface
{code}
minibatch_preprocessor(
    source_table,         -- Name of the table containing input data
    output_table,         -- Name of the output table for mini-batching
    dependent_varname,    -- Name of the dependent variable column
    independent_varname,  -- Expression list to evaluate for the independent variables
    buffer_size,          -- Number of source input rows to pack into batch
    grouping_cols         -- Preprocess separately by group
)
{code}
where
{code}
source_table
TEXT.  Name of the table containing input data.  Can also be a view.

output_table
TEXT.  Name of the output table from the preprocessor which will be used as 
input to algorithms that support mini-batching.

dependent_varname
TEXT.  Column name or expression to evaluate for the dependent variable. 

independent_varname
TEXT.  Column name or expression list to evaluate for the independent variable. 
 Will be cast to double when packing.

buffer_size (optional)
INTEGER, default: ???.  Number of source input rows to pack into batch.

grouping_cols (optional)
TEXT, default: NULL.  An expression list used to group the input dataset into 
discrete groups, running one preprocessing step per group. Similar to the SQL 
GROUP BY clause. When this value is NULL, no grouping is used and a single 
preprocessing step is performed for the whole data set.
{code}
The output table contains the following columns:
{code}
id                   INTEGER.   Unique id for packed table.
dependent_varname    FLOAT8[].  Packed array of dependent variables.
independent_varname  FLOAT8[].  Packed array of independent variables.
grouping_cols        TEXT.      Name of grouping columns.
{code}
A summary table named <output_table>_summary is created together with the 
output table.  It has the following columns:
{code}
source_table              Source table name.
output_table              Output table name from preprocessor.
dependent_varname         Dependent variable.
independent_varname       Independent variables.
buffer_size               Buffer size used in preprocessing step.
dependent_vartype         “Continuous” or “Categorical”
class_values              Class values of the dependent variable (NULL for continuous vars).
num_rows_processed        The total number of rows that were used in the computation.
num_missing_rows_skipped  The total number of rows that were skipped because of NULL values in them.
grouping_cols             Names of the grouping columns.
{code}
A standardization table named <output_table>_standardization is created 
together with the output table.  It has the following columns:
{code}
        grouping_cols                   Group
        mean                            Mean of independent vars by group
        std                             Standard deviation of independent vars by group
{code}
 
Acceptance


> Pre-processing helper function for mini-batching - grouping 
> ------------------------------------------------------------
>
>                 Key: MADLIB-1220
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1220
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Nikhil
>            Assignee: Nikhil
>            Priority: Major
>             Fix For: v1.14



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
