[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399514#comment-16399514 ]

ASF GitHub Bot commented on MADLIB-1200:
----------------------------------------

GitHub user kaknikhil opened a pull request:

    https://github.com/apache/madlib/pull/241

    MiniBatch Pre-Processor: Add new module minibatch_preprocessing

    JIRA: MADLIB-1200
    
    MiniBatch Preprocessor is a utility function to pre-process the input
    data for use with models that support mini-batching as an optimization.
    TODO add more description here ??
    
    The main purpose of the function is to prepare the training data for
    mini-batching algorithms:
    1. If the dependent variable is boolean or text, perform one-hot encoding
       (not applicable for numeric).
    2. Typecast the independent variable to double precision[].
    3. Based on the buffer size, group the dependent and independent
       variables into a single tuple representing the buffer.
    
    Notes
    1. Ignore null values in independent and dependent variables
    2. Standardize the input before packing it.
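
    A minimal Python sketch of the encoding and packing steps above. The
    helper names and the pure-Python data representation are illustrative
    only, not MADlib's actual implementation:

```python
from itertools import islice

def one_hot(value, class_values):
    # Encode a categorical dependent value as a 0/1 indicator vector
    # over the distinct class values.
    return [1.0 if value == c else 0.0 for c in class_values]

def pack_buffers(rows, class_values, buffer_size):
    # rows: iterable of (dependent, independent_vector) tuples.
    # Rows with NULL (None) values are skipped and independent values
    # are cast to float, mirroring the steps described above.
    clean = ((d, [float(x) for x in iv]) for d, iv in rows
             if d is not None and iv is not None and None not in iv)
    it = iter(clean)
    while True:
        batch = list(islice(it, buffer_size))
        if not batch:
            break
        # One tuple per buffer: (packed dependents, packed independents)
        yield ([one_hot(d, class_values) for d, _ in batch],
               [iv for _, iv in batch])
```

    With buffer_size = 2, four input rows (one containing a NULL) pack
    into two buffers of two and one rows respectively.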
    
    Other changes:
    1. Removed the __ prefix from public methods in utils_regularization.py:
       renamed __utils_ind_var_scales and __utils_ind_var_scales_grouping
       so that they can be accessed from within a class, more specifically
       the minibatch_preprocessing module.
    2. Added a new function for regex matching and refactored
       elastic_net.py_in to use it.
    
    Co-authored-by: Rahul Iyer <[email protected]>
    Co-authored-by: Jingyi Mei <[email protected]>
    Co-authored-by: Nandish Jayaram <[email protected]>
    Co-authored-by: Orhan Kislal <[email protected]>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/minibatch_preprocessing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/241.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #241
    
----
commit 7e89d4097d1d889adfa2eff3ed6217c75b519427
Author: Nikhil Kak <nkak@...>
Date:   2018-01-24T20:01:40Z

    MiniBatch Pre-Processor: Add new module minibatch_preprocessing
    
    JIRA: MADLIB-1200
    
    MiniBatch Preprocessor is a utility function to pre-process the input
    data for use with models that support mini-batching as an optimization.
    TODO add more description here ??
    
    The main purpose of the function is to prepare the training data for
    mini-batching algorithms:
    1. If the dependent variable is boolean or text, perform one-hot encoding
       (not applicable for numeric).
    2. Typecast the independent variable to double precision[].
    3. Based on the buffer size, group the dependent and independent
       variables into a single tuple representing the buffer.
    
    Notes
    1. Ignore null values in independent and dependent variables
    2. Standardize the input before packing it.
    
    Other changes:
    1. Removed the __ prefix from public methods in utils_regularization.py:
       renamed __utils_ind_var_scales and __utils_ind_var_scales_grouping
       so that they can be accessed from within a class, more specifically
       the minibatch_preprocessing module.
    2. Added a new function for regex matching and refactored
       elastic_net.py_in to use it.
    
    Co-authored-by: Rahul Iyer <[email protected]>
    Co-authored-by: Jingyi Mei <[email protected]>
    Co-authored-by: Nandish Jayaram <[email protected]>
    Co-authored-by: Orhan Kislal <[email protected]>

----


> Pre-processing helper function for mini-batching 
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
>  https://issues.apache.org/jira/browse/MADLIB-1037
>  https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}} data scientist
> {{I want to}} pre-process input files for use with mini-batching
> {{so that}} the optimization part of MLP, SVM, etc. runs faster when I do
> multiple runs, perhaps because I am tuning parameters (i.e., pre-processing
> is an occasional operation that I don't want to re-do every time I train a
> model).
> Interface
> {code}
> minibatch_preprocessor(       
>      source_table, -- Name of the table containing input data
>      output_table, -- Name of the output table for mini-batching
>      dependent_varname, -- Name of the dependent variable column      
>      independent_varname, -- Expression list to evaluate for the independent variables
>      grouping_cols, -- Preprocess separately by group
>      buffer_size  -- Number of source input rows to pack into batch
> )
> {code}
> where
> {code}
> source_table
> TEXT.  Name of the table containing input data.  Can also be a view.
> output_table
> TEXT.  Name of the output table from the preprocessor which will be used as 
> input to algorithms that support mini-batching.
> dependent_varname
> TEXT.  Column name or expression to evaluate for the dependent variable. 
> independent_varname
> TEXT.  Column name or expression list to evaluate for the independent 
> variable.  Will be cast to double when packing.
> grouping_cols (optional)
> TEXT, default: NULL.  An expression list used to group the input dataset into 
> discrete groups, running one preprocessing step per group. Similar to the SQL 
> GROUP BY clause. When this value is NULL, no grouping is used and a single 
> preprocessing step is performed for the whole data set.
> buffer_size (optional)
> INTEGER, default: ???.  Number of source input rows to pack into batch.
> The output table contains the following columns:
> id                    INTEGER.  Unique id for the packed table.
> dependent_varname     FLOAT8[]. Packed array of dependent variables.
> independent_varname   FLOAT8[]. Packed array of independent variables.
> grouping_cols         TEXT.     Name of grouping columns.
> A summary table named <output_table>_summary is created together with the
> output table.  It has the following columns:
> source_table              Source table name.
> output_table              Output table name from preprocessor.
> dependent_varname         Dependent variable.
> independent_varname       Independent variables.
> buffer_size               Buffer size used in the preprocessing step.
> dependent_vartype         "Continuous" or "Categorical".
> class_values              Class values of the dependent variable (NULL for
>                           continuous variables).
> num_rows_processed        Total number of rows used in the computation.
> num_missing_rows_skipped  Total number of rows skipped because of NULL
>                           values in them.
> grouping_cols             Names of the grouping columns.
> A standardization table named <output_table>_standardization is created
> together with the output table.  It has the following columns:
> grouping_cols    Group.
> mean             Mean of independent variables by group.
> std              Standard deviation of independent variables by group.
> {code}
>  
> The main purpose of the function is to prepare the training data for
> mini-batching algorithms. This will be achieved in 2 stages:
>  # Based on the buffer size, group all the dependent and independent
> variables into a single tuple representing the buffer.
>  # If the dependent variables are boolean or text, perform one-hot encoding
> (not applicable for integers and floats). Note that if integer variables are
> actually categorical, they must be cast to ::TEXT so that they get encoded.
> Notes
> 1) A random shuffle of rows is needed for mini-batching.
> 2) A naive approach may be OK to start; not worth a big investment to make
> it run 10% or 20% faster.
> Acceptance
> Summary
>   1) Convert from the standard format to the special format for
> mini-batching.
>   2) Standardization is on by default for now and the user cannot opt out;
> we may decide to add a flag later.
>   3) Some scale testing OK (does not need to be comprehensive).
>   4) Document as a helper function in the user docs.
>   5) Always ignore nulls in the dependent variable.
>   6) IC
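
The per-group standardization described above (and recorded in the
<output_table>_standardization table) amounts to a z-score per
independent-variable dimension within each group. A minimal Python sketch,
with illustrative helper names rather than MADlib's actual internals:

```python
import math
from collections import defaultdict

def standardize_by_group(rows):
    # rows: iterable of (group_key, independent_vector) pairs.
    # Computes per-group mean and (population) standard deviation per
    # dimension, then returns the z-scored vectors plus a summary
    # {group: (mean, std)} resembling the standardization table.
    by_group = defaultdict(list)
    for g, vec in rows:
        by_group[g].append(vec)
    stats, out = {}, []
    for g, vecs in by_group.items():
        n, dim = len(vecs), len(vecs[0])
        mean = [sum(v[j] for v in vecs) / n for j in range(dim)]
        std = [math.sqrt(sum((v[j] - mean[j]) ** 2 for v in vecs) / n)
               for j in range(dim)]
        std = [s if s > 0 else 1.0 for s in std]  # guard constant columns
        stats[g] = (mean, std)
        out.extend((g, [(v[j] - mean[j]) / std[j] for j in range(dim)])
                   for v in vecs)
    return out, stats
```

Guarding zero standard deviations with 1.0 is one conventional choice for
constant columns; the real module may handle that case differently.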



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
