GitHub user kaknikhil opened a pull request:
https://github.com/apache/madlib/pull/241
MiniBatch Pre-Processor: Add new module minibatch_preprocessing
JIRA: MADLIB-1200
MiniBatch Preprocessor is a utility function to pre-process the input
data for use with models that support mini-batching as an optimization.
The main purpose of the function is to prepare the training data for
mini-batching algorithms:
1. If the dependent variable is boolean or text, one-hot encode it;
numeric dependent variables are left as is.
2. Typecast the independent variables to double precision[].
3. Based on the buffer size, pack the dependent and independent
variables of each buffer into a single tuple representing that
buffer (a sketch follows this list).
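For illustration only, here is a minimal Python sketch of those three steps
(one-hot encoding the dependent variable, casting the independent variables
to floating point, and packing rows into buffer-sized tuples). The function
and column names are assumptions and this is not the module's actual code:

    def one_hot_encode(value, categories):
        """Encode a boolean/text dependent value as a one-hot vector."""
        return [1 if value == c else 0 for c in categories]

    def pack_buffers(rows, dep_col, indep_cols, buffer_size):
        """Pack rows into buffers: each output tuple holds the one-hot
        encoded dependent values and the independent variable matrix
        for one buffer."""
        categories = sorted({r[dep_col] for r in rows})
        buffers = []
        for start in range(0, len(rows), buffer_size):
            chunk = rows[start:start + buffer_size]
            # cast the independent variables to floating point (double precision[])
            indep = [[float(r[c]) for c in indep_cols] for r in chunk]
            dep = [one_hot_encode(r[dep_col], categories) for r in chunk]
            buffers.append((dep, indep))
        return buffers

    rows = [
        {"y": "yes", "x1": 1, "x2": 2.5},
        {"y": "no",  "x1": 0, "x2": 1.0},
        {"y": "yes", "x1": 3, "x2": 0.5},
    ]
    print(pack_buffers(rows, "y", ["x1", "x2"], buffer_size=2))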
Notes
1. Null values in the independent and dependent variables are ignored.
2. The input is standardized before packing (see the sketch below).
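The standardization in note 2 amounts to column-wise z-scoring of the
independent variables before they are packed. A rough sketch, again with
assumed names rather than the module's actual code:

    def standardize(matrix):
        """Column-wise (x - mean) / std, guarding against a zero std."""
        cols = list(zip(*matrix))
        means = [sum(c) / len(c) for c in cols]
        stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
                for c, m in zip(cols, means)]
        return [[(v - m) / s for v, m, s in zip(row, means, stds)]
                for row in matrix]

    print(standardize([[1.0, 2.5], [0.0, 1.0], [3.0, 0.5]]))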
Other changes:
1. Removed the leading __ from methods in utils_regularization.py:
renamed __utils_ind_var_scales and __utils_ind_var_scales_grouping
so that they can be accessed from within a class, more specifically
the minibatch_preprocessing module.
2. Added a new function for regex matching and refactored elastic_net.py_in
to use it (a sketch follows this list).
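The regex helper in item 2 could look roughly like the following; the name,
signature, and case handling here are assumptions for illustration, not the
function actually added by this PR:

    import re

    def matches_pattern(text, pattern):
        """Return True when text matches pattern, else False (illustrative)."""
        if text is None:
            return False
        return bool(re.match(pattern, text, flags=re.IGNORECASE))

    print(matches_pattern("gaussian", r"^(gaussian|linear)$"))  # True
    print(matches_pattern("binomial", r"^(gaussian|linear)$"))  # False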
Co-authored-by: Rahul Iyer <[email protected]>
Co-authored-by: Jingyi Mei <[email protected]>
Co-authored-by: Nandish Jayaram <[email protected]>
Co-authored-by: Orhan Kislal <[email protected]>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib feature/minibatch_preprocessing
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/241.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #241
----
commit 7e89d4097d1d889adfa2eff3ed6217c75b519427
Author: Nikhil Kak <nkak@...>
Date: 2018-01-24T20:01:40Z
MiniBatch Pre-Processor: Add new module minibatch_preprocessing
JIRA: MADLIB-1200
MiniBatch Preprocessor is a utility function to pre-process the input
data for use with models that support mini-batching as an optimization.
The main purpose of the function is to prepare the training data for
mini-batching algorithms:
1. If the dependent variable is boolean or text, one-hot encode it;
numeric dependent variables are left as is.
2. Typecast the independent variables to double precision[].
3. Based on the buffer size, pack the dependent and independent
variables of each buffer into a single tuple representing that buffer.
Notes
1. Null values in the independent and dependent variables are ignored.
2. The input is standardized before packing.
Other changes:
1. Removed the leading __ from methods in utils_regularization.py:
renamed __utils_ind_var_scales and __utils_ind_var_scales_grouping
so that they can be accessed from within a class, more specifically
the minibatch_preprocessing module.
2. Added a new function for regex matching and refactored elastic_net.py_in
to use it.
Co-authored-by: Rahul Iyer <[email protected]>
Co-authored-by: Jingyi Mei <[email protected]>
Co-authored-by: Nandish Jayaram <[email protected]>
Co-authored-by: Orhan Kislal <[email protected]>
----
---