[
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1200:
------------------------------------
Description:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048
Story
{{As a}}
data scientist
{{I want to}}
pre-process input files for use with mini-batching
{{so that}}
the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
perhaps because I am tuning parameters (i.e., pre-processing is an occasional
operation that I don't want to re-do every time that I train a model)
Interface
{code:sql}
minibatch_preprocessor(
    source_table,        -- Name of the table containing the input data.
    output_table,        -- Name of the output table suitable for mini-batching.
    dependent_varname,   -- Name of the dependent variable column.
    independent_varname, -- Expression list to evaluate for the independent variables.
    buffer_size          -- ???
){code}
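For illustration, a call might look as follows. The table and column names,
the literal buffer size, and the reading of buffer_size as "rows packed per
output tuple" are all hypothetical; the signature above is still a draft.
{code:sql}
-- Hypothetical invocation (names and values are illustrative only):
SELECT madlib.minibatch_preprocessor(
    'iris_data',                         -- source_table
    'iris_data_packed',                  -- output_table
    'class_text',                        -- dependent_varname
    'ARRAY[sepal_length, sepal_width]',  -- independent_varname
    10                                   -- buffer_size (assumed: rows per packed tuple)
);{code}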
The main purpose of the function is to prepare the training data for
mini-batching algorithms. This will be achieved in two stages:
# Based on the batch size, group all of the dependent and independent
variables into a single tuple representing the batch.
# If the independent variables are boolean or text, perform one-hot encoding.
This is not applicable to integer and float types. Note that if integer
variables are actually categorical, they must be cast to ::TEXT so that they
get encoded. (A sketch of both stages follows this list.)
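As a rough sketch only, the SQL below illustrates both stages for a
hypothetical source table iris_data with a text label class_text and a
fixed-length float array attributes. The buffer size of 10, the label values,
and the array_agg-based packing are assumptions, not the committed
implementation; note that in practice the encoding step runs per row before
packing.
{code:sql}
-- Stage 2 sketch: one-hot encode the text label into a 0/1 vector.
-- The label values are assumed for illustration.
CREATE TABLE iris_data_encoded AS
SELECT ARRAY[(class_text = 'setosa')::INT,
             (class_text = 'versicolor')::INT,
             (class_text = 'virginica')::INT] AS class_one_hot,
       attributes
FROM iris_data;

-- Stage 1 sketch: shuffle the rows, then pack every 10 consecutive rows
-- into one tuple per buffer. array_agg over array values (building a 2-D
-- array) requires PostgreSQL 9.5+ and arrays of equal length.
CREATE TABLE iris_data_packed AS
SELECT row_id / 10 AS buffer_id,
       array_agg(class_one_hot) AS dependent_varname,
       array_agg(attributes)    AS independent_varname
FROM (
    SELECT row_number() OVER (ORDER BY random()) - 1 AS row_id,
           class_one_hot,
           attributes
    FROM iris_data_encoded
) AS shuffled
GROUP BY row_id / 10;{code}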
Notes
1) Random shuffle needed for mini-batch.
2) A naive approach may be OK to start; it is not worth a big investment to
make it run 10% or 20% faster.
Acceptance
1) Convert from standard to special format for mini-batching
2) Some scale testing OK (does not need to be comprehensive)
3) Document as a helper function in the user docs
4) IC
was:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048
Story
{{As a}}
data scientist
{{I want to}}
pre-process input files for use with mini-batching
{{so that}}
the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
perhaps because I am tuning parameters (i.e., pre-processing is an occasional
operation that I don't want to re-do every time that I train a model)
Interface
This function is kind of the inverse of:
Suggested interface:
{code:sql}
minibatch_preprocessor(
    source_table,
    output_table,
    dependent_varname,
    independent_varname,
    batch_size, -- Number of elements to pack
    encode      -- One-hot encoding if set to TRUE
){code}
The main purpose of the function is to prepare the training data for
mini-batching algorithms. This will be achieved in two stages:
1. Based on the batch size, group all of the dependent and independent
variables into a single tuple representing the batch.
2. If the encode parameter is TRUE, perform one-hot encoding for the
dependent variable. Users will need to set encode to TRUE for multi-class
SVM/MLP and FALSE for single-class SVM.
Notes
1) Random shuffle needed for mini-batch.
2) A naive approach may be OK to start; it is not worth a big investment to
make it run 10% or 20% faster.
Acceptance
1) Convert from standard to special format for mini-batching
2) Some scale testing OK (does not need to be comprehensive)
3) Document as a helper function in the user docs
4) IC
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.14
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)