[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356188#comment-16356188 ]
Orhan Kislal commented on MADLIB-1200:
--------------------------------------

Does it make sense to have a parameter called `standardize`? Algorithms such as elastic_net and MLP standardize the independent and dependent variables internally. It might be harder to standardize the processed table that is the output of minibatch_preprocessor() in those modules.

> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1037
> https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
> data scientist
> {{I want to}}
> pre-process input files for use with mini-batching
> {{so that}}
> the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
> perhaps because I am tuning parameters (i.e., pre-processing is an occasional
> operation that I don't want to re-do every time that I train a model)
> Interface
> {code:java}
> minibatch_preprocessor (
>     source_table,          -- Name of the table containing the input data.
>     output_table,          -- Name of the table suitable for mini-batching.
>     dependent_varname,     -- Name of the dependent variable column.
>     independent_varname,   -- Expression list to evaluate for the independent variables.
>     buffer_size,           -- ???
> ){code}
>
> The main purpose of the function is to prepare the training data for
> minibatching algorithms. This will be achieved in 2 stages:
> # Based on the batch size, group all the dependent and independent variables
> in a single tuple representative of the batch.
> # If the independent variables are boolean or text, perform one hot
> encoding. N/A for integers and floats. Note that if the integer vars are
> actually categorical, they must be cast to ::TEXT so that they get encoded.
> Notes
> 1) Random shuffle needed for mini-batch.
> 2) Naive approach may be OK to start, not worth big investment to make run
> 10% or 20% faster.
> Acceptance
> 1) Convert from standard to special format for mini-batching
> 2) Some scale testing OK (does not need to be comprehensive)
> 3) Document as a helper function user docs
> 4) IC

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
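
For illustration only, the two stages in the issue description (plus the random shuffle from the Notes) could be sketched in plain Python as below. This is not MADlib code: the names `one_hot`, `preprocess` and `seed` are hypothetical, the rows live in memory rather than in a database table, and `buffer_size` is assumed to mean "rows per packed tuple", which the issue itself leaves open. The sketch also encodes before grouping, which is a convenience of the example rather than something the issue specifies.

```python
import random


def one_hot(values):
    """Return (indicator vectors, levels) for a column of text/boolean values."""
    # Sort the distinct levels by their string form so the encoding is deterministic.
    levels = sorted(set(values), key=str)
    index = {level: i for i, level in enumerate(levels)}
    vectors = [[1 if index[v] == i else 0 for i in range(len(levels))]
               for v in values]
    return vectors, levels


def preprocess(rows, buffer_size, seed=None):
    """Shuffle, one-hot encode, and pack rows into batches.

    rows: list of (dependent_value, [independent_values]) pairs.
    Returns a list of (dependent_values, feature_rows) tuples, one per batch.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rows = list(rows)
    if not rows:
        return []

    # Random shuffle so that each batch is a random sample of the input.
    random.Random(seed).shuffle(rows)

    # Encode column by column: text/boolean columns are one-hot encoded,
    # numeric columns are passed through as single floats.
    n_cols = len(rows[0][1])
    encoded_cols = []
    for j in range(n_cols):
        col = [r[1][j] for r in rows]
        if all(isinstance(v, (bool, str)) for v in col):
            vectors, _ = one_hot(col)
            encoded_cols.append(vectors)
        else:
            encoded_cols.append([[float(v)] for v in col])

    # Flatten the per-column encodings into one feature vector per row.
    features = [[x for col in encoded_cols for x in col[i]]
                for i in range(len(rows))]

    # Group every buffer_size rows into a single tuple representing the batch.
    batches = []
    for start in range(0, len(rows), buffer_size):
        chunk = slice(start, start + buffer_size)
        batches.append(([r[0] for r in rows[chunk]], features[chunk]))
    return batches
```

An integer column that is really categorical would have to be converted to strings before calling `preprocess`, which mirrors the cast to ::TEXT mentioned in the description. The last batch may hold fewer than `buffer_size` rows when the row count is not an exact multiple.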