[ 
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340267#comment-16340267
 ] 

Nikhil edited comment on MADLIB-1200 at 1/25/18 11:17 PM:
----------------------------------------------------------

[~fmcquillan]
I was thinking that maybe we shouldn't have {code} encode {code} and {code} 
batch_size {code} as optional parameters, because leaving them optional isn't 
explicit and causes confusion. 
[~riyer] [~okislal] what do you think?


was (Author: nikhilkak):
[~fmcquillan]
I was thinking that may be we shouldn't have {code} encode and batch_size 
{code} as optional parameters because it isn't explicit and causes confusion. 
[~riyer][~okislal] what do you think ? 

> Pre-processing helper function for mini-batching 
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
>  https://issues.apache.org/jira/browse/MADLIB-1037
>  https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
>  data scientist
>  {{I want to}}
>  pre-process input files for use with mini-batching
>  {{so that}}
>  the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
> perhaps because I am tuning parameters (i.e., pre-processing is an occasional 
> operation that I don't want to re-do every time that I train a model)
> Interface
> This function is kind of the inverse of:
>  
> Suggested interface:
> {code:java}
> minibatch_preprocessor (
> source_table, 
> output_table,
> dependent_varname,
> independent_varname,
> batch_size,                          -- Number of elements to pack
> encode                               -- One-hot encoding if set to TRUE
> ){code}
>  
> The main purpose of the function is to prepare the training data for 
> mini-batching algorithms. This will be achieved in two stages:
> 1. Based on the batch size, group all the dependent and independent variables 
> into a single tuple that represents the batch.
>  2. If the encode parameter is TRUE, one-hot encode the dependent variable. 
> Users will need to set encode to TRUE for multi-class SVM/MLP and FALSE for 
> single-class SVM.
> Notes
> 1) Random shuffle needed for mini-batch.
>  2) A naive approach may be OK to start; it is not worth a big investment to 
> make it run 10% or 20% faster.
> Acceptance
> 1) Convert from standard to special format for mini-batching
>  2) Some scale testing OK (does not need to be comprehensive)
>  3) Document as a helper function in the user docs
>  4) IC
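
For illustration only, the two stages described under "Interface", together 
with the random shuffle from the Notes, can be sketched in plain Python. This 
is an in-memory sketch, not the MADlib implementation (which would operate on 
database tables). The names minibatch_preprocess, one_hot and seed are 
hypothetical, and the input is assumed to be a list of (dependent, 
independent) pairs. batch_size and encode are required arguments here, as in 
the suggested interface.

```python
import random


def one_hot(labels):
    """Return one 0/1 vector per label; classes are ordered by sorted label value."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[y] == i else 0 for i in range(len(classes))]
            for y in labels]


def minibatch_preprocess(rows, batch_size, encode, seed=None):
    """Shuffle rows, optionally one-hot encode the dependent variable,
    and pack the result into batches.

    rows       -- list of (dependent, independent) pairs (assumed layout)
    batch_size -- number of rows to pack into each batch
    encode     -- one-hot encode the dependent variable if True
    seed       -- hypothetical extra argument, only for reproducible shuffles

    Returns a list of (dependent_batch, independent_batch) tuples.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    rows = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # random shuffle needed for mini-batch

    deps = [dep for dep, _ in rows]
    indeps = [indep for _, indep in rows]

    # Encode over the whole data set so every batch uses the same class order.
    if encode:
        deps = one_hot(deps)

    # Stage 1: one tuple per batch; the last batch may be smaller.
    return [(deps[i:i + batch_size], indeps[i:i + batch_size])
            for i in range(0, len(rows), batch_size)]
```

Because the output depends only on the inputs and the seed, it can be computed 
once and reused across training runs, which is the motivation given in the 
story above.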



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
