[jira] [Comment Edited] (MADLIB-1294) Field names in output table for minibatch preprocessor

Frank McQuillan (JIRA) Thu, 07 Feb 2019 09:25:58 -0800


    [ 
https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762889#comment-16762889
 ]


Frank McQuillan edited comment on MADLIB-1294 at 2/7/19 5:24 PM:
-----------------------------------------------------------------

Hmm, I guess I don't see why using "independent_varname" and "dependent_varname" 
as generic output column names is an issue.  If that is the known behavior and 
it is well documented in the user docs, then is that not OK?

As we support more options in the future for specifying the independent vars in 
the input table (e.g., multiple columns, each holding scalar values), a generic 
output column name still works, but trying to carry over some version of the 
input column names gets complex.


was (Author: fmcquillan):
Hmm, I guess I don't see why using "independent_varname" and "dependent_varname" 
as generic output column names is an issue.  If that is the known behavior and 
it is well documented in the user docs, then is that not OK?

As we support more options in the future for specifying the independent vars in 
the input table (e.g., multiple columns, each holding scalar values), a generic 
output column name still works.

> Field names in output table for minibatch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1294
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1294
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>             Fix For: v1.16
>
>
> The minibatch preprocessor utility used for preparing input tables before 
> training accepts "independent_varname" and "dependent_varname" as parameters.
> I believe the original intention was to have these refer to the names of the 
> columns in the input table as well as the output table generated from it.
> However, there is a bug in the implementation: instead of writing out 
> the output table columns as {independent_varname} and {dependent_varname}, 
> the curly braces were omitted, meaning whatever names were in the original 
> table get wiped out and replaced by the literal strings 'independent_varname' 
> and 'dependent_varname'.
> This makes little sense for several reasons:
> 1.) The contents of these columns are data, not variable names, so they end 
> up misnamed in the output.
> 2.) This forces you to pass the argument strings 'independent_varname' and 
> 'dependent_varname' as the column names of the resulting batched table to the 
> fit/train function it's going to be fed into.  In other words, if you're 
> using the minibatch preprocessor, then these arguments to fit/train serve no 
> purpose, since you always have to pass the same strings rather than a custom 
> name.
> 3.) You can't pick your own names for these variables, unless you want to 
> manually rename them every time after you run the minibatch preprocessor.
> We recently finished a similar minibatch preprocessing utility 
> for deep learning support in MADlib 1.16.  I'd like to avoid reproducing this 
> bug in the new utility, but the two utilities should stay compatible, which 
> means we need to fix both the old and the new one or neither.  The only issue 
> with fixing the old one is that it has already been released that way.  So I'm 
> opening this bug report as a way of soliciting community feedback on the 
> issue.
> If there is anyone who knows of a reason why this should be viewed as a 
> feature rather than a bug, or has a need for the functionality to remain the 
> same going forward, please comment. Thanks!
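
For readers unfamiliar with the mechanism described in the issue, here is a 
minimal Python sketch of how omitting the curly braces in a format template 
turns a parameter into a literal column name. This is an illustration only, 
not the actual MADlib source: the function names, the simplified SELECT 
statement, and the table and column names are all hypothetical, and real code 
would also need to quote identifiers properly.

```python
# Hypothetical sketch of the behavior described in MADLIB-1294.
# None of these names come from the MADlib code base.

def build_select_buggy(independent_varname, dependent_varname, source_table):
    """Output aliases lack curly braces, so they are emitted as literal text."""
    template = (
        "SELECT {independent_varname} AS independent_varname, "
        "{dependent_varname} AS dependent_varname "
        "FROM {source_table}"
    )
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
        source_table=source_table,
    )


def build_select_fixed(independent_varname, dependent_varname, source_table):
    """Output aliases use placeholders, so the caller's names carry over."""
    template = (
        "SELECT {independent_varname} AS {independent_varname}, "
        "{dependent_varname} AS {dependent_varname} "
        "FROM {source_table}"
    )
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
        source_table=source_table,
    )


if __name__ == "__main__":
    # Same inputs, different output column names.
    print(build_select_buggy("features", "label", "train_data"))
    print(build_select_fixed("features", "label", "train_data"))
```

With the first variant every output table ends up with the same two generic 
column names regardless of the input; with the second, the names passed in by 
the caller are preserved. Which of the two is preferable is exactly the 
question under discussion in this issue.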



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
