[
https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763003#comment-16763003
]
Frank McQuillan commented on MADLIB-1294:
-----------------------------------------
Here is a proposal for the mini-batch preprocessor:
http://madlib.apache.org/docs/latest/group__grp__minibatch__preprocessing.html
The output table produced by the mini-batch preprocessor contains the following
columns:
{code}
...
dependent_varname FLOAT8[]. Packed array of dependent variables. If the
dependent variable in the source table is categorical, the preprocessor will
one-hot encode it.
independent_varname FLOAT8[]. Packed array of independent variables.
...
{code}
This is misleading because these columns contain values, not names, so we
should rename them to:
{code}
...
dependent_var
independent_var
...
{code}
The output summary table contains the following columns:
{code}
dependent_varname Dependent variable from the source table.
independent_varname Independent variable from the source table.
{code}
This is OK since the columns actually do contain names.
I am moving this JIRA to 2.0, since renaming the columns in 1.16 would break
semantic versioning.
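For output tables already produced by the current preprocessor, the proposed
names could be applied by hand in the meantime. The following is only a sketch
under stated assumptions: the helper name and the table name "batched" are
hypothetical, it relies on the standard PostgreSQL ALTER TABLE ... RENAME
COLUMN syntax, and it does no identifier quoting or validation.

```python
# Hypothetical helper, not part of MADlib: builds the SQL needed to rename
# the two packed columns of an existing preprocessor output table.
PROPOSED_RENAMES = [
    ("dependent_varname", "dependent_var"),
    ("independent_varname", "independent_var"),
]


def rename_statements(output_table):
    """Return one ALTER TABLE statement per proposed column rename.

    Assumes output_table is a trusted, already-valid identifier.
    """
    return [
        "ALTER TABLE {table} RENAME COLUMN {old} TO {new};".format(
            table=output_table, old=old, new=new
        )
        for old, new in PROPOSED_RENAMES
    ]


if __name__ == "__main__":
    for statement in rename_statements("batched"):
        print(statement)
```

The summary table is left alone here, since its dependent_varname and
independent_varname columns do hold names.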
> Field names in output table for minibatch preprocessor
> ------------------------------------------------------
>
> Key: MADLIB-1294
> URL: https://issues.apache.org/jira/browse/MADLIB-1294
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Utilities
> Reporter: Domino Valdano
> Assignee: Domino Valdano
> Priority: Minor
> Fix For: v1.16
>
>
> The minibatch preprocessor utility used for preparing input tables before
> training accepts "independent_varname" and "dependent_varname" as parameters.
> I believe the original intention was to have these refer to the names of the
> columns in the input table as well as the output table generated from it.
> However, there is a bug in the implementation: instead of writing out
> the output table columns as {independent_varname} and {dependent_varname},
> the curly braces were omitted, so whatever names were in the original
> table get wiped out and replaced by the literal strings 'independent_varname'
> and 'dependent_varname'.
> This makes little sense for several reasons:
> 1.) The contents of these columns are data, not variable names, so they end
> up misnamed in the output.
> 2.) This forces you to pass the argument strings 'independent_varname' and
> 'dependent_varname' as the column names of the resulting batched table to the
> fit/train function it's going to be fed into. In other words, if you're
> using the minibatch preprocessor, then these arguments to fit/train serve no
> purpose, since you always have to pass the same strings rather than a custom
> name.
> 3.) You can't pick your own names for these variables, unless you want to
> manually rename them every time after you run the minibatch preprocessor.
> We have just finished making a similar minibatch preprocessing utility
> for deep learning support in MADlib 1.16. I'd like to avoid reproducing this
> bug in the new utility, but we don't want the two to be incompatible, so
> we need to either fix both the old and the new or neither. The only issue
> with fixing the old is that it's already been released that way. So I'm
> opening this bug report as a way of soliciting community feedback on the
> issue.
> If there is anyone who knows of a reason why this should be viewed as a
> feature rather than a bug, or has a need for the functionality to remain the
> same going forward, please comment. Thanks!
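The brace omission described in the issue can be illustrated with Python
string formatting. This is a minimal sketch, not MADlib's actual source: the
function names, the template text, and the example column names (pixels,
label) are assumptions, and the packing of rows into arrays that the real
preprocessor performs is left out.

```python
# Illustrative only: shows how omitting braces in a format template turns a
# placeholder into a literal column alias.


def output_columns_buggy(independent_varname, dependent_varname):
    # The aliases have no braces, so format() leaves them as literal text
    # and the caller's column names are lost in the output table.
    template = (
        "{dependent_varname} AS dependent_varname, "
        "{independent_varname} AS independent_varname"
    )
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
    )


def output_columns_with_braces(independent_varname, dependent_varname):
    # With braces, the aliases are substituted with the caller's names.
    template = (
        "{dependent_varname} AS {dependent_varname}, "
        "{independent_varname} AS {independent_varname}"
    )
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
    )


if __name__ == "__main__":
    print(output_columns_buggy("pixels", "label"))
    # label AS dependent_varname, pixels AS independent_varname
    print(output_columns_with_braces("pixels", "label"))
    # label AS label, pixels AS pixels
```

The second function reflects the behavior the reporter believes was originally
intended; the comment on this issue proposes fixed names (dependent_var,
independent_var) instead.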