[ https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436518#comment-16436518 ]
ASF GitHub Bot commented on MADLIB-1226:
----------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/259

> Add option for 1-hot encoding to minibatch preprocessor
> -------------------------------------------------------
>
>                 Key: MADLIB-1226
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1226
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.14
>
>
> I was testing the MNIST dataset with the minibatch preprocessor + MLP and could not get it to converge. It turned out to be user error (mine) and not a convergence problem at all: I had forgotten to 1-hot encode the dependent variable.
> But I am wondering whether other people might do the same thing I did and get confused.
> Here's what I did. For this input data:
> {code}
> madlib=# \d+ public.mnist_train
>                                          Table "public.mnist_train"
>  Column |   Type    |                        Modifiers                         | Storage  | Stats target | Description
> --------+-----------+----------------------------------------------------------+----------+--------------+-------------
>  y      | integer   |                                                          | plain    |              |
>  x      | integer[] |                                                          | extended |              |
>  id     | integer   | not null default nextval('mnist_train_id_seq'::regclass) | plain    |              |
> {code}
> I called the minibatch preprocessor:
> {code}
> SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
>                                      'mnist_train_packed', -- Output table
>                                      'y',                  -- Dependent variable
>                                      'x'                   -- Independent variables
>                                     );
> {code}
> then MLP:
> {code}
> SELECT madlib.mlp_classification(
>     'mnist_train_packed',    -- Source table from preprocessor output
>     'mnist_result',          -- Destination table
>     'independent_varname',   -- Independent
>     'dependent_varname',     -- Dependent
>     ARRAY[5],                -- Hidden layer sizes
>     'learning_rate_init=0.01,
>     n_iterations=20,
>     learning_rate_policy=exp,
>     n_epochs=20,
>     lambda=0.0001,           -- Regularization
>     tolerance=0',
>     'tanh',                  -- Activation function
>     '',                      -- No weights
>     FALSE,                   -- No warmstart
>     TRUE);                   -- Verbose
> {code}
> with the result:
> {code}
> INFO: Iteration: 2, Loss: <-79.5295531257>
> INFO: Iteration: 3, Loss: <-79.529408892>
> INFO: Iteration: 4, Loss: <-79.5291940436>
> INFO: Iteration: 5, Loss: <-79.5288964944>
> INFO: Iteration: 6, Loss: <-79.5285051451>
> INFO: Iteration: 7, Loss: <-79.5280094708>
> INFO: Iteration: 8, Loss: <-79.5273995189>
> INFO: Iteration: 9, Loss: <-79.5266665607>
> {code}
> So it did not error out, but it is clearly not working on data in the right format.
> I suggest 2 changes:
> 1) Add an explicit param to the mini-batch preprocessor for 1-hot encoding of scalar integer dependent variables (this JIRA).
> 2) Add a check to the MLP classification code that the dependent variable has been 1-hot encoded, and error out if that is not the case. (https://issues.apache.org/jira/browse/MADLIB-1226)
> Proposed interface:
> {code}
> minibatch_preprocessor( source_table,
>                         output_table,
>                         dependent_varname,
>                         independent_varname,
>                         grouping_col,
>                         buffer_size,
>                         one_hot_encode_int_dep_var
>                       )
> {code}
> {code}
> one_hot_encode_int_dep_var (optional)
> BOOLEAN. Default: FALSE. Whether to one-hot encode dependent variables that are scalar integers.
> This parameter is ignored if the dependent variable is not a scalar integer.
> More detail: the mini-batch preprocessor automatically encodes dependent variables that are Boolean or character types such as text, char and varchar. However, scalar integers are a special case because they can be used in both classification and regression problems, so you must tell the mini-batch preprocessor whether you want to encode them or not.
> In the case that you have already encoded the dependent variable yourself, you can ignore this parameter. Also, if you want to encode float values for some reason, cast them to text first.
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
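For reference, the transformation the proposed `one_hot_encode_int_dep_var` flag would apply to a scalar integer dependent variable can be sketched as follows. This is a conceptual illustration only, not MADlib's implementation; the function name is made up for the example.

```python
def one_hot_encode(labels):
    """One-hot encode a list of scalar integer class labels.

    Conceptual sketch of what one_hot_encode_int_dep_var=TRUE would do
    inside minibatch_preprocessor: each distinct label value becomes one
    column, with a 1 in the column for that row's class and 0 elsewhere.
    """
    classes = sorted(set(labels))                 # distinct class values, ordered
    index = {c: i for i, c in enumerate(classes)}  # label -> column position
    return [[1 if index[y] == i else 0 for i in range(len(classes))]
            for y in labels]

# e.g. digit-style labels drawn from {0, 1, 2}
print(one_hot_encode([0, 2, 1, 2]))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

With this encoding in place, the MLP's output layer has one unit per class, which is what classification expects; feeding it the raw integer column instead is what produced the meaningless loss values above.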