[ https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1226:
------------------------------------
    Priority: Minor  (was: Major)

> Add option for 1-hot encoding to minibatch preprocessor
> -------------------------------------------------------
>
>                 Key: MADLIB-1226
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1226
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.14
>
>
> I was testing the MNIST dataset with the minibatch preprocessor + MLP and
> could not get it to converge. It turned out to be user error (me) and not
> a problem with convergence at all, because I forgot to 1-hot encode the
> dependent variable.
> But I am wondering if other people might do the same thing that I did and
> get confused.
> Here's what I did. For this input data:
> {code}
> madlib=# \d+ public.mnist_train
>                                   Table "public.mnist_train"
>  Column |   Type    |                        Modifiers                         | Storage  | Stats target | Description
> --------+-----------+----------------------------------------------------------+----------+--------------+-------------
>  y      | integer   |                                                          | plain    |              |
>  x      | integer[] |                                                          | extended |              |
>  id     | integer   | not null default                                         | plain    |              |
>         |           | nextval('mnist_train_id_seq'::regclass)                  |          |              |
> {code}
> I called the minibatch preprocessor:
> {code}
> SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
>                                      'mnist_train_packed', -- Output table
>                                      'y',                  -- Dependent variable
>                                      'x'                   -- Independent variables
>                                      );
> {code}
> then mlp:
> {code}
> SELECT madlib.mlp_classification(
>     'mnist_train_packed',    -- Source table from preprocessor output
>     'mnist_result',          -- Destination table
>     'independent_varname',   -- Independent
>     'dependent_varname',     -- Dependent
>     ARRAY[5],                -- Hidden layer sizes
>     'learning_rate_init=0.01,
>     n_iterations=20,
>     learning_rate_policy=exp, n_epochs=20,
>     lambda=0.0001,           -- Regularization
>     tolerance=0',
>     'tanh',                  -- Activation function
>     '',                      -- No weights
>     FALSE,                   -- No warmstart
>     TRUE);                   -- Verbose
> {code}
> with the result:
> {code}
> INFO: Iteration: 2, Loss: <-79.5295531257>
> INFO: Iteration: 3, Loss: <-79.529408892>
> INFO: Iteration: 4, Loss: <-79.5291940436>
> INFO: Iteration: 5, Loss: <-79.5288964944>
> INFO: Iteration: 6, Loss: <-79.5285051451>
> INFO: Iteration: 7, Loss: <-79.5280094708>
> INFO: Iteration: 8, Loss: <-79.5273995189>
> INFO: Iteration: 9, Loss: <-79.5266665607>
> {code}
> So it did not error out, but it is clearly not working on data in the
> right format.
> I suggest 2 changes:
> 1) Add an explicit param in the mini-batch preprocessor for 1-hot encoding
> of the dependent variable (this JIRA)
> 2) Add a check to the MLP classification code that the dependent variable
> has been 1-hot encoded, and error out if that is not the case. (JIRA xxx)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
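For reference, the two suggested behaviors can be sketched outside of SQL. The following Python snippet is only an illustration of the idea, not MADlib's implementation: `one_hot` shows what encoding the integer dependent variable would produce, and `is_one_hot` shows the kind of validation check proposed in (2). Both function names are hypothetical.

```python
# Sketch only: 1-hot encoding of an integer label, plus a validation
# check like the one suggested in change (2). Names are hypothetical,
# not part of MADlib's API.

def one_hot(y, num_classes):
    """Encode integer label y in [0, num_classes) as a 1-hot list."""
    row = [0] * num_classes
    row[y] = 1
    return row

def is_one_hot(row):
    """True if row is a valid 1-hot vector: exactly one 1, rest 0."""
    return all(v in (0, 1) for v in row) and sum(row) == 1

labels = [5, 0, 4]                        # e.g. MNIST digit labels
encoded = [one_hot(y, 10) for y in labels]

# Encoded rows pass the check; a raw integer label would be rejected,
# which is the error the MLP classifier could raise up front.
assert all(is_one_hot(r) for r in encoded)
assert not is_one_hot([5])
```

With a check like this, the mis-use described above would fail fast with a clear message instead of silently producing a near-constant loss.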