Frank McQuillan created MADLIB-1226:
---------------------------------------
Summary: Add option for 1-hot encoding to minibatch preprocessor
Key: MADLIB-1226
URL: https://issues.apache.org/jira/browse/MADLIB-1226
Project: Apache MADlib
Issue Type: Improvement
Components: Module: Utilities
Reporter: Frank McQuillan
Fix For: v1.14
I was testing the MNIST dataset with the minibatch preprocessor + MLP and could not
get it to converge. It turned out to be user error on my part, not a convergence
problem at all: I had forgotten to 1-hot encode the dependent variable. But I am
wondering if other people might make the same mistake and get confused.
Here's what I did. For this input data:
{code}
madlib=# \d+ public.mnist_train
Table "public.mnist_train"
Column | Type | Modifiers
| Storage | Stats target | Description
--------+-----------+----------------------------------------------------------+----------+--------------+-------------
y | integer |
| plain | |
x | integer[] |
| extended | |
id | integer | not null default nextval('mnist_train_id_seq'::regclass)
| plain | |
{code}
I called the minibatch preprocessor:
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
                                     'mnist_train_packed', -- Output table
                                     'y',                  -- Dependent variable
                                     'x'                   -- Independent variables
                                    );
{code}
and then MLP:
{code}
SELECT madlib.mlp_classification(
    'mnist_train_packed',   -- Source table from preprocessor output
    'mnist_result',         -- Destination table
    'independent_varname',  -- Independent variable column
    'dependent_varname',    -- Dependent variable column
    ARRAY[5],               -- Hidden layer sizes
    'learning_rate_init=0.01,
     n_iterations=20,
     learning_rate_policy=exp,
     n_epochs=20,
     lambda=0.0001,
     tolerance=0',          -- Optimizer params (lambda is regularization)
    'tanh',                 -- Activation function
    '',                     -- No weights
    FALSE,                  -- No warmstart
    TRUE);                  -- Verbose
{code}
with the result:
{code}
INFO: Iteration: 2, Loss: <-79.5295531257>
INFO: Iteration: 3, Loss: <-79.529408892>
INFO: Iteration: 4, Loss: <-79.5291940436>
INFO: Iteration: 5, Loss: <-79.5288964944>
INFO: Iteration: 6, Loss: <-79.5285051451>
INFO: Iteration: 7, Loss: <-79.5280094708>
INFO: Iteration: 8, Loss: <-79.5273995189>
INFO: Iteration: 9, Loss: <-79.5266665607>
{code}
So it did not error out, but it is clearly not working on data in the right format: the loss barely moves from one iteration to the next.
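For reference, here is a minimal sketch of the manual workaround, i.e. 1-hot encoding the integer label before calling the preprocessor. The table name mnist_train_1hot is made up for this example, and the encoding query is illustrative only:
{code}
-- 1-hot encode the integer label y (digits 0-9) into a 10-element 0/1 array
CREATE TABLE mnist_train_1hot AS
SELECT id,
       x,
       ARRAY(SELECT CASE WHEN y = d THEN 1 ELSE 0 END
             FROM generate_series(0, 9) AS d
             ORDER BY d) AS y
FROM mnist_train;

-- Then call the preprocessor on the encoded table as before
SELECT madlib.minibatch_preprocessor('mnist_train_1hot',   -- Source table
                                     'mnist_train_packed', -- Output table
                                     'y',                  -- Dependent variable (now 1-hot)
                                     'x'                   -- Independent variables
                                    );
{code}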
I suggest 2 changes:
1) Add an explicit param to the mini-batch preprocessor for 1-hot encoding of the
dependent variable (this JIRA); a possible form of the call is sketched below.
2) Add a check to the MLP classification code that the dependent variable has been
1-hot encoded, and error out if it has not. (JIRA xxx)
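To make the first suggestion concrete, the new param could be an optional boolean appended to the argument list. This is only a sketch, assuming the existing optional grouping and buffer-size arguments keep their positions; the flag name one_hot_encode_int_dep_var and its placement are assumptions, not an agreed design:
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
                                     'mnist_train_packed', -- Output table
                                     'y',                  -- Dependent variable
                                     'x',                  -- Independent variables
                                     NULL,                 -- Grouping columns (none)
                                     NULL,                 -- Buffer size (default)
                                     TRUE                  -- Hypothetical flag:
                                                           -- one_hot_encode_int_dep_var
                                    );
{code}
With such a flag the preprocessor would map the distinct integer class values to 0/1 arrays itself, so the MLP example above would work unchanged.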