[ https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1226:
------------------------------------
Description:
I was testing the MNIST dataset with the minibatch preprocessor + MLP and could
not get it to converge. It turned out to be user error (mine) and not a
convergence problem at all: I forgot to 1-hot encode the dependent variable.
But I am wondering if other people might do the same thing that I did and get
confused.
Here's what I did. For this input data:
{code}
madlib=# \d+ public.mnist_train
                                             Table "public.mnist_train"
 Column |   Type    |                         Modifiers                         | Storage  | Stats target | Description
--------+-----------+-----------------------------------------------------------+----------+--------------+-------------
 y      | integer   |                                                           | plain    |              |
 x      | integer[] |                                                           | extended |              |
 id     | integer   | not null default nextval('mnist_train_id_seq'::regclass) | plain    |              |
{code}
I called minibatch preprocessor:
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x'                    -- Independent variables
                                    );
{code}
then mlp:
{code}
SELECT madlib.mlp_classification(
    'mnist_train_packed',    -- Source table from preprocessor output
    'mnist_result',          -- Destination table
    'independent_varname',   -- Independent variable column
    'dependent_varname',     -- Dependent variable column
    ARRAY[5],                -- Hidden layer sizes
    'learning_rate_init=0.01,
     n_iterations=20,
     learning_rate_policy=exp,
     n_epochs=20,
     lambda=0.0001,
     tolerance=0',           -- Optimizer params (lambda is the regularization term)
    'tanh',                  -- Activation function
    '',                      -- No weights
    FALSE,                   -- No warmstart
    TRUE);                   -- Verbose
{code}
with the result:
{code}
INFO: Iteration: 2, Loss: <-79.5295531257>
INFO: Iteration: 3, Loss: <-79.529408892>
INFO: Iteration: 4, Loss: <-79.5291940436>
INFO: Iteration: 5, Loss: <-79.5288964944>
INFO: Iteration: 6, Loss: <-79.5285051451>
INFO: Iteration: 7, Loss: <-79.5280094708>
INFO: Iteration: 8, Loss: <-79.5273995189>
INFO: Iteration: 9, Loss: <-79.5266665607>
{code}
So it did not error out, but it is clearly not operating on data in the right format.
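For reference, a minimal sketch of the manual 1-hot encoding of y that would have
avoided this (not part of the original run; the mnist_train_encoded table and
y_onehot column are hypothetical names used for illustration):
{code}
-- Hypothetical workaround: manually 1-hot encode the integer label y (0-9)
-- into a 10-element array before calling the minibatch preprocessor.
CREATE TABLE mnist_train_encoded AS
SELECT id,
       x,
       ARRAY(SELECT CASE WHEN y = d THEN 1 ELSE 0 END
             FROM generate_series(0, 9) AS d
             ORDER BY d) AS y_onehot
FROM mnist_train;
-- Then pass 'y_onehot' as the dependent variable to minibatch_preprocessor.
{code}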
I suggest 2 changes:
1) Add an explicit parameter to the mini-batch preprocessor for 1-hot encoding of
scalar integer dependent variables (this JIRA).
2) Add a check to the MLP classification code that the dependent variable has been
1-hot encoded, and error out if that is not the case
(https://issues.apache.org/jira/browse/MADLIB-1226); see the sketch after this list.
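To illustrate the kind of condition the check in (2) could test (a sketch only,
assuming the preprocessor's standard output columns dependent_varname and
independent_varname; the exact check MADlib should implement is left open):
{code}
-- Hedged illustration: inspect the shape of the packed dependent variable.
-- With 1-hot encoding it should be a 2-D array with num_classes > 1.
SELECT array_ndims(dependent_varname)    AS dep_ndims,
       array_upper(dependent_varname, 2) AS num_classes
FROM mnist_train_packed
LIMIT 1;
{code}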
Proposed interface:
{code}
minibatch_preprocessor( source_table,
output_table,
dependent_varname,
independent_varname,
grouping_col,
buffer_size,
one_hot_encode_int_dep_var
)
{code}
{code}
one_hot_encode_int_dep_var (optional)
BOOLEAN. Default: FALSE. Whether to one-hot encode dependent variables that are
scalar integers. This parameter is ignored if the dependent variable is not a
scalar integer.
More detail: the mini-batch preprocessor automatically encodes dependent
variables that are Boolean and character types such as text, char and varchar.
However, scalar integers are a special case because they can be used in both
classification and regression problems, so you must tell the mini-batch
preprocessor whether you want to encode them or not.
If you have already encoded the dependent variable yourself, you can ignore
this parameter. Also, if you want to encode float values for some reason, cast
them to text first.
{code}
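For completeness, a usage sketch of the proposed parameter
(one_hot_encode_int_dep_var does not exist yet; NULL is assumed for the grouping
and buffer-size arguments so they take their defaults):
{code}
-- Proposed usage (not yet implemented): have the preprocessor 1-hot encode
-- the scalar integer dependent variable y.
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x',                   -- Independent variables
                                     NULL,                  -- Grouping cols
                                     NULL,                  -- Buffer size (default)
                                     TRUE                   -- one_hot_encode_int_dep_var
                                    );
{code}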
> Add option for 1-hot encoding to minibatch preprocessor
> -------------------------------------------------------
>
> Key: MADLIB-1226
> URL: https://issues.apache.org/jira/browse/MADLIB-1226
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v1.14
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)