Frank McQuillan created MADLIB-1226:
---------------------------------------

             Summary: Add option for 1-hot encoding to minibatch preprocessor
                 Key: MADLIB-1226
                 URL: https://issues.apache.org/jira/browse/MADLIB-1226
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Utilities
            Reporter: Frank McQuillan
             Fix For: v1.14


I was testing the MNIST dataset with the minibatch preprocessor + MLP and could not get 
it to converge. It turned out to be user error (mine) and not a convergence problem 
at all: I forgot to 1-hot encode the dependent variable.

But I am wondering if other people might do the same thing I did and get 
confused.

Here's what I did.  For this input data:

{code}
madlib=# \d+ public.mnist_train
                                        Table "public.mnist_train"
 Column |   Type    |                         Modifiers                          | Storage  | Stats target | Description 
--------+-----------+------------------------------------------------------------+----------+--------------+-------------
 y      | integer   |                                                            | plain    |              | 
 x      | integer[] |                                                            | extended |              | 
 id     | integer   | not null default nextval('mnist_train_id_seq'::regclass)  | plain    |              | 
{code}

I called the minibatch preprocessor:

{code}
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x'                    -- Independent variables
                                     );
{code}
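
In hindsight, a quick look at the packed output would have flagged the problem, 
since the dependent variable should come out as rows of 0/1 indicators rather 
than raw integer labels. A sanity check along these lines (a sketch; the column 
name dependent_varname matches what the MLP call below expects):

{code}
-- The packed dependent variable should hold 1-hot rows (0/1 per class),
-- not raw integer labels like {5,0,4,...}.
SELECT dependent_varname
FROM mnist_train_packed
LIMIT 1;
{code}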

then MLP:

{code}
SELECT madlib.mlp_classification(
    'mnist_train_packed',        -- Source table from preprocessor output
    'mnist_result',              -- Destination table
    'independent_varname',       -- Independent (column name in packed table)
    'dependent_varname',         -- Dependent (column name in packed table)
    ARRAY[5],                    -- Hidden layer sizes
    'learning_rate_init=0.01,
     n_iterations=20,
     learning_rate_policy=exp,
     n_epochs=20,
     lambda=0.0001,
     tolerance=0',               -- Optimizer params (lambda is regularization)
    'tanh',                      -- Activation function
    '',                          -- No weights
    FALSE,                       -- No warmstart
    TRUE);                       -- Verbose
{code}

with the result:

{code}
INFO:  Iteration: 2, Loss: <-79.5295531257>
INFO:  Iteration: 3, Loss: <-79.529408892>
INFO:  Iteration: 4, Loss: <-79.5291940436>
INFO:  Iteration: 5, Loss: <-79.5288964944>
INFO:  Iteration: 6, Loss: <-79.5285051451>
INFO:  Iteration: 7, Loss: <-79.5280094708>
INFO:  Iteration: 8, Loss: <-79.5273995189>
INFO:  Iteration: 9, Loss: <-79.5266665607>
{code}

So it did not error out, but it clearly was not operating on data in the right format.
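
For reference, the manual workaround is to expand the integer label into a 0/1 
array before preprocessing. A sketch, assuming 10 classes labeled 0-9 and the 
table layout above (the table name mnist_train_1hot is illustrative):

{code}
-- Expand y into a 10-element 0/1 array (position i is 1 iff y = i),
-- then run the minibatch preprocessor on this table instead.
CREATE TABLE mnist_train_1hot AS
SELECT id,
       x,
       ARRAY(SELECT CASE WHEN i = y THEN 1 ELSE 0 END
             FROM generate_series(0, 9) AS i
             ORDER BY i) AS y
FROM mnist_train;
{code}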

I suggest 2 changes:

1) Add an explicit param to the minibatch preprocessor for 1-hot encoding of the 
dependent variable (this JIRA; see the sketch after this list)

2) Add a check to the MLP classification code that the dependent var 
has been 1-hot encoded, and error out if that is not the case. (JIRA xxx)
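
For (1), the call could then look something like this (a sketch only; the flag 
name and its position in the argument list are illustrative, not a settled API):

{code}
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x',                   -- Independent variables
                                     NULL,                  -- Grouping cols (default)
                                     NULL,                  -- Buffer size (default)
                                     TRUE                   -- Proposed: 1-hot encode
                                                            -- integer dependent var
                                     );
{code}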
