[
https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436614#comment-16436614
]
Frank McQuillan commented on MADLIB-1226:
-----------------------------------------
No encoding:
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
                                     'mnist_train_packed', -- Output table
                                     'y',                  -- Dependent variable
                                     'x',                  -- Independent variables
                                     NULL,                 -- Grouping
                                     10,                   -- Buffer size
                                     FALSE                 -- One-hot encode integer dependent var
                                    );
{code}
{code}
madlib=# select dependent_varname from mnist_train_packed limit 1;
-[ RECORD 1 ]-----+------------------------------------------
dependent_varname | {{3},{0},{0},{3},{1},{3},{8},{2},{8},{3}}
{code}
Yes encoding:
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
                                     'mnist_train_packed', -- Output table
                                     'y',                  -- Dependent variable
                                     'x',                  -- Independent variables
                                     NULL,                 -- Grouping
                                     10,                   -- Buffer size
                                     TRUE                  -- One-hot encode integer dependent var
                                    );
{code}
{code}
-[ RECORD 1 ]-----+------------------------------------------
dependent_varname | {{0,1,0,0,0,0,0,0,0,0},{0,0,1,0,0,0,0,0,0,0},{0,1,0,0,0,0,0,0,0,0},{0,0,1,0,0,0,0,0,0,0},{0,0,0,0,0,1,0,0,0,0},{1,0,0,0,0,0,0,0,0,0},{1,0,0,0,0,0,0,0,0,0},{0,0,0,0,0,0,0,1,0,0},{0,0,0,0,0,0,0,1,0,0},{0,0,0,0,1,0,0,0,0,0}}
{code}
Encoding flag TRUE, but did not encode since the dependent variable is float (correct):
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
                                     'mnist_train_packed', -- Output table
                                     'y::FLOAT',           -- Dependent variable
                                     'x',                  -- Independent variables
                                     NULL,                 -- Grouping
                                     10,                   -- Buffer size
                                     TRUE                  -- One-hot encode integer dependent var
                                    );
{code}
{code}
madlib=# select dependent_varname from mnist_train_packed limit 1;
dependent_varname
-------------------------------------------
{{9},{5},{0},{2},{5},{7},{8},{5},{8},{8}}
(1 row)
{code}
> Add option for 1-hot encoding to minibatch preprocessor
> -------------------------------------------------------
>
> Key: MADLIB-1226
> URL: https://issues.apache.org/jira/browse/MADLIB-1226
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v1.14
>
>
> I was testing the MNIST dataset with the minibatch preprocessor + MLP and could
> not get it to converge. It turned out to be user error (mine), not a convergence
> problem at all: I had forgotten to 1-hot encode the dependent variable.
> But I am wondering if other people might do the same thing that I did and get
> confused.
> Here's what I did. For this input data:
> {code}
> madlib=# \d+ public.mnist_train
>                                          Table "public.mnist_train"
>  Column |   Type    |                         Modifiers                         | Storage  | Stats target | Description
> --------+-----------+-----------------------------------------------------------+----------+--------------+-------------
>  y      | integer   |                                                           | plain    |              |
>  x      | integer[] |                                                           | extended |              |
>  id     | integer   | not null default nextval('mnist_train_id_seq'::regclass)  | plain    |              |
> I called the minibatch preprocessor:
> {code}
> SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
>                                      'mnist_train_packed', -- Output table
>                                      'y',                  -- Dependent variable
>                                      'x'                   -- Independent variables
>                                     );
> {code}
> then MLP:
> {code}
> SELECT madlib.mlp_classification(
>     'mnist_train_packed',    -- Source table from preprocessor output
>     'mnist_result',          -- Destination table
>     'independent_varname',   -- Independent variable
>     'dependent_varname',     -- Dependent variable
>     ARRAY[5],                -- Hidden layer sizes
>     'learning_rate_init=0.01,
>      n_iterations=20,
>      learning_rate_policy=exp,
>      n_epochs=20,
>      lambda=0.0001,
>      tolerance=0',           -- Optimizer params (lambda = regularization)
>     'tanh',                  -- Activation function
>     '',                      -- No weights
>     FALSE,                   -- No warmstart
>     TRUE);                   -- Verbose
> with the result:
> {code}
> INFO: Iteration: 2, Loss: <-79.5295531257>
> INFO: Iteration: 3, Loss: <-79.529408892>
> INFO: Iteration: 4, Loss: <-79.5291940436>
> INFO: Iteration: 5, Loss: <-79.5288964944>
> INFO: Iteration: 6, Loss: <-79.5285051451>
> INFO: Iteration: 7, Loss: <-79.5280094708>
> INFO: Iteration: 8, Loss: <-79.5273995189>
> INFO: Iteration: 9, Loss: <-79.5266665607>
> {code}
> So it did not error out, but it is clearly not operating on data in the right
> format.
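> What the run above was missing is a one-hot encoded dependent variable. A
> minimal sketch of encoding it by hand for the 10 MNIST classes (table and
> column names from the example above; mnist_train_encoded and y_onehot are
> hypothetical names):
> {code}
> CREATE TABLE mnist_train_encoded AS
> SELECT id,
>        x,
>        -- Build a 10-element indicator array: 1 at position y, 0 elsewhere.
>        ARRAY(SELECT CASE WHEN d = y THEN 1 ELSE 0 END
>              FROM generate_series(0, 9) AS d
>              ORDER BY d) AS y_onehot
> FROM mnist_train;
> {code}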
> I suggest 2 changes:
> 1) Add an explicit param in the mini-batch preprocessor for 1-hot encoding of
> scalar integer dependent variables (this JIRA)
> 2) Add a check to the MLP classification code to verify that the dependent var
> has been 1-hot encoded, and error out if that is not the case; one possible
> signal is sketched after this list.
> (https://issues.apache.org/jira/browse/MADLIB-1226)
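> For the check in 2), one readily available signal is the width of the second
> dimension of the packed dependent variable array; this query is a sketch
> against the example output above, not the proposed implementation:
> {code}
> -- If the dependent variable was one-hot encoded, each buffer row holds one
> -- indicator column per class, so the second dimension is > 1 (10 for MNIST).
> SELECT array_upper(dependent_varname, 2) AS dep_width
> FROM mnist_train_packed
> LIMIT 1;
> -- dep_width = 1  => not encoded, e.g. {{3},{0},...}
> -- dep_width = 10 => one-hot encoded
> {code}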
> Proposed interface:
> {code}
> minibatch_preprocessor( source_table,
>                         output_table,
>                         dependent_varname,
>                         independent_varname,
>                         grouping_col,
>                         buffer_size,
>                         one_hot_encode_int_dep_var
>                       )
> {code}
> {code}
> one_hot_encode_int_dep_var (optional)
> BOOLEAN. Default: FALSE. Whether to one-hot encode dependent variables that
> are scalar integers. This parameter is ignored if the dependent variable is
> not a scalar integer.
>
> More detail: the mini-batch preprocessor automatically encodes dependent
> variables that are Boolean or character types such as text, char, and
> varchar. However, scalar integers are a special case because they can be
> used in both classification and regression problems, so you must tell the
> mini-batch preprocessor whether you want to encode them or not.
>
> If you have already encoded the dependent variable yourself, you can ignore
> this parameter. Also, if you want to encode float values for some reason,
> cast them to text first.
> {code}
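> A sketch of the text cast mentioned above, assuming a hypothetical FLOAT8
> dependent column yf; casting it to text makes the preprocessor treat it as a
> categorical variable and encode it regardless of the flag:
> {code}
> SELECT madlib.minibatch_preprocessor('mnist_train',        -- Source table
>                                      'mnist_train_packed', -- Output table
>                                      'yf::TEXT',           -- Float cast to text to force encoding
>                                      'x'                   -- Independent variables
>                                     );
> {code}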
>