fmcquillan99 edited a comment on pull request #506: URL: https://github.com/apache/madlib/pull/506#issuecomment-665993038
errors and issues

(1)
```
SELECT madlib.generate_model_selection_configs(
    'model_arch_library',   -- model architecture table
    'mst_table',            -- model selection table output
    ARRAY[1,2],             -- model ids from model architecture table
    $$ { 'lr': [1.0, 2.0, 'linear'] } $$,        -- compile_param_grid
    $$ { 'batch_size': [8], 'epochs': [1] } $$,  -- fit_param_grid
    'random',               -- search_type ('grid' or 'random', default 'grid')
    5,                      -- num_configs (number of sampled parameters, default 10) [to limit testing]
    NULL,                   -- random_state
    NULL                    -- object table (Default=None)
);
```
produces
```
InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'float' objects (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "generate_model_selection_configs", line 21, in <module>
    mst_loader = madlib_keras_model_selection.MstSearch(**globals())
  PL/Python function "generate_model_selection_configs", line 42, in wrapper
  PL/Python function "generate_model_selection_configs", line 287, in __init__
  PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
  PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
PL/Python function "generate_model_selection_configs"
[SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', 'mst_table', ARRAY[1,2], $$ { 'loss': ['categorical_crossentropy'], 'lr': [0.0001, 0.1, 'linear'] } $$, $$ { 'batch_size': [8], 'epochs': [1] } $$, 'random', 5, NULL, NULL );]
(Background on this error at: http://sqlalche.me/e/2j85)
```
Likewise
```
DROP TABLE IF EXISTS mst_table, mst_table_summary;
SELECT madlib.generate_model_selection_configs(
    'model_arch_library',   -- model architecture table
    'mst_table',            -- model selection table output
    ARRAY[1,2],             -- model ids from model architecture table
    $$ { 'lr': [1.0, 2.0, 'log'], } $$,          -- compile_param_grid
    $$ { 'batch_size': [8], 'epochs': [1] } $$,  -- fit_param_grid
    'random',               -- search_type ('grid' or 'random', default 'grid')
    1,                      -- num_configs (number of sampled parameters, default 10) [to limit testing]
    NULL,                   -- random_state
    NULL                    -- object table (Default=None)
);
SELECT * FROM mst_table ORDER BY mst_key;
```
produces
```
InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'numpy.float64' objects (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "generate_model_selection_configs", line 21, in <module>
    mst_loader = madlib_keras_model_selection.MstSearch(**globals())
  PL/Python function "generate_model_selection_configs", line 42, in wrapper
  PL/Python function "generate_model_selection_configs", line 287, in __init__
  PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
  PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
PL/Python function "generate_model_selection_configs"
[SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', 'mst_table', ARRAY[1,2], $$ { 'lr': [1.0, 2.0, 'log'], } $$, $$ { 'batch_size': [8], 'epochs': [1] } $$, 'random', 1, NULL, NULL );]
(Background on this error at: http://sqlalche.me/e/2j85)
```

(2) For `search_type` = 'grid' or 'random', the user should be able to enter just part of the string, e.g., 'rand' for random or 'g' for grid. There is a MADlib function that supports this.

(3) Change the name of the function from `generate_model_selection_configs` to `generate_model_configs`.

(4) Remove exclamation marks (!) and random capitalization from error messages. Suggested messages:
- "DL: 'num_configs' and 'random_state' must be NULL for grid search"
- "DL: Cannot search from a distribution with grid search"
- "DL: 'num_configs' cannot be NULL for random search"
- "DL: 'search_type' must be either 'grid' or 'random'"
- "DL: Please choose a valid distribution type ('linear' or 'log')"
- "DL: {0} should be of the format [lower_bound, upper_bound, distribution_type]"

(5) In addition to `linear` sampling and `log` sampling, we should add another type called `log_near_one`:
```
config_dict[cp] = 1.0 - np.power(10, np.random.uniform(np.log10(1.0 - param_values[1]),
                                                       np.log10(1.0 - param_values[0])))
```
This type of sampling is useful for exponentially-weighted-average-type params like momentum, which are very sensitive to changes near 1. It has the effect of producing more values near 1 than regular log sampling. For example, momentum values in the range [0.9000, 0.9005] average the previous ~10 values no matter where you are in the range (no difference), but momentum values in the range [0.9990, 0.9995] average the previous ~1000 values at the left end and ~2000 values at the right end (big difference), so you want to generate more samples nearer the right end to get better coverage.
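As a sanity check on the proposed distribution, here is a self-contained pure-Python sketch that mirrors the NumPy expression above (the function name `sample_log_near_one` and the stdlib `random`/`math` implementation are mine, not MADlib code):

```python
import math
import random

def sample_log_near_one(lower, upper, size, seed=None):
    """Sample from [lower, upper] with density concentrated near 1,
    by drawing the exponent of the distance (1 - x) uniformly."""
    rng = random.Random(seed)
    lo_exp = math.log10(1.0 - upper)  # closest to 1 -> most negative exponent
    hi_exp = math.log10(1.0 - lower)
    return [1.0 - 10.0 ** rng.uniform(lo_exp, hi_exp) for _ in range(size)]

samples = sample_log_near_one(0.9, 0.999, size=10000, seed=42)
```

With lower=0.9 and upper=0.999, about half of the draws land above 0.99 (the midpoint of the exponent range), whereas plain linear sampling would put only about 9% of draws there.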
(6)
```
DROP TABLE IF EXISTS mst_table, mst_table_summary;
SELECT madlib.generate_model_selection_configs(
    'model_arch_library',   -- model architecture table
    'mst_table',            -- model selection table output
    ARRAY[1],               -- model ids from model architecture table
    $$ { 'loss': ['categorical_crossentropy'],
         'optimizer': ['Adam'],
         'lr': [0.9, 0.95, 'log'],
         'metrics': ['accuracy'] } $$,           -- compile_param_grid
    $$ { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
         'epochs': [1, 2, 3, 5, 10, 12] } $$,    -- fit_param_grid
    'random',               -- search_type
    5,                      -- num_configs
    NULL,                   -- random_state
    NULL                    -- object table (Default=None)
);
SELECT * FROM mst_table ORDER BY mst_key;
```
followed by
```
SELECT madlib.generate_model_selection_configs(
    'model_arch_library',   -- model architecture table
    'mst_table',            -- model selection table output
    ARRAY[1],               -- model ids from model architecture table
    $$ { 'loss': ['categorical_crossentropy'],
         'optimizer': ['SGD'],
         'metrics': ['accuracy'] } $$,           -- compile_param_grid
    $$ { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
         'epochs': [1, 2, 3, 5, 10, 12] } $$,    -- fit_param_grid
    'random',               -- search_type
    5,                      -- num_configs
    NULL,                   -- random_state
    NULL                    -- object table (Default=None)
);
SELECT * FROM mst_table ORDER BY mst_key;
```
produces
```
IntegrityError: (psycopg2.errors.UniqueViolation) plpy.SPIError: duplicate key value violates unique constraint "mst_table_model_id_key"  (seg0 10.128.0.41:40000 pid=22297)
DETAIL:  Key (model_id, compile_params, fit_params)=(1, optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy', epochs=12,batch_size=32) already exists.
CONTEXT:  Traceback (most recent call last):
  PL/Python function "generate_model_selection_configs", line 22, in <module>
    mst_loader.load()
  PL/Python function "generate_model_selection_configs", line 313, in load
  PL/Python function "generate_model_selection_configs", line 566, in insert_into_mst_table
PL/Python function "generate_model_selection_configs"
[SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', 'mst_table', ARRAY[1], $$ { 'loss': ['categorical_crossentropy'], 'optimizer': ['SGD'], 'metrics': ['accuracy'] } $$, $$ { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096], 'epochs': [1, 2, 3, 5, 10, 12] } $$, 'random', 5, NULL, NULL );]
(Background on this error at: http://sqlalche.me/e/gkpj)
```
But it only produced the error every second time I did this, i.e., the first pass would work and then the second pass would throw the error.
When it does pass, it produces
```
 mst_key | model_id |                                        compile_params                                        |        fit_params
---------+----------+----------------------------------------------------------------------------------------------+--------------------------
       1 |        1 | optimizer='Adam(lr=0.9063214445649174)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=10,batch_size=256
       2 |        1 | optimizer='Adam(lr=0.9367722192055232)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=5,batch_size=256
       3 |        1 | optimizer='Adam(lr=0.9212048311857509)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=32
       4 |        1 | optimizer='Adam(lr=0.9193149125403647)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=3,batch_size=256
       5 |        1 | optimizer='Adam(lr=0.9326284661833211)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=256
       6 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                        | epochs=10,batch_size=256
       7 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                        | epochs=5,batch_size=8
       8 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                        | epochs=2,batch_size=1024
       9 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                        | epochs=3,batch_size=32
      10 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                        | epochs=12,batch_size=8
(10 rows)
```
Is `optimizer='SGD()'...` correct, or should it be `optimizer='SGD'...`?

(7) Not all sub-params apply to all params.
For example, for optimizer, `lr` and `decay` might only apply to certain optimizer types and not others:
```
optimizer='SGD'
optimizer='rmsprop(lr=0.0001, decay=1e-6)'
optimizer='adam(lr=0.0001)'
```
In the previous method we accounted for that by doing:
```
SELECT madlib.load_model_selection_table(
    'model_arch_library',   -- model architecture table
    'mst_table',            -- model selection table output
    ARRAY[1,2],             -- model ids from model architecture table
    ARRAY[                  -- compile params
        $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.0001, decay=1e-6)',metrics=['accuracy']$$,
        $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.001, decay=1e-6)',metrics=['accuracy']$$,
        $$loss='categorical_crossentropy',optimizer='adam(lr=0.0001)',metrics=['accuracy']$$,
        $$loss='categorical_crossentropy',optimizer='adam(lr=0.001)',metrics=['accuracy']$$
    ],
    ARRAY[                  -- fit params
        $$batch_size=64,epochs=5$$,
        $$batch_size=128,epochs=5$$
    ]
);
```
but how do we do this in the new method `generate_model_configs`? You could call it multiple times and incrementally build up the `mst_table`, but when autoML methods call this function we need to support it in a one-shot manner.
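One way to support this in a single call would be to let `compile_param_grid` carry a nested list of optimizer-specific dictionaries. As a feasibility sketch only (the key name `my_list`, the function names, and the sampling helper are all hypothetical, not MADlib code), random search over such a structure could look like:

```python
import math
import random

def _sample_value(rng, values):
    # [lower, upper, 'log'] draws log-uniformly from the range;
    # any other list is treated as a set of discrete choices.
    # (A full version would also handle 'linear' and 'log_near_one'.)
    if len(values) == 3 and values[2] == 'log':
        lo, hi = math.log10(values[0]), math.log10(values[1])
        return 10.0 ** rng.uniform(lo, hi)
    return rng.choice(values)

def sample_compile_config(grid, seed=None):
    """Draw one compile-param config from a grid that may contain a
    nested list of optimizer-specific dicts under the 'my_list' key."""
    rng = random.Random(seed)
    config = {}
    for name, values in grid.items():
        if name == 'my_list':
            group = rng.choice(values)  # pick one optimizer group
            for sub_name, sub_values in group.items():
                config[sub_name] = _sample_value(rng, sub_values)
        else:
            # (A real implementation would keep list-valued params
            # such as metrics as lists; this sketch flattens them.)
            config[name] = _sample_value(rng, values)
    return config

grid = {
    'loss': ['categorical_crossentropy'],
    'my_list': [
        {'optimizer': ['SGD', 'Adagrad']},
        {'optimizer': ['rmsprop'], 'lr': [0.9, 0.95, 'log'], 'decay': [1e-6, 1e-4, 'log']},
        {'optimizer': ['Adam'], 'lr': [0.99, 0.995, 'log']},
    ],
    'metrics': ['accuracy'],
}
config = sample_compile_config(grid, seed=0)
```

Each call returns one flat config dict; sub-params such as `lr` appear only when the chosen optimizer group defines them, which is exactly the per-optimizer behavior described above.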
I would suggest nested dictionaries like:
```
SELECT madlib.generate_model_configs(
    'model_arch_library',   -- model architecture table
    'mst_table',            -- model selection table output
    ARRAY[1],               -- model ids from model architecture table
    $$
    {
        'loss': ['categorical_crossentropy'],
        'my_list': [
            {'optimizer': ['SGD', 'Adagrad']},
            {'optimizer': ['rmsprop'], 'lr': [0.9, 0.95, 'log'], 'decay': [1e-6, 1e-4, 'log']},
            {'optimizer': ['Adam'], 'lr': [0.99, 0.995, 'log']}
        ],
        'metrics': ['accuracy']
    }
    $$,                     -- compile_param_grid
    $$ { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
         'epochs': [1, 2, 3, 5, 10, 12] } $$,    -- fit_param_grid
    'random',               -- search_type
    5,                      -- num_configs
    NULL,                   -- random_state
    NULL                    -- object table (Default=None)
);
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org