[GitHub] [madlib] reductionista commented on a change in pull request #526: DL: AutoML encapsulation

GitBox Sat, 09 Jan 2021 20:23:59 -0800


reductionista commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r554188957




##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -79,6 +82,7 @@ class KerasAutoML(object):
         self.model_id_list = sorted(list(set(model_id_list)))
         self.compile_params_grid = compile_params_grid
         self.fit_params_grid = fit_params_grid
+        self.dist_key_col = DISTRIBUTION_KEY_COLNAME

Review comment:
       Currently it's not required, but while working on the Model Hopper 
refactor I realized it would help a lot with warm start if we eventually do 
require it.
   
   As I was working on optimizing weight initialization, I realized if we could 
rely on model output tables always having a dist key, then that would speed 
things up and avoid unnecessary work.  Otherwise the first step has to be 
copying the table over to one which does have the dist key, which usually 
involves shuffling the weights around to different segments.  If there is no 
dist key, then we can't assume anything about how the weights are distributed 
so there is no way to optimize that part.   For all we (or gpdb) knows, all of 
the weights might be on the same segment with none on any other segments.
   
   All newly generated output tables will have the dist key in them (I should 
make that change to fit also, come to think of it), but because they won't for 
v1.17 I don't require it as an input for warm start yet... we still do the 
extra unnecessary shuffling each time for backwards compatibility.
   
   So there's nothing necessary about it right now, but the earlier we get this 
into the codebase the earlier we can drop compatibility for warm start on 
output tables missing a dist key.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else 
'$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else 
'$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM 
{model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM 
{model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       My intention with this PR was to try to avoid as much as possible having 
AutoML depend on the names and meanings of internal class variables in 
FitMultiple.  In the case of the model output table, there were multiple 
different table names inside this case, and while refactoring they changed 
around.  So the motivation was mainly to avoid the headache that resulted from 
having to go through and modify all the references to those variables in AutoML 
again, if they change in the future.
   
   The name of the model output table is chosen by AutoML rather than by 
FitMultiple, so the was a strong case for having AutoML manage that one.  For 
the summary table, I agree that the case is weaker... since the name is not 
entirely chosen by AutoML... the _summary prefix is appended to whatever AutoML 
chooses by FitMultiple.
   
   I considered leaving that one as-is, but decided it seems better to have 
AutoML add the _summary itself for a couple reasons:  first, if we do change 
anything in the future in FitMultiple, it's much more likely to be the class 
variable name that changes than the convention of adding '_summary', since that 
is part of the user-facing API:  user specifies what the output table name is, 
and they should expect _summary to be appended as described in the docs.  
AutoML is using FitMultiple in a way similar to how a user would... so it will 
only break if we break backwards compatibility for the user... but won't break 
if we refactor the code inside FitMultiple.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else 
'$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else 
'$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM 
{model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM 
{model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       My intention with this PR was to try to avoid as much as possible having 
AutoML depend on the names and meanings of internal class variables in 
FitMultiple.  In the case of the model output table, there were multiple 
different table names inside this class, and while refactoring they changed 
around.  So the motivation was mainly to avoid the headache that resulted from 
having to go through and modify all the references to those variables in AutoML 
again, if they change in the future.
   
   The name of the model output table is chosen by AutoML rather than by 
FitMultiple, so there was a strong case for having AutoML manage that one.  For 
the summary table, I agree that the case is weaker... since the name is not 
entirely chosen by AutoML... the _summary prefix is appended to whatever AutoML 
chooses by FitMultiple.
   
   I considered leaving that one as-is, but decided it seems better to have 
AutoML add the _summary itself for a couple reasons:  first, if we do change 
anything in the future in FitMultiple, it's much more likely to be the class 
variable name that changes than the convention of adding '_summary', since that 
is part of the user-facing API:  user specifies what the output table name is, 
and they should expect _summary to be appended as described in the docs.  
AutoML is using FitMultiple in a way similar to how a user would... so it will 
only break if we break backwards compatibility for the user... but won't break 
if we refactor the code inside FitMultiple.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [madlib] reductionista commented on a change in pull request #526: DL: AutoML encapsulation

Reply via email to