[madlib] 02/04: DL: Fix validation in fit, fit multiple, evaluate and predict

2021-02-09 Thread nkak
This is an automated email from the ASF dual-hosted git repository.

nkak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git

commit bdc67ec12f0263deaac5e2728f0c01521bd3b9ea
Author: Nikhil Kak 
AuthorDate: Fri Jan 22 16:43:01 2021 -0800

DL: Fix validation in fit, fit multiple, evaluate and predict

JIRA: MADLIB-1464

Previously, calling fit/fit_multiple/evaluate/predict with invalid
input or output tables (null or missing) produced the wrong error
message. This commit refactors the code so that the expected error
message is printed.

Refactored the validator code so that the info and summary table names
no longer have to be created in the fit_multiple class. Instead, the
validator creates them, and the validator object can then be used to
look up the table names. This makes it easier to validate all the
tables inside the validator class. This commit also moves all of the
validation code into the validator class, except for the source table
validation: the source table must be validated before calling
get_data_distribution_per_segment, which in turn has to run before the
validator constructor.
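
The sketch below is only a stand-alone illustration of this pattern, with
simplified stand-ins for the MADlib helpers (add_postfix, input_tbl_valid)
and illustrative class, attribute and module names; the actual validator in
madlib_keras_validator.py_in differs in detail.

    # Simplified stand-in: the real MADlib helper also handles
    # schema-qualified and quoted table names.
    def add_postfix(table_name, postfix):
        return table_name + postfix

    # Simplified stand-in: the real helper raises a module-specific error
    # when the table name is NULL/empty or the table does not exist.
    def input_tbl_valid(table_name, module_name):
        if not table_name:
            raise Exception("{0}: NULL/empty input table name".format(module_name))

    class FitMultipleInputValidator(object):
        """Hypothetical shape of the refactored validator: it derives the
        output info/summary table names itself, so the fit_multiple class
        only reads them back from this object."""
        def __init__(self, source_table, model_output_table, module_name):
            self.source_table = source_table
            self.output_summary_table = add_postfix(model_output_table, "_summary")
            self.model_info_table = add_postfix(model_output_table, "_info")
            # ... validation of the output/info/summary (and validation)
            # tables lives here, inside the validator.

    # Caller: the source table is still validated first, because
    # get_data_distribution_per_segment needs it before the validator
    # constructor runs.
    input_tbl_valid('iris_packed', 'madlib_keras_fit_multiple_model')
    validator = FitMultipleInputValidator('iris_packed', 'iris_models',
                                          'madlib_keras_fit_multiple_model')
    print(validator.model_info_table)   # -> iris_models_info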

To test this, we created a plpython function that asserts that a query
fails with the expected error message, and added a couple of wrapper
functions on top of it that test for NULL input and output tables.
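
A minimal sketch of the logic of such a helper, written in Python as it
could appear inside a PL/Python function; the function name, signature and
wording here are illustrative, not necessarily what this commit adds to
utilities.sql_in.

    # Runs inside a PL/Python (plpythonu) function, where `plpy` is in
    # scope, e.g. CREATE FUNCTION assert_query_error(query text, msg text).
    def assert_query_error(query, expected_error_msg):
        try:
            plpy.execute(query)
        except Exception as ex:
            # The query failed as expected; make sure it failed for the
            # right reason.
            assert expected_error_msg in str(ex), \
                "Query failed, but not with the expected message: {0}".format(ex)
            return
        assert False, "Query was expected to fail but succeeded: {0}".format(query)

The wrapper functions for NULL input and output tables can then call this
helper with a query that passes NULL for the relevant table argument and the
error message the module is expected to raise.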

Co-authored-by: Ekta Khanna 
---
 .../modules/deep_learning/madlib_keras.py_in   |  63 ++
 .../madlib_keras_fit_multiple_model.py_in  |  82 +++-
 .../deep_learning/madlib_keras_predict.py_in   |   3 +-
 .../deep_learning/madlib_keras_validator.py_in | 222 ++---
 .../test/madlib_keras_evaluate.sql_in  |   9 +
 .../deep_learning/test/madlib_keras_fit.sql_in |  42 
 .../test/madlib_keras_model_selection.sql_in   |  37 
 .../test/madlib_keras_multi_io.sql_in  |  25 +++
 .../deep_learning/test/madlib_keras_predict.sql_in |  20 ++
 .../test/madlib_keras_predict_byom.sql_in  |  27 +++
 .../test/unit_tests/test_madlib_keras.py_in|  33 ++-
 .../postgres/modules/utilities/utilities.sql_in|  26 +++
 12 files changed, 355 insertions(+), 234 deletions(-)

diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
index 49892b6..c4f8611 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
@@ -103,6 +103,7 @@ def fit(schema_madlib, source_table, model, model_arch_table,
     fit_params = "" if not fit_params else fit_params
     _assert(compile_params, "Compile parameters cannot be empty or NULL.")

+    input_tbl_valid(source_table, module_name)
     segments_per_host = get_data_distribution_per_segment(source_table)
     use_gpus = use_gpus if use_gpus else False
     if use_gpus:
@@ -114,51 +115,27 @@ def fit(schema_madlib, source_table, model, model_arch_table,

     if object_table is not None:
         object_table = "{0}.{1}".format(schema_madlib, quote_ident(object_table))
-
-    source_summary_table = add_postfix(source_table, "_summary")
-    input_tbl_valid(source_summary_table, module_name)
-    src_summary_dict = get_source_summary_table_dict(source_summary_table)
-
-    columns_dict = {}
-    columns_dict['mb_dep_var_cols'] = src_summary_dict['dependent_varname']
-    columns_dict['mb_indep_var_cols'] = src_summary_dict['independent_varname']
-    columns_dict['dep_shape_cols'] = [add_postfix(i, "_shape") for i in columns_dict['mb_dep_var_cols']]
-    columns_dict['ind_shape_cols'] = [add_postfix(i, "_shape") for i in columns_dict['mb_indep_var_cols']]
-
-    multi_dep_count = len(columns_dict['mb_dep_var_cols'])
-    val_dep_var = None
-    val_ind_var = None
-
-    val_dep_shape_cols = None
-    val_ind_shape_cols = None
-    if validation_table:
-        validation_summary_table = add_postfix(validation_table, "_summary")
-        input_tbl_valid(validation_summary_table, module_name)
-        val_summary_dict = get_source_summary_table_dict(validation_summary_table)
-
-        val_dep_var = val_summary_dict['dependent_varname']
-        val_ind_var = val_summary_dict['independent_varname']
-        val_dep_shape_cols = [add_postfix(i, "_shape") for i in val_dep_var]
-        val_ind_shape_cols = [add_postfix(i, "_shape") for i in val_ind_var]
-
     fit_validator = FitInputValidator(
         source_table, validation_table, model, model_arch_table, model_id,
-        columns_dict['mb_dep_var_cols'], columns_dict['mb_indep_var_cols'],
-        columns_dict['dep_shape_cols'], columns_dict['ind_shape_cols'],
        num_iterations, metrics_compute_frequency, warm_start,
-        use_gpus, accessible_gpus_for_seg, object_table,
-        val_dep_var, 
