Domino Valdano created MADLIB-1443:
--------------------------------------
Summary: Crash in fit_multiple when any model reaches loss=nan
Key: MADLIB-1443
URL: https://issues.apache.org/jira/browse/MADLIB-1443
Project: Apache MADlib
Issue Type: Bug
Components: Deep Learning
Reporter: Domino Valdano
There's a crash that can happen in {{fit_multiple}} (and probably in {{fit}} as
well, though I haven't tested that) when the loss becomes {{nan}} for a model.
I hit it with these compile params for one of the models:
$$loss='categorical_crossentropy',optimizer='SGD(lr=0.05, momentum=1.1)',metrics=['accuracy']$$
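As a rough illustration of why a momentum above 1 can end in {{nan}}, here is a toy sketch (not the actual training job): classic, non-Nesterov momentum SGD on the one-dimensional function f(w) = 0.5 * w^2. The function name {{run_sgd_momentum}} and the toy objective are made up for this example; only lr=0.05 and momentum=1.1 come from the params above.

```python
import math


def run_sgd_momentum(lr, momentum, steps, w0=1.0):
    """Minimise f(w) = 0.5 * w**2 with classic momentum SGD.

    Hypothetical helper for illustration only. Returns the loss
    recorded after every step.
    """
    w, v = w0, 0.0
    losses = []
    for _ in range(steps):
        grad = w                      # derivative of 0.5 * w^2
        v = momentum * v - lr * grad  # velocity update
        w = w + v                     # parameter update
        # Plain float multiplication overflows to inf instead of raising.
        losses.append(0.5 * w * w)
    return losses


if __name__ == "__main__":
    bad = run_sgd_momentum(lr=0.05, momentum=1.1, steps=20000)
    good = run_sgd_momentum(lr=0.05, momentum=0.9, steps=20000)
    print("momentum=1.1 final loss:", bad[-1])
    print("momentum=0.9 final loss:", good[-1])
```

With momentum=1.1 the velocity is amplified on every step instead of damped, so in this toy case the loss first overflows to inf and later turns into nan (inf minus inf), and no exception is raised along the way. With momentum=0.9 the same loop converges.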
Clearly, this was not a great choice for the momentum hyperparameter, but Keras
does accept it and trains all the way through with no errors or exceptions.
The problem is that the loss becomes infinite (or undefined?) at some point.
All 8 models trained for 10 hours and printed out their results, and then
{{madlib_keras_fit_multiple}} crashed while trying to write out the final info
table:
{{ Training set after iteration 1:}}
{{ mst_key=7: metric=0.446168005466, loss=2.39643478394}}
{{ mst_key=12: metric=0.00999999977648, loss=nan}}
{{ mst_key=11: metric=0.165068000555, loss=4.0407166481}}
{{...}}
{{ Validation set after iteration 1:}}
{{ mst_key=7: metric=0.359100013971, loss=2.89618015289}}
{{ mst_key=12: metric=0.00999999977648, loss=nan}}
{{ mst_key=11: metric=0.151299998164, loss=4.0829615593}}
{{...}}
{{CONTEXT: PL/Python function "madlib_keras_fit_multiple_model"}}
{{psql:run_fit_mult100.sql:14: ERROR: spiexceptions.UndefinedColumn: column "nan" does not exist}}
{{LINE 4: training_loss_final = nan,}}
{{ ^}}
{{QUERY:
UPDATE places100_mult_model_444_july7_info SET
training_metrics_final = 0.00999999977648,
training_loss_final = nan,
metrics_elapsed_time = ARRAY[33260.02720808983],
training_metrics = ARRAY[0.009999999776482582],
training_loss = ARRAY[nan]
WHERE mst_key = 12}}
{{CONTEXT: Traceback (most recent call last):}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>}}
{{ fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 195, in __init__}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 543, in insert_info_table}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 539, in update_info_table}}
{{ PL/Python function "madlib_keras_fit_multiple_model"}}
So even though most of the models trained fine, the failure rolled back all of
their output, and they all have to be trained from scratch again. While we're
at it, we should probably look for other places where {{nan}} might occur.
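One possible direction, sketched under assumptions: the failing UPDATE pastes the Python float straight into the SQL text, so {{nan}} is parsed as a column name. PostgreSQL accepts the quoted strings 'NaN', 'Infinity' and '-Infinity' as double precision input, so a small formatting helper could emit those instead. The names {{sql_float_literal}} and {{sql_float_array}} are hypothetical and this is not the actual MADlib code or fix.

```python
import math


def sql_float_literal(value):
    """Render a Python float as a PostgreSQL double precision literal.

    Hypothetical helper: non-finite values become quoted, cast strings
    so they are not parsed as column names.
    """
    value = float(value)
    if math.isnan(value):
        return "'NaN'::double precision"
    if math.isinf(value):
        if value > 0:
            return "'Infinity'::double precision"
        return "'-Infinity'::double precision"
    # repr() of a finite float is already a valid numeric literal.
    return repr(value)


def sql_float_array(values):
    """Render a sequence of floats as a double precision[] literal."""
    items = ", ".join(sql_float_literal(v) for v in values)
    return "ARRAY[{}]::double precision[]".format(items)


if __name__ == "__main__":
    print("training_loss_final = " + sql_float_literal(float("nan")))
    print("training_loss = " + sql_float_array([float("nan")]))
```

An alternative that avoids string formatting altogether would be to pass the values as query parameters through {{plpy.prepare}} and {{plpy.execute}}. Whether that fits the existing info-table code is something I haven't checked.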