[jira] [Created] (MADLIB-1443) Crash in fit_multiple when any model reaches loss=nan

Domino Valdano (Jira) Thu, 16 Jul 2020 16:55:15 -0700

Domino Valdano created MADLIB-1443:
--------------------------------------

             Summary: Crash in fit_multiple when any model reaches loss=nan
                 Key: MADLIB-1443
                 URL: https://issues.apache.org/jira/browse/MADLIB-1443
             Project: Apache MADlib
          Issue Type: Bug
          Components: Deep Learning
            Reporter: Domino Valdano



{{fit_multiple}} (and probably {{fit}} as well, though I haven't tested it) 
can crash when the loss becomes {{nan}} for any model. I hit it with these 
compile params:

$$loss='categorical_crossentropy',optimizer='SGD(lr=0.05, momentum=1.1)',metrics=['accuracy']$$

Clearly, this was not a great choice for the momentum hyperparameter, but Keras 
accepts it and trains all the way through with no errors or exceptions. The 
problem is that the loss becomes infinite (or undefined?) at some point. All 8 
models trained for 10 hours and printed their results, and then 
{{madlib_keras_fit_multiple_model}} crashed while trying to write out the final 
info table:

{{    Training set after iteration 1:}}

{{    mst_key=7: metric=0.446168005466, loss=2.39643478394
    mst_key=12: metric=0.00999999977648, loss=nan}}

{{    mst_key=11: metric=0.165068000555, loss=4.0407166481}}

{{...}}

{{    Validation set after iteration 1:}}

{{    mst_key=7: metric=0.359100013971, loss=2.89618015289
    mst_key=12: metric=0.00999999977648, loss=nan
    mst_key=11: metric=0.151299998164, loss=4.0829615593}}

{{...}}

{{CONTEXT:  PL/Python function "madlib_keras_fit_multiple_model"}}

{{psql:run_fit_mult100.sql:14: ERROR:  spiexceptions.UndefinedColumn: column "nan" does not exist}}

{{LINE 4:                            training_loss_final = nan,}}

{{                               ^}}

{{QUERY:
                           UPDATE places100_mult_model_444_july7_info SET
                           training_metrics_final = 0.00999999977648,
                           training_loss_final = nan,
                           metrics_elapsed_time = ARRAY[33260.02720808983],
                           training_metrics = ARRAY[0.009999999776482582],
                           training_loss = ARRAY[nan]
                           WHERE mst_key = 12}}

{{CONTEXT:  Traceback (most recent call last):}}

{{  PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
    fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())}}

{{  PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper}}

{{  PL/Python function "madlib_keras_fit_multiple_model", line 195, in __init__}}

{{  PL/Python function "madlib_keras_fit_multiple_model", line 543, in insert_info_table}}

{{  PL/Python function "madlib_keras_fit_multiple_model", line 539, in update_info_table}}

{{  PL/Python function "madlib_keras_fit_multiple_model"}}

So even though most of the models trained fine, the crash rolled back all of 
the output, and every model has to be trained from scratch again. While we're 
at it, we should look for other places where {{nan}} might occur.
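
One possible direction for a fix, as a rough sketch only: the failing query 
interpolates the Python float straight into the SQL text, so {{nan}} is parsed 
as a column name. PostgreSQL does accept {{'NaN'}}, {{'Infinity'}} and 
{{'-Infinity'}} as quoted double precision literals. The helper names below 
({{sql_float}}, {{sql_float_array}}) are hypothetical and are not existing 
MADlib functions; this is not necessarily how the module should be patched.

```python
import math


def sql_float(value):
    """Render a Python float as a PostgreSQL double precision literal.

    Hypothetical helper: nan and +/-inf are emitted as quoted, cast
    literals so they are not parsed as column names.
    """
    value = float(value)
    if math.isnan(value):
        return "'NaN'::double precision"
    if math.isinf(value):
        if value > 0:
            return "'Infinity'::double precision"
        return "'-Infinity'::double precision"
    return repr(value)


def sql_float_array(values):
    """Render a sequence of floats as a double precision[] literal."""
    items = ", ".join(sql_float(v) for v in values)
    return "ARRAY[{0}]::double precision[]".format(items)


# Example: the two values that broke the UPDATE in this report.
print("training_loss_final = " + sql_float(float("nan")))
print("training_loss = " + sql_float_array([float("nan")]))
```

Passing the values as query parameters through {{plpy.prepare}} and 
{{plpy.execute}} instead of building the SQL string would probably avoid the 
quoting problem altogether, but I haven't tried that here.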



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
