[
https://issues.apache.org/jira/browse/MADLIB-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Domino Valdano reassigned MADLIB-1443:
--------------------------------------
Assignee: Domino Valdano
> Crash in fit_multiple when any model reaches loss=nan
> -----------------------------------------------------
>
> Key: MADLIB-1443
> URL: https://issues.apache.org/jira/browse/MADLIB-1443
> Project: Apache MADlib
> Issue Type: Bug
> Components: Deep Learning
> Reporter: Domino Valdano
> Assignee: Domino Valdano
> Priority: Minor
> Labels: deeplearning
> Fix For: v1.18.0
>
>
> {{madlib_keras_fit_multiple}} can crash (and {{madlib_keras_fit}} probably
> can too, though I haven't tested it) when the loss for any model becomes
> nan. For example, with these compile params:
> {{$$loss='categorical_crossentropy',optimizer='SGD(lr=0.05,
> momentum=1.1)',metrics=['accuracy']$$}}
> Clearly, this was not a great choice for the momentum hyperparameter, but
> Keras accepts it and trains all the way through with no errors or
> exceptions. The problem is that the loss becomes infinite (or
> undefined?) at some point. All 8 models trained for 10 hours, printed out
> the results, and then {{madlib_keras_fit_multiple}} crashed while trying to
> write out the final info table:
> Training set after iteration 1:
> mst_key=7: metric=0.446168005466, loss=2.39643478394
> mst_key=12: metric=0.00999999977648, loss=nan
> mst_key=11: metric=0.165068000555, loss=4.0407166481
> ...
> Validation set after iteration 1:
> mst_key=7: metric=0.359100013971, loss=2.89618015289
> mst_key=12: metric=0.00999999977648, loss=nan
> mst_key=11: metric=0.151299998164, loss=4.0829615593
> ...
> CONTEXT: PL/Python function "madlib_keras_fit_multiple_model"
> psql:run_fit_mult100.sql:14:
> ERROR: spiexceptions.UndefinedColumn: column "nan" does not exist
> LINE 4: training_loss_final = nan,
> ^
> QUERY:
> UPDATE places100_mult_model_444_july7_info SET
> training_metrics_final = 0.00999999977648,
> training_loss_final = nan,
> metrics_elapsed_time = ARRAY[33260.02720808983],
> training_metrics = ARRAY[0.009999999776482582],
> training_loss = ARRAY[nan]
> WHERE mst_key = 12
> CONTEXT: Traceback (most recent call last):
> PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
> fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
> PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
> PL/Python function "madlib_keras_fit_multiple_model", line 195, in __init__
> PL/Python function "madlib_keras_fit_multiple_model", line 543, in insert_info_table
> PL/Python function "madlib_keras_fit_multiple_model", line 539, in update_info_table
> PL/Python function "madlib_keras_fit_multiple_model"
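
The failing UPDATE above interpolates the Python float directly into the SQL text, so PostgreSQL reads the bare token nan as a column name. PostgreSQL does accept the quoted literals 'NaN', 'Infinity' and '-Infinity' cast to double precision. The following is a minimal sketch of that idea, not the actual MADlib fix: the helper names {{sql_float_literal}} and {{sql_float_array}} and the table name {{example_info}} are hypothetical and do not come from the MADlib source.

```python
import math


def sql_float_literal(value):
    """Render a Python float as a PostgreSQL double precision literal.

    Hypothetical helper for illustration; not part of MADlib.
    """
    if value is None:
        return "NULL"
    value = float(value)
    if math.isnan(value):
        # A bare nan would be parsed as a column name; quote and cast it.
        return "'NaN'::double precision"
    if math.isinf(value):
        sign = "" if value > 0 else "-"
        return "'{0}Infinity'::double precision".format(sign)
    return repr(value)


def sql_float_array(values):
    """Render a sequence of floats as a double precision[] literal."""
    items = ", ".join(sql_float_literal(v) for v in values)
    return "ARRAY[{0}]::double precision[]".format(items)


# Build an UPDATE shaped like the one in the report.
# 'example_info' is a placeholder table name.
loss = float("nan")
query = """
UPDATE example_info SET
    training_loss_final = {loss},
    training_loss = {loss_array}
WHERE mst_key = {mst_key}
""".format(
    loss=sql_float_literal(loss),
    loss_array=sql_float_array([loss]),
    mst_key=12,
)
print(query)
```

An alternative that avoids string interpolation altogether would be to pass the values as parameters through {{plpy.prepare}} and {{plpy.execute}}. Whether that fits the existing code is something I have not checked.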