[ https://issues.apache.org/jira/browse/MADLIB-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Domino Valdano updated MADLIB-1426:
-----------------------------------
    Description:
Whenever I try to run {{madlib_keras_fit_multiple_model()}} on a system without GPUs, it always fails in evaluate, complaining that device {{gpu0}} is not available. This happens regardless of whether {{use_gpus=False}} or {{use_gpus=True}}.

My platform is OSX 10.14.1 with the latest version of madlib (1.17.0) and gpdb5. I think I've also seen this happen on CentOS in gpdb6, so I believe this bug affects all platforms, but I'm not entirely sure of that. It is possibly specific to OSX or gpdb5.

The problem happens in {{internal_keras_eval_transition()}} in {{madlib_keras.py_in}}. With {{use_gpus=False}}, it runs:

{{with K.tf.device(device_name):}}
{{    res = segment_model.evaluate(x_val, y_val)}}

I added a {{plpy.info}} statement to print {{device_name}} at the beginning of this function. I also printed the value of {{use_gpus}} on master before training begins. While {{use_gpus}} is set to false, the {{device_name}} on the segments is set to {{/gpu:0}}. This is the bug (it should be set to {{/cpu:0}}).
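A minimal sketch of the device selection this report expects, not the actual MADlib code. The helper name {{choose_device_name}} and its {{gpus_per_host}} argument are hypothetical, introduced only for illustration. Falling back to the CPU when {{use_gpus=True}} but no GPU is present is also an assumption, not something the report specifies.

```python
def choose_device_name(use_gpus, gpus_per_host):
    """Return a TensorFlow device string for one segment.

    Hypothetical helper for illustration only; MADlib's real logic
    lives in madlib_keras.py_in and may differ.
    """
    # Pick the GPU only when the caller asked for GPUs AND the host
    # actually reports at least one.
    if use_gpus and gpus_per_host > 0:
        return "/gpu:0"
    # Otherwise (including the use_gpus=False case from this report)
    # stay on the CPU.
    return "/cpu:0"


# The failing case described above: use_gpus=False on a host with no GPUs.
print(choose_device_name(False, 0))
```

With a guard like this, the value passed to {{K.tf.device(device_name)}} on a CPU-only segment would be {{/cpu:0}}, which is the behaviour the report says is missing.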
This is the error message that happens:

{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: INFO: 00000: device_name = /gpu:0 (seg0 slice1 127.0.0.1:25432 pid=90299)}}
{{CONTEXT: PL/Python function "internal_keras_eval_transition"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: INFO: 00000: device_name = /gpu:0 (seg2 slice1 127.0.0.1:25434 pid=90301)}}
{{CONTEXT: PL/Python function "internal_keras_eval_transition"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: INFO: 00000: device_name = /gpu:0 (seg1 slice1 127.0.0.1:25433 pid=90300)}}
{{CONTEXT: PL/Python function "internal_keras_eval_transition"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: ERROR: XX000: plpy.SPIError: tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation group_deps: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
(plpython.c:5038) (seg0 slice1 127.0.0.1:25432 pid=90299) (plpython.c:5038)}}
{{DETAIL:}}
{{[[\{{node group_deps}}}} = NoOp[_device="/device:GPU:0"](^loss/mul, ^metrics/acc/Mean)]]
{{Traceback (most recent call last):}}
{{ PL/Python function "internal_keras_eval_transition", line 6, in <module>}}
{{ return madlib_keras.internal_keras_eval_transition(**globals())}}
{{ PL/Python function "internal_keras_eval_transition", line 782, in internal_keras_eval_transition}}
{{ PL/Python function "internal_keras_eval_transition", line 1112, in evaluate}}
{{ PL/Python function "internal_keras_eval_transition", line 391, in test_loop}}
{{ PL/Python function "internal_keras_eval_transition", line 2714, in __call__}}
{{ PL/Python function "internal_keras_eval_transition", line 2670, in _call}}
{{ PL/Python function "internal_keras_eval_transition", line 2622, in _make_callable}}
{{ PL/Python function "internal_keras_eval_transition", line 1469, in _make_callable_from_options}}
{{ PL/Python function "internal_keras_eval_transition", line 1351, in _extend_graph}}
{{PL/Python function "internal_keras_eval_transition"}}
{{CONTEXT: Traceback (most recent call last):}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>}}
{{ fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 230, in fit_multiple_model}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 270, in train_multiple_model}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 302, in evaluate_model}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 417, in compute_loss_and_metrics}}
{{ PL/Python function "madlib_keras_fit_multiple_model", line 739, in get_loss_metric_from_keras_eval}}
{{PL/Python function "madlib_keras_fit_multiple_model"}}
{{LOCATION: PLy_elog, plpython.c:5038}}

> Without GPU's, FitMultipleModel fails in evaluate()
> ---------------------------------------------------
>
>                 Key: MADLIB-1426
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1426
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Domino Valdano
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)