[
https://issues.apache.org/jira/browse/MADLIB-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Domino Valdano updated MADLIB-1426:
-----------------------------------
Description:
Whenever I run {{madlib_keras_fit_multiple_model()}} on a system without GPUs,
it fails in evaluate, complaining that device {{/gpu:0}} is not available. This
happens regardless of whether {{use_gpus=False}} or {{use_gpus=True}}.
My platform is OSX 10.14.1 with the latest version of madlib (1.17.0) and
gpdb5. I think I've also seen this happen on CentOS with gpdb6, so I believe
this bug affects all platforms, but I'm not entirely sure of that. It may be
specific to OSX or gpdb5.
The problem happens in {{internal_keras_eval_transition()}} in
{{madlib_keras.py_in}}.
With {{use_gpus=False}}, it runs:
{{with K.tf.device(device_name):}}
{{ res = segment_model.evaluate(x_val, y_val)}}
I added a {{plpy.info}} statement to print {{device_name}} at the beginning of
this function. I also printed the value of {{use_gpus}} on master before
training begins. Although {{use_gpus}} is False on master, the {{device_name}}
on the segments is set to {{/gpu:0}}. This is the bug: it should be set to
{{/cpu:0}}.
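To make the expected behavior concrete, here is a minimal sketch of how the device string could be derived from the flag. The helper name {{get_device_name}} and the two constants are hypothetical, introduced only for illustration. This is not the actual code in {{madlib_keras.py_in}}.

```python
# Hypothetical helper (not MADlib's actual code): derive the TensorFlow
# device string from the use_gpus flag.
CPU_DEVICE = "/cpu:0"
GPU_DEVICE_FORMAT = "/gpu:{0}"


def get_device_name(use_gpus, gpu_index=0):
    """Return the device string that would be passed to K.tf.device()."""
    if use_gpus:
        return GPU_DEVICE_FORMAT.format(gpu_index)
    # With use_gpus=False the segments should always evaluate on the CPU.
    return CPU_DEVICE


if __name__ == "__main__":
    print(get_device_name(False))
    print(get_device_name(True))
```

With this logic, a run with {{use_gpus=False}} would never hand {{/gpu:0}} to the evaluate step on the segments.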
This is the output, ending in the error message:
INFO: 00000: use_gpus = False
...
INFO: 00000: device_name = /gpu:0 (seg1 slice1 127.0.0.1:25433 pid=90300)
CONTEXT: PL/Python function "internal_keras_eval_transition"
LOCATION: PLy_output, plpython.c:4773
psql:../run_fit_mult_iris.sql:1: ERROR: XX000: plpy.SPIError:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a
device for operation group_deps: Operation was explicitly assigned to
/device:GPU:0 but available devices are [
/job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device
specification refers to a valid device. (plpython.c:5038) (seg0 slice1
127.0.0.1:25432 pid=90299) (plpython.c:5038)
DETAIL:
[[{{node group_deps}} = NoOp[_device="/device:GPU:0"](^loss/mul,
^metrics/acc/Mean)]]
Traceback (most recent call last):
PL/Python function "internal_keras_eval_transition", line 6, in <module>
return madlib_keras.internal_keras_eval_transition(**globals())
PL/Python function "internal_keras_eval_transition", line 782, in
internal_keras_eval_transition
PL/Python function "internal_keras_eval_transition", line 1112, in evaluate
PL/Python function "internal_keras_eval_transition", line 391, in test_loop
PL/Python function "internal_keras_eval_transition", line 2714, in __call__
PL/Python function "internal_keras_eval_transition", line 2670, in _call
PL/Python function "internal_keras_eval_transition", line 2622, in
_make_callable
PL/Python function "internal_keras_eval_transition", line 1469, in
_make_callable_from_options
PL/Python function "internal_keras_eval_transition", line 1351, in
_extend_graph
PL/Python function "internal_keras_eval_transition"
CONTEXT: Traceback (most recent call last):
PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__
PL/Python function "madlib_keras_fit_multiple_model", line 230, in
fit_multiple_model
PL/Python function "madlib_keras_fit_multiple_model", line 270, in
train_multiple_model
PL/Python function "madlib_keras_fit_multiple_model", line 302, in
evaluate_model
PL/Python function "madlib_keras_fit_multiple_model", line 417, in
compute_loss_and_metrics
PL/Python function "madlib_keras_fit_multiple_model", line 739, in
get_loss_metric_from_keras_eval
PL/Python function "madlib_keras_fit_multiple_model"
LOCATION: PLy_elog, plpython.c:5038
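The error lists the devices TensorFlow actually found, and only a CPU is present. As a diagnostic guard (not a fix for the root cause), one could compare the requested device against that list and fall back to the CPU. The sketch below is pure Python. The function names {{device_key}} and {{resolve_device}} are hypothetical, and the list of available device names is assumed to be obtained elsewhere in the form shown in the error message.

```python
# Hypothetical guard (an assumption, not part of MADlib): fall back to the
# CPU when the requested device is not among the available ones.

def device_key(name):
    """Normalize a device string to a (TYPE, index) pair.

    Both "/gpu:0" and "/job:localhost/replica:0/task:0/device:GPU:0"
    normalize to ("GPU", 0).
    """
    last = name.rstrip("/").split("/")[-1]  # "gpu:0" or "device:GPU:0"
    parts = last.split(":")
    return parts[-2].upper(), int(parts[-1])


def resolve_device(requested, available):
    """Return `requested` if it is available, otherwise "/cpu:0"."""
    available_keys = {device_key(d) for d in available}
    if device_key(requested) in available_keys:
        return requested
    return "/cpu:0"


if __name__ == "__main__":
    # Device list copied from the error message above.
    cpu_only = ["/job:localhost/replica:0/task:0/device:CPU:0"]
    print(resolve_device("/gpu:0", cpu_only))
```

A guard like this would only hide the symptom. The underlying problem remains that {{device_name}} does not reflect {{use_gpus=False}}.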
was:
Whenever I try to run {{madlib_keras_fit_multiple_model()}} on a system without
GPU's, it always fails in evaluate complaining that device {{gpu0}} is not
available. This happens regardless of whether {{use_gpus=False}} or
use_gpus=True.
My platform is OSX 10.14.1 with latest version of madlib (1.17.0) and gpdb5. I
think I've also seen this happen on CentOS in gpdb6, so I believe this is a bug
that affects all platforms, but not entirely sure of that. Possibly specific to
OSX or gpdb5.
The problem happens in {{internal_keras_eval_transition()}} in
{{madlib_keras.py_in}}.
With {{use_gpus=False}}, it runs:
{{with K.tf.device(device_name):
res = segment_model.evaluate(x_val, y_val)}}
I added a {{plpy.info}} statement to print {{device_name}} at the beginning of
this function. I also printed the value of {{use_gpus}} on master before
training begins. While {{use_gpus}} is set to false, the {{device_name}} on the
segments is set to {{/gpu:0}}. This is the bug (it should be set to {{/cpu:0}}).
This is the error message that happens:
{{ INFO: 00000: use_gpus = False
...
INFO: 00000: device_name = /gpu:0 (seg1 slice1 127.0.0.1:25433 pid=90300)
CONTEXT: PL/Python function "internal_keras_eval_transition"
LOCATION: PLy_output, plpython.c:4773
psql:../run_fit_mult_iris.sql:1: ERROR: XX000: plpy.SPIError:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a
device for operation group_deps: Operation was explicitly assigned to
/device:GPU:0 but available devices are [
/job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device
specification refers to a valid device. (plpython.c:5038) (seg0 slice1
127.0.0.1:25432 pid=90299) (plpython.c:5038)
DETAIL:
[[{{node group_deps}} = NoOp[_device="/device:GPU:0"](^loss/mul,
^metrics/acc/Mean)]]
Traceback (most recent call last):
PL/Python function "internal_keras_eval_transition", line 6, in <module>
return madlib_keras.internal_keras_eval_transition(**globals())
PL/Python function "internal_keras_eval_transition", line 782, in
internal_keras_eval_transition
PL/Python function "internal_keras_eval_transition", line 1112, in evaluate
PL/Python function "internal_keras_eval_transition", line 391, in test_loop
PL/Python function "internal_keras_eval_transition", line 2714, in __call__
PL/Python function "internal_keras_eval_transition", line 2670, in _call
PL/Python function "internal_keras_eval_transition", line 2622, in
_make_callable
PL/Python function "internal_keras_eval_transition", line 1469, in
_make_callable_from_options
PL/Python function "internal_keras_eval_transition", line 1351, in
_extend_graph
PL/Python function "internal_keras_eval_transition"
CONTEXT: Traceback (most recent call last):
PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__
PL/Python function "madlib_keras_fit_multiple_model", line 230, in
fit_multiple_model
PL/Python function "madlib_keras_fit_multiple_model", line 270, in
train_multiple_model
PL/Python function "madlib_keras_fit_multiple_model", line 302, in
evaluate_model
PL/Python function "madlib_keras_fit_multiple_model", line 417, in
compute_loss_and_metrics
PL/Python function "madlib_keras_fit_multiple_model", line 739, in
get_loss_metric_from_keras_eval
PL/Python function "madlib_keras_fit_multiple_model"
LOCATION: PLy_elog, plpython.c:5038}}
> Without GPU's, FitMultipleModel fails in evaluate()
> ---------------------------------------------------
>
> Key: MADLIB-1426
> URL: https://issues.apache.org/jira/browse/MADLIB-1426
> Project: Apache MADlib
> Issue Type: Bug
> Components: Deep Learning
> Reporter: Domino Valdano
> Priority: Major
>