Hello Benno, +1
Tested on:
Ubuntu 18.04 with SSL
8 GPUs per server
NVIDIA-SMI 418.56

Tested gpu workload with: tensorflow
Image used for testing: tensorflow/tensorflow:1.13.1-gpu-py3

Result:

------- versions ------
DISTRIB_ID=Ubuntu
VERSION_ID="16.04"
driver_version 418.56
CUDA Version 10.0.130
tf version: 1.13.1
---------------
TensorFlow: 1.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
            32 per device
Num batches: 500
Num epochs: 0.01
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
...

Executing pre-exec command '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/3b1ccd4e-e2d6-44ba-bf8d-f7b29881f6a6/backends/overlay/rootfses/e06cb46b-07e6-4e87-8b2d-fa9af29e298b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
Executing pre-exec command '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/3b1ccd4e-e2d6-44ba-bf8d-f7b29881f6a6/backends/overlay/rootfses/e06cb46b-07e6-4e87-8b2d-fa9af29e298b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
Changing root to /data0/mesos/work/provisioner/containers/3b1ccd4e-e2d6-44ba-bf8d-f7b29881f6a6/backends/overlay/rootfses/e06cb46b-07e6-4e87-8b2d-fa9af29e298b
2019-05-02 07:16:57.039394: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-02 07:16:57.250080: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4ec62d0 executing computations on platform CUDA. Devices:
2019-05-02 07:16:57.250152: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-05-02 07:16:57.273117: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2594200000 Hz
2019-05-02 07:16:57.277123: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x503da70 executing computations on platform Host. Devices:
2019-05-02 07:16:57.277177: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-05-02 07:16:57.278024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:83:00.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-05-02 07:16:57.278046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0

> On 29 Apr 2019, at 22:05, Benno Evers <bev...@mesosphere.com> wrote:
>
> Hi Jorge,
>
> I'm admittedly not too familiar with CUDA and tensorflow but the error
> message you describe sounds to me more like a build issue, i.e. it sounds
> like the version of the nvidia driver is different between the docker image
> and the host system?
>
> Maybe you could continue investigating to see if this is related to the
> release itself or caused by some external cause, and create a JIRA ticket to
> capture your findings?
>
> Thanks,
> Benno
>
> On Fri, Apr 26, 2019 at 9:55 PM Jorge Machado <jom...@me.com> wrote:
> Hi all,
>
> did someone tested it on ubuntu 18.04 + nvidia-docker2 ? We are having some
> issues using the cuda 10+ images when doing real processing.
> We still need to check some things but basically we get:
> kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices in this configuration
>
> Logs:
> I0424 13:27:14.000586 30 executor.cpp:726] Forked command at 73
> Preparing rootfs at '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> Marked '/' as rslave
> Executing pre-exec command '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> Executing pre-exec command '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> Changing root to /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> 2019-04-24 13:27:18.346994: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> 2019-04-24 13:27:18.352203: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
> 2019-04-24 13:27:18.352243: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: __host__
> 2019-04-24 13:27:18.352252: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
> 2019-04-24 13:27:18.352295: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0
> 2019-04-24 13:27:18.352329: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported version is: 418.56.0
> 2019-04-24 13:27:18.352338: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices in this configuration
> 2019-04-24 13:27:18.374940: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593920000 Hz
> 2019-04-24 13:27:18.378793: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10 executing computations on platform Host. Devices:
> 2019-04-24 13:27:18.378821: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
> W0424 13:27:18.385210 140191267731200 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
> Instructions for updating:
> Colocations handled automatically by placer.
> W0424 13:27:18.399287 140191267731200 deprecation.py:323] From /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
> Instructions for updating:
> Use keras.layers.conv2d instead.
> W0424 13:27:18.433226 140191267731200 deprecation.py:323] From /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
> Instructions for updating:
> Use keras.layers.max_pooling2d instead.
> W0424 13:27:20.197937 140191267731200 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:20.312573 140191267731200 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:21.082763 140191267731200 deprecation.py:323] From /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
> Instructions for updating:
> Please switch to tf.train.MonitoredTrainingSession
> I0424 13:27:22.013817 140191267731200 session_manager.py:491] Running local_init_op.
> I0424 13:27:22.193911 140191267731200 session_manager.py:493] Done running local_init_op.
> 2019-04-24 13:27:23.181740: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
>   [[{{node tower_0/v/cg/mpool0/MaxPool}}]]
> I0424 13:27:23.262847 140191267731200 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Default MaxPoolingOp only supports NHWC on device type CPU
>   [[node tower_0/v/cg/mpool0/MaxPool (defined at /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261) ]
> running this on nvidia-docker2 works fine.
> image used: tensorflow/tensorflow:latest-gpu
> command: python /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
> on the host nvidia-smi says: NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1
> thx
> Jorge
>> On 26 Apr 2019, at 18:28, Benno Evers <bev...@mesosphere.com> wrote:
>>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.8.0.
>>
>>
>> 1.8.0 includes the following:
>> --------------------------------------------------------------------------------
>> * Greatly reduced allocator cycle time.
>> * Operation feedback for v1 schedulers.
>> * Per-framework minimum allocatable resources.
>> * New CLI subcommands `task attach` and `task exec`.
>> * New `linux/seccomp` isolator.
>> * Support for Docker v2 Schema2 manifest format.
>> * XFS quota for persistent volumes.
>> * **Experimental** Support for the new CSI v1 API.
>>
>> In addition, 1.8.0-rc2 includes the following changes:
>> ---------------------------------------------------------------------------------
>> * Docker manifest v2s2 config with image GC.
>> * Expanded `highlights` section in the CHANGELOG.
>>
>> In addition, 1.8.0-rc3 includes the following changes:
>> ---------------------------------------------------------------------------------
>> * Relaxed protobuf union validation strictness. (MESOS-9740)
>> * Fixed a bug causing non-uniform random results in the random sorter.
>>   (MESOS-9733)
>>
>>
>> The CHANGELOG for the release is available at:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc3
>> --------------------------------------------------------------------------------
>>
>> The candidate for Mesos 1.8.0 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz
>>
>> The tag to be voted on is 1.8.0-rc3:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc3
>>
>> The SHA512 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1253
>>
>> Please vote on releasing this package as Apache Mesos 1.8.0!
>>
>> The vote is open until and passes if a majority of at least 3 +1 PMC votes
>> are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.8.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Benno and Joseph
>
>
>
> --
> Benno Evers
> Software Engineer, Mesosphere