+1
Tested on CentOS 7.4; only known flakiness observed.

-Meng

On Tue, Apr 30, 2019 at 8:14 AM Alex Rukletsov <a...@mesosphere.com> wrote:

> Modulo Jorge's comment (hope he'll come back soon),
>
> +1 (binding).
>
> This rc has been deployed on a cluster internally by us at Mesosphere and
> has been running without noticeable issues for a couple of days now.
>
> Alex.
>
> On Mon, Apr 29, 2019 at 10:05 PM Benno Evers <bev...@mesosphere.com>
> wrote:
>
> > Hi Jorge,
> >
> > I'm admittedly not too familiar with CUDA and TensorFlow, but the error
> > message you describe sounds to me more like a build issue, i.e. it sounds
> > like the version of the NVIDIA driver is different between the Docker
> > image and the host system?
> >
> > Maybe you could continue investigating to see whether this is related to
> > the release itself or has some external cause, and create a JIRA ticket
> > to capture your findings?
> >
> > Thanks,
> > Benno
> >
> > On Fri, Apr 26, 2019 at 9:55 PM Jorge Machado <jom...@me.com> wrote:
> >
> > > Hi all,
> > >
> > > Did anyone test it on Ubuntu 18.04 + nvidia-docker2? We are having
> > > some issues using the CUDA 10+ images when doing real processing. We
> > > still need to check some things, but basically we get:
> > >
> > > kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices in this configuration
> > >
> > >
> > > Logs:
> > >
> > > I0424 13:27:14.000586    30 executor.cpp:726] Forked command at 73
> > > Preparing rootfs at '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> > > Marked '/' as rslave
> > > Executing pre-exec command '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> > > Executing pre-exec command '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> > > Changing root to /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> > > 2019-04-24 13:27:18.346994: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> > > 2019-04-24 13:27:18.352203: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
> > > 2019-04-24 13:27:18.352243: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: __host__
> > > 2019-04-24 13:27:18.352252: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
> > > 2019-04-24 13:27:18.352295: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0
> > > 2019-04-24 13:27:18.352329: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported version is: 418.56.0
> > > 2019-04-24 13:27:18.352338: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices in this configuration
> > > 2019-04-24 13:27:18.374940: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593920000 Hz
> > > 2019-04-24 13:27:18.378793: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10 executing computations on platform Host. Devices:
> > > 2019-04-24 13:27:18.378821: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
> > > W0424 13:27:18.385210 140191267731200 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
> > > Instructions for updating:
> > > Colocations handled automatically by placer.
> > > W0424 13:27:18.399287 140191267731200 deprecation.py:323] From /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
> > > Instructions for updating:
> > > Use keras.layers.conv2d instead.
> > > W0424 13:27:18.433226 140191267731200 deprecation.py:323] From /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
> > > Instructions for updating:
> > > Use keras.layers.max_pooling2d instead.
> > > W0424 13:27:20.197937 140191267731200 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
> > > Instructions for updating:
> > > Use tf.cast instead.
> > > W0424 13:27:20.312573 140191267731200 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
> > > Instructions for updating:
> > > Use tf.cast instead.
> > > W0424 13:27:21.082763 140191267731200 deprecation.py:323] From /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
> > > Instructions for updating:
> > > Please switch to tf.train.MonitoredTrainingSession
> > > I0424 13:27:22.013817 140191267731200 session_manager.py:491] Running local_init_op.
> > > I0424 13:27:22.193911 140191267731200 session_manager.py:493] Done running local_init_op.
> > > 2019-04-24 13:27:23.181740: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
> > >        [[{{node tower_0/v/cg/mpool0/MaxPool}}]]
> > > I0424 13:27:23.262847 140191267731200 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Default MaxPoolingOp only supports NHWC on device type CPU
> > >        [[node tower_0/v/cg/mpool0/MaxPool (defined at /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261) ]
> > >
> > > running this on nvidia-docker2 works fine.
> > >
> > > image used: tensorflow/tensorflow:latest-gpu
> > >
> > > command: python /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
> > >
> > > on the host nvidia-smi says: NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1
> > >
> > > thx
> > >
> > > Jorge
> > >
> > > On 26 Apr 2019, at 18:28, Benno Evers <bev...@mesosphere.com> wrote:
> > >
> > > Hi all,
> > >
> > > Please vote on releasing the following candidate as Apache Mesos 1.8.0.
> > >
> > >
> > > 1.8.0 includes the following:
> > >
> > >
> > > --------------------------------------------------------------------------------
> > > * Greatly reduced allocator cycle time.
> > > * Operation feedback for v1 schedulers.
> > > * Per-framework minimum allocatable resources.
> > > * New CLI subcommands `task attach` and `task exec`.
> > > * New `linux/seccomp` isolator.
> > > * Support for Docker v2 Schema2 manifest format.
> > > * XFS quota for persistent volumes.
> > > * **Experimental** Support for the new CSI v1 API.
> > >
> > > In addition, 1.8.0-rc2 includes the following changes:
> > >
> > >
> > > ---------------------------------------------------------------------------------
> > > * Docker manifest v2s2 config with image GC.
> > > * Expanded `highlights` section in the CHANGELOG.
> > >
> > > In addition, 1.8.0-rc3 includes the following changes:
> > >
> > >
> > > ---------------------------------------------------------------------------------
> > > * Relaxed protobuf union validation strictness. (MESOS-9740)
> > > * Fixed a bug causing non-uniform random results in the random sorter.
> > >   (MESOS-9733)
> > >
> > >
> > > The CHANGELOG for the release is available at:
> > >
> > >
> > > https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc3
> > >
> > >
> > > --------------------------------------------------------------------------------
> > >
> > > The candidate for Mesos 1.8.0 release is available at:
> > >
> > > https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz
> > >
> > > The tag to be voted on is 1.8.0-rc3:
> > > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc3
> > >
> > > The SHA512 checksum of the tarball can be found at:
> > >
> > >
> > > https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.sha512
> > >
> > > The signature of the tarball can be found at:
> > >
> > >
> > > https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.asc
> > >
> > > The PGP key used to sign the release is here:
> > > https://dist.apache.org/repos/dist/release/mesos/KEYS
> > >
> > > The JAR is in a staging repository here:
> > > https://repository.apache.org/content/repositories/orgapachemesos-1253
> > >
> > > Please vote on releasing this package as Apache Mesos 1.8.0!
> > >
> > > The vote is open until  and passes if a majority of at least 3 +1 PMC
> > > votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Mesos 1.8.0
> > > [ ] -1 Do not release this package because ...
> > >
> > > Thanks,
> > > Benno and Joseph
> > >
> > >
> > >
> >
> > --
> > Benno Evers
> > Software Engineer, Mesosphere
> >
>
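
For anyone following up on the driver mismatch Jorge reported, here is a minimal sketch of how the two diagnostic lines in the quoted log could be compared automatically. It assumes the task's stderr has been saved as plain text; `find_driver_mismatch` is a hypothetical helper written for this illustration and is not part of Mesos or TensorFlow.

```python
import re

# These patterns match the two diagnostic lines that appear in the quoted log.
LIBCUDA_RE = re.compile(r"libcuda reported version is: (\d+(?:\.\d+)*)")
KERNEL_RE = re.compile(r"kernel reported version is: (\d+(?:\.\d+)*)")


def find_driver_mismatch(log_text):
    """Return (libcuda_version, kernel_version) if they differ, else None.

    Hypothetical helper: it only compares the version strings that the log
    reports and does not query the driver itself.
    """
    libcuda = LIBCUDA_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    if libcuda is None or kernel is None:
        return None  # the diagnostic lines are not present in this log
    if libcuda.group(1) == kernel.group(1):
        return None  # user-space library and kernel module agree
    return libcuda.group(1), kernel.group(1)


# Two lines taken from the log quoted in this thread.
sample = (
    "cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0\n"
    "cuda_diagnostics.cc:196] kernel reported version is: 418.56.0\n"
)
print(find_driver_mismatch(sample))
```

A non-None result means the libcuda visible inside the container and the kernel module on the host report different versions, which matches Benno's reading of the error. It does not say which component exposed the wrong library.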
