Re: [VOTE] Release Apache Mesos 1.8.0 (rc3)

Jorge Machado Thu, 02 May 2019 00:26:15 -0700

Hello Benno, 

+1


Tested on:
Ubuntu 18.04 with SSL
8 GPUs per server
NVIDIA-SMI 418.56
Tested GPU workload with: TensorFlow
Image used for testing: tensorflow/tensorflow:1.13.1-gpu-py3
Result: 
------- versions ------
DISTRIB_ID=Ubuntu
VERSION_ID="16.04"
driver_version
418.56
CUDA Version 10.0.130
tf version: 1.13.1
---------------
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 500
Num epochs:  0.01
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
...
Executing pre-exec command 
'{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/3b1ccd4e-e2d6-44ba-bf8d-f7b29881f6a6/backends/overlay/rootfses/e06cb46b-07e6-4e87-8b2d-fa9af29e298b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
Executing pre-exec command 
'{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/3b1ccd4e-e2d6-44ba-bf8d-f7b29881f6a6/backends/overlay/rootfses/e06cb46b-07e6-4e87-8b2d-fa9af29e298b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
Changing root to 
/data0/mesos/work/provisioner/containers/3b1ccd4e-e2d6-44ba-bf8d-f7b29881f6a6/backends/overlay/rootfses/e06cb46b-07e6-4e87-8b2d-fa9af29e298b
2019-05-02 07:16:57.039394: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-02 07:16:57.250080: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4ec62d0 executing computations on platform CUDA. Devices:
2019-05-02 07:16:57.250152: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-05-02 07:16:57.273117: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2594200000 Hz
2019-05-02 07:16:57.277123: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x503da70 executing computations on platform Host. Devices:
2019-05-02 07:16:57.277177: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-02 07:16:57.278024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:83:00.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-05-02 07:16:57.278046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
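
The earlier failure quoted below reported a libcuda version (410.48.0) that did not match the kernel driver version (418.56.0). Here is a minimal sketch of pulling both versions out of a task's stderr so they can be compared. It assumes the two cuda_diagnostics.cc lines appear as they do in that run; the function names are hypothetical and are not part of Mesos or TensorFlow.

```python
import re


def driver_versions(log_text):
    """Return (libcuda_version, kernel_version) found in a TensorFlow log.

    Either element is None if the corresponding line is missing.
    """
    # Collapse whitespace first, because mail clients may wrap log lines.
    flat = " ".join(log_text.split())
    lib = re.search(r"libcuda reported version is: (\d+(?:\.\d+)*)", flat)
    ker = re.search(r"kernel reported version is: (\d+(?:\.\d+)*)", flat)
    return (lib.group(1) if lib else None, ker.group(1) if ker else None)


def versions_match(log_text):
    """True only if both versions were found and are identical."""
    lib, ker = driver_versions(log_text)
    return lib is not None and lib == ker


# Sample lines shortened from the failing run quoted in this thread.
sample = """
cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0
cuda_diagnostics.cc:196] kernel reported version is: 418.56.0
"""

if __name__ == "__main__":
    print(driver_versions(sample))
    print(versions_match(sample))
```

A successful run such as the one above prints neither diagnostic line, so versions_match returns False for it as well; the check is only meaningful on logs where cuInit has already failed.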

> On 29 Apr 2019, at 22:05, Benno Evers <bev...@mesosphere.com> wrote:
> 
> Hi Jorge,
> 
> I'm admittedly not too familiar with CUDA and tensorflow but the error 
> message you describe sounds to me more like a build issue, i.e. it sounds 
> like the version of the nvidia driver is different between the docker image 
> and the host system?
> 
> Maybe you could continue investigating to see whether this is related to the 
> release itself or has some external cause, and create a JIRA ticket to 
> capture your findings?
> 
> Thanks,
> Benno
> 
> On Fri, Apr 26, 2019 at 9:55 PM Jorge Machado <jom...@me.com> wrote:
> Hi all, 
> 
> Has anyone tested it on Ubuntu 18.04 + nvidia-docker2? We are having some 
> issues using the CUDA 10+ images when doing real processing. We still need to 
> check some things, but basically we get: 
> kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices in this configuration
> 
> Logs:
> I0424 13:27:14.000586    30 executor.cpp:726] Forked command at 73
> Preparing rootfs at 
> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> Marked '/' as rslave
> Executing pre-exec command 
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> Executing pre-exec command 
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> Changing root to 
> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> 2019-04-24 13:27:18.346994: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> 2019-04-24 13:27:18.352203: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
> 2019-04-24 13:27:18.352243: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: __host__
> 2019-04-24 13:27:18.352252: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
> 2019-04-24 13:27:18.352295: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0
> 2019-04-24 13:27:18.352329: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported version is: 418.56.0
> 2019-04-24 13:27:18.352338: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices in this configuration
> 2019-04-24 13:27:18.374940: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593920000 Hz
> 2019-04-24 13:27:18.378793: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10 executing computations on platform Host. Devices:
> 2019-04-24 13:27:18.378821: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
> W0424 13:27:18.385210 140191267731200 deprecation.py:323] From 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
>  colocate_with (from tensorflow.python.framework.ops) is deprecated and will 
> be removed in a future version.
> Instructions for updating:
> Colocations handled automatically by placer.
> W0424 13:27:18.399287 140191267731200 deprecation.py:323] From 
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129: 
> conv2d (from tensorflow.python.layers.convolutional) is deprecated and will 
> be removed in a future version.
> Instructions for updating:
> Use keras.layers.conv2d instead.
> W0424 13:27:18.433226 140191267731200 deprecation.py:323] From 
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261: 
> max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will 
> be removed in a future version.
> Instructions for updating:
> Use keras.layers.max_pooling2d instead.
> W0424 13:27:20.197937 140191267731200 deprecation.py:323] From 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209:
>  to_float (from tensorflow.python.ops.math_ops) is deprecated and will be 
> removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:20.312573 140191267731200 deprecation.py:323] From 
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066:
>  to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be 
> removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:21.082763 140191267731200 deprecation.py:323] From 
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: 
> __init__ (from tensorflow.python.training.supervisor) is deprecated and will 
> be removed in a future version.
> Instructions for updating:
> Please switch to tf.train.MonitoredTrainingSession
> I0424 13:27:22.013817 140191267731200 session_manager.py:491] Running 
> local_init_op.
> I0424 13:27:22.193911 140191267731200 session_manager.py:493] Done running 
> local_init_op.
> 2019-04-24 13:27:23.181740: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
>        [[{{node tower_0/v/cg/mpool0/MaxPool}}]]
> I0424 13:27:23.262847 140191267731200 coordinator.py:224] Error reported to 
> Coordinator: <class 
> 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Default 
> MaxPoolingOp only supports NHWC on device type CPU
>        [[node tower_0/v/cg/mpool0/MaxPool (defined at 
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261) ]
> Running this on nvidia-docker2 works fine. 
> Image used: tensorflow/tensorflow:latest-gpu
> Command: python /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
> On the host, nvidia-smi says: NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1
> thx
> Jorge 
>> On 26 Apr 2019, at 18:28, Benno Evers <bev...@mesosphere.com> wrote:
>> 
>> Hi all,
>> 
>> Please vote on releasing the following candidate as Apache Mesos 1.8.0.
>> 
>> 
>> 1.8.0 includes the following:
>> --------------------------------------------------------------------------------
>> * Greatly reduced allocator cycle time.
>> * Operation feedback for v1 schedulers.
>> * Per-framework minimum allocatable resources.
>> * New CLI subcommands `task attach` and `task exec`.
>> * New `linux/seccomp` isolator.
>> * Support for Docker v2 Schema2 manifest format.
>> * XFS quota for persistent volumes.
>> * **Experimental** Support for the new CSI v1 API.
>> 
>> In addition, 1.8.0-rc2 includes the following changes:
>> ---------------------------------------------------------------------------------
>> * Docker manifest v2s2 config with image GC.
>> * Expanded `highlights` section in the CHANGELOG.
>> 
>> In addition, 1.8.0-rc3 includes the following changes:
>> ---------------------------------------------------------------------------------
>> * Relaxed protobuf union validation strictness. (MESOS-9740)
>> * Fixed a bug causing non-uniform random results in the random sorter.
>> (MESOS-9733)
>> 
>> 
>> The CHANGELOG for the release is available at:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc3
>> --------------------------------------------------------------------------------
>> 
>> The candidate for Mesos 1.8.0 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz 
>> 
>> The tag to be voted on is 1.8.0-rc3:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc3 
>> 
>> The SHA512 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.sha512
>> 
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.asc
>> 
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS 
>> 
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1253 
>> 
>> Please vote on releasing this package as Apache Mesos 1.8.0!
>> 
>> The vote is open until  and passes if a majority of at least 3 +1 PMC votes
>> are cast.
>> 
>> [ ] +1 Release this package as Apache Mesos 1.8.0
>> [ ] -1 Do not release this package because ...
>> 
>> Thanks,
>> Benno and Joseph
> 
> 
> 
> -- 
> Benno Evers
> Software Engineer, Mesosphere
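
For anyone checking the candidate before voting, here is a minimal sketch of comparing a downloaded tarball against the SHA512 file listed in the vote email. It assumes both files have already been downloaded locally, and that the first whitespace-separated token of the .sha512 file is the hex digest, which may not hold for every checksum file layout. The function names are illustrative only.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(tarball_path, checksum_path):
    """Compare a file's digest with the one stored in a checksum file.

    Assumption: the digest is the first whitespace-separated token of the
    checksum file (e.g. "<hex digest>  mesos-1.8.0.tar.gz").
    """
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    return sha512_of(tarball_path) == expected


# Example call, with local paths that are assumed to exist:
# matches_checksum_file("mesos-1.8.0.tar.gz", "mesos-1.8.0.tar.gz.sha512")
```

This only checks integrity of the download; verifying the .asc signature against the KEYS file is a separate step that this sketch does not cover.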
