mseth10 opened a new issue #19360: URL: https://github.com/apache/incubator-mxnet/issues/19360
## Description

The nightly CD pipeline fails for CUDA 11.0 during testing of the MXNet binaries with `pytest`. All tests run successfully; the error is thrown during cleanup, after `pytest` has finished running a test module. The error was first recorded when commit https://github.com/apache/incubator-mxnet/commit/480d027b85d3feff6fecec70be55eb244ddff289 was merged, which dropped `pytest`'s `teardown` function. Before this commit, the CD pipeline ran successfully for all flavors. The error is specific to CUDA 11.0 and is not observed for CUDA 10.0 or 10.1, as can be seen here: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1848/pipeline/361/

### Error Message

```
Stack trace:
/usr/lib64/libcudnn_ops_infer.so.8 ( + 0x15cb68f) [0x7f7f4ce3e68f]
/usr/lib64/libcudnn_ops_infer.so.8 ( cudnnDestroy + 0x6f ) [0x7f7f4ba78ddf]
/work/mxnet/python/mxnet/../../lib/libmxnet.so ( mshadow::Stream<mshadow::gpu>::DestroyDnnHandle() + 0x2c ) [0x7f81869a29ec]
/work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*) + 0x13b ) [0x7f81869a2c3b]
/work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0x1bb ) [0x7f81869b83ab]
/work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x36 ) [0x7f81869b86f6]
/work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() + 0x32 ) [0x7f81869b3db2]
/work/runtime_functions.sh: line 747:     6 Segmentation fault      (core dumped) pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
2020-10-16 07:44:31,682 - root - INFO - Waiting for status of container a8b282e29adf for 600 s.
2020-10-16 07:44:31,853 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 139}
2020-10-16 07:44:31,854 - root - ERROR - Container exited with an error 😞
2020-10-16 07:44:31,854 - root - INFO - Executed command for reproduction: ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110
```

### Steps to reproduce

I was able to reproduce the error with the following steps on an AWS Ubuntu 18.04 Deep Learning Base AMI:

```
alias python=python3
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
pip3 install -r ci/requirements.txt --user
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
python ci/build.py -e BRANCH=null --docker-registry mxnetci --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_static_libmxnet cu110
python ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110
```

## What have you tried to solve it?

1. The above script takes a long time to run, as it runs a lot of tests. I reduced the reproduction time by reducing the number of tests.
Here's a code diff:

```diff
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 40405b961..6992caa36 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -756,7 +756,9 @@ cd_unittest_ubuntu() {
     export DMLC_LOG_STACK_TRACE_DEPTH=10
     local mxnet_variant=${1:?"This function requires a mxnet variant as the first argument"}
 
+    pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
+    : '
     OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest
     pytest -m 'serial' --durations=50 --verbose tests/python/unittest
@@ -782,6 +784,7 @@ cd_unittest_ubuntu() {
     if [[ ${mxnet_variant} = *mkl ]]; then
         OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -n 4 --durations=50 --verbose tests/python/mkl
     fi
+    '
 }
```

2. I put a print statement before the `waitall` [command](https://github.com/apache/incubator-mxnet/blob/d0ceecbb3e4f2154a7783cba8f6e152b8c9003b1/conftest.py#L68) to check whether it gets executed, and observed that it runs after the module ends, as expected.

## Environment

***We recommend using our script for collecting the diagnostic information with the following command***
`curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`

<details>
<summary>Environment Information</summary>

```
----------Python Info----------
Version      : 3.6.9
Compiler     : GCC 8.4.0
Build        : ('default', 'Oct  8 2020 12:12:24')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.2.3
Directory    : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-5.4.0-1028-aws-x86_64-with-Ubuntu-18.04-bionic
system       : Linux
node         : ip-172-31-5-167
release      : 5.4.0-1028-aws
version      : #29~18.04.1-Ubuntu SMP Tue Oct 6 17:14:23 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3109.947
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0088 sec, LOAD: 0.6286 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1193 sec, LOAD: 0.1101 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.055264949798583984 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0012 sec, LOAD: 0.1100 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0014 sec, LOAD: 0.3008 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0010542869567871094 sec.
----------Environment----------
```
</details>

@leezu @TristonC

---

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
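For intuition about the suspected failure mode, here is a plain-Python sketch (no MXNet; `EngineWorker` and `handle_destroyed` are made-up names, not the project's actual engine code). MXNet's `ThreadedEnginePerDevice` keeps a per-GPU worker thread that destroys its cuDNN handle (`cudnnDestroy`) on shutdown; if teardown does not wait for that worker, handle destruction can race with interpreter and library teardown, whereas a teardown that drains pending work, as `mx.nd.waitall()` in `conftest.py` is meant to, makes the ordering deterministic:

```python
import queue
import threading

class EngineWorker:
    """Toy stand-in for a per-device worker thread that owns a native handle."""

    def __init__(self):
        self.tasks = queue.Queue()
        # Stand-in for the cuDNN handle's lifetime; set() plays the role of
        # cudnnDestroy() having completed.
        self.handle_destroyed = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:                 # shutdown sentinel
                self.handle_destroyed.set()  # "destroy the cuDNN handle"
                return
            task()

    def shutdown(self, wait):
        self.tasks.put(None)
        if wait:
            # Analogous to waitall() + join: teardown blocks here until the
            # worker has destroyed its handle, so nothing races process exit.
            self._thread.join()

worker = EngineWorker()
worker.shutdown(wait=True)
assert worker.handle_destroyed.is_set()  # destruction finished before exit
```

With `wait=False`, the main thread can reach interpreter shutdown while the worker is still mid-destruction, which is the shape of the `cudnnDestroy` segfault in the stack trace above; why only the CUDA 11.0 / cuDNN 8 build trips over this remains the open question.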
