DickJC123 opened a new issue #10977: test_kvstore_gpu.py:test_rsp_push_pull fails when run on a system with only one GPU
URL: https://github.com/apache/incubator-mxnet/issues/10977

## Description

The test test_rsp_push_pull fails when it attempts to create and use the context mx.gpu(1), even when only one GPU is present. Rather than simply failing, the test should ideally fall back to a reduced level of testing when fewer than 2 GPUs are available.

## Environment info (Required)

```
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '10.0.1')
('Directory :', '/home/dcarter/.local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/optimizer.py:136: UserWarning: WARNING: New optimizer mxnet.optimizer.NAG is overriding existing optimizer mxnet.optimizer.NAG
  Optimizer.opt_registry[name].__name__))
('Version :', '1.1.0')
('Directory :', '/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-4.4.0-121-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'DCARTER-DT')
('release :', '4.4.0-121-generic')
('version :', '#145-Ubuntu SMP Fri Apr 13 13:47:23 UTC 2018')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Stepping: 2
CPU MHz: 3693.867
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 6996.26
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
```

## Error Message:

Visible in the repro example below.

## Minimum reproducible example

This error can be reproduced on a multi-GPU system by restricting the visibility of the GPUs:

```
$ CUDA_VISIBLE_DEVICES=0 nosetests --verbose -s tests/python/gpu/test_kvstore_gpu.py:test_rsp_push_pull
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/optimizer.py:136: UserWarning: WARNING: New optimizer mxnet.optimizer.NAG is overriding existing optimizer mxnet.optimizer.NAG
  Optimizer.opt_registry[name].__name__))
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1277946489 to reproduce.
test_kvstore_gpu.test_rsp_push_pull ... terminate called after throwing an instance of 'dmlc::Error'
  what(): [12:09:48] /home/dcarter/mxnet_dev/dgx/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: invalid device ordinal

Stack trace returned 9 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f37871a164a]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f37871a21e8]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mshadow::SetDevice<mshadow::gpu>(int)+0xd0) [0x7f3789d2e140]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent> const&)+0x87) [0x7f3789d38117]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>&&)+0x4e) [0x7f3789d383ce]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x4a) [0x7f3789d31aaa]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f37a7722c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f37af9ab6ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f37af6e141d]

terminate called recursively
Aborted (core dumped)
```

A slightly different error signature has been observed at other times (for reasons that were not pursued):

```
$ CUDA_VISIBLE_DEVICES=0 nosetests --verbose -s tests/python/gpu/test_kvstore_gpu.py:test_rsp_push_pull
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/optimizer.py:136: UserWarning: WARNING: New optimizer mxnet.optimizer.NAG is overriding existing optimizer mxnet.optimizer.NAG
  Optimizer.opt_registry[name].__name__))
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1963724868 to reproduce.
test_kvstore_gpu.test_rsp_push_pull ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=161898769 to reproduce.
ERROR
terminate called after throwing an instance of 'dmlc::Error'
  what(): [11:54:43] src/engine/threaded_engine.cc:320: Check failed: exec_ctx.dev_id < device_count_ (1 vs. 1) Invalid GPU Id: 1, Valid device id should be less than device_count: 1

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f8b068dc95b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f8b068dd4c8]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x33e) [0x7f8b0921f1ce]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*)+0x15f) [0x7f8b0921d21f]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x36c) [0x7f8b08e3333c]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::NDArray::~NDArray()+0xca) [0x7f8b06b0c50a]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(MXNDArrayFree+0x1d) [0x7f8b0929586d]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8b2de3be40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f8b2de3b8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f8b2e04b3df]

Aborted (core dumped)
```

## What have you tried to solve it?

I would prefer that the test's author choose the best subset of testing to perform when only one GPU is present. Clearly, knowing the number of GPUs on the system will be part of this. The list_gpus() utility in test_utils.py uses nvidia-smi, which does not reflect any reduction of visible GPUs that may be in effect when CUDA_VISIBLE_DEVICES is set. Anyone submitting a fix for this issue is free to incorporate the following improved-functionality code snippet:

```
import os
import mxnet as mx

# The list_gpus() function is based on nvidia-smi, which ignores CUDA_VISIBLE_DEVICES and
# so can return an optimistic count of available GPUs.
def available_gpu_count():
    visible_devices_str = os.environ.get('CUDA_VISIBLE_DEVICES', None)
    if visible_devices_str is not None:
        try:
            gpu_list = [int(s) for s in visible_devices_str.split(',')]
            return len(gpu_list)
        except ValueError:
            print('Unparsable environment variable CUDA_VISIBLE_DEVICES')
    return len(mx.test_utils.list_gpus())
```
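The CUDA_VISIBLE_DEVICES parsing above can be exercised independently of MXNet. Below is a minimal standalone sketch; the name `effective_gpu_count` and its `unrestricted_count` parameter are illustrative only (the parameter stands in for `len(mx.test_utils.list_gpus())`), and the `min()` cap is an extra refinement to guard against the variable listing more ordinals than devices that exist:

```python
import os

def effective_gpu_count(unrestricted_count):
    """Number of GPUs actually usable, honoring CUDA_VISIBLE_DEVICES.

    `unrestricted_count` is the count from an nvidia-smi-style probe,
    which ignores the environment variable.
    """
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is not None:
        try:
            # CUDA_VISIBLE_DEVICES is a comma-separated list of ordinals;
            # only the listed devices that actually exist are usable.
            ordinals = [int(s) for s in visible.split(',') if s.strip()]
            return min(len(ordinals), unrestricted_count)
        except ValueError:
            # e.g. the variable may contain GPU UUIDs rather than integers
            print('Unparsable environment variable CUDA_VISIBLE_DEVICES')
    return unrestricted_count
```

For example, with `CUDA_VISIBLE_DEVICES=0` on an 8-GPU box this returns 1, matching the restricted view that triggers the failure above.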
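With such a count available, the graceful degradation requested above could look like the following sketch. `select_test_contexts` is a hypothetical helper, not part of the existing test; it assumes the nose runner honors `unittest.SkipTest`:

```python
import unittest

def select_test_contexts(gpu_count):
    """Pick the GPU ordinals a kvstore test can safely exercise."""
    if gpu_count >= 2:
        return [0, 1]   # full cross-device push/pull coverage
    elif gpu_count == 1:
        return [0]      # reduced, single-device coverage
    else:
        raise unittest.SkipTest('test requires at least one GPU')
```

The test would then build its context list as `[mx.gpu(i) for i in select_test_contexts(n)]` instead of hard-coding mx.gpu(1).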
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services