DickJC123 opened a new issue #10977: test_kvstore_gpu.py:test_rsp_push_pull fails when run on a system with only one GPU
URL: https://github.com/apache/incubator-mxnet/issues/10977

## Description

The test test_rsp_push_pull fails when it attempts to create and use the context mx.gpu(1), even when only one GPU is present. Rather than simply failing, the test should ideally fall back to a reduced level of testing when fewer than 2 GPUs are available.

## Environment info (Required)

```
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '10.0.1')
('Directory :', '/home/dcarter/.local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/optimizer.py:136: UserWarning: WARNING: New optimizer mxnet.optimizer.NAG is overriding existing optimizer mxnet.optimizer.NAG
  Optimizer.opt_registry[name].__name__))
('Version :', '1.1.0')
('Directory :', '/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-4.4.0-121-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'DCARTER-DT')
('release :', '4.4.0-121-generic')
('version :', '#145-Ubuntu SMP Fri Apr 13 13:47:23 UTC 2018')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Stepping: 2
CPU MHz: 3693.867
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 6996.26
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
```

## Error Message:

Visible in the repro example below.

## Minimum reproducible example

This error can be reproduced on a multi-GPU system by restricting the visibility of the GPUs:

```
$ CUDA_VISIBLE_DEVICES=0 nosetests --verbose -s tests/python/gpu/test_kvstore_gpu.py:test_rsp_push_pull
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/optimizer.py:136: UserWarning: WARNING: New optimizer mxnet.optimizer.NAG is overriding existing optimizer mxnet.optimizer.NAG
  Optimizer.opt_registry[name].__name__))
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1277946489 to reproduce.
test_kvstore_gpu.test_rsp_push_pull ... terminate called after throwing an instance of 'dmlc::Error'
  what(): [12:09:48] /home/dcarter/mxnet_dev/dgx/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: invalid device ordinal

Stack trace returned 9 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f37871a164a]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f37871a21e8]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mshadow::SetDevice<mshadow::gpu>(int)+0xd0) [0x7f3789d2e140]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent> const&)+0x87) [0x7f3789d38117]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>&&)+0x4e) [0x7f3789d383ce]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x4a) [0x7f3789d31aaa]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f37a7722c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f37af9ab6ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f37af6e141d]

terminate called recursively
Aborted (core dumped)
```

A slightly different error signature has been observed at other times (for reasons that were not pursued):

```
$ CUDA_VISIBLE_DEVICES=0 nosetests --verbose -s tests/python/gpu/test_kvstore_gpu.py:test_rsp_push_pull
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/optimizer.py:136: UserWarning: WARNING: New optimizer mxnet.optimizer.NAG is overriding existing optimizer mxnet.optimizer.NAG
  Optimizer.opt_registry[name].__name__))
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1963724868 to reproduce.
test_kvstore_gpu.test_rsp_push_pull ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=161898769 to reproduce.
ERROR
terminate called after throwing an instance of 'dmlc::Error'
  what(): [11:54:43] src/engine/threaded_engine.cc:320: Check failed: exec_ctx.dev_id < device_count_ (1 vs. 1) Invalid GPU Id: 1, Valid device id should be less than device_count: 1

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f8b068dc95b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f8b068dd4c8]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x33e) [0x7f8b0921f1ce]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*)+0x15f) [0x7f8b0921d21f]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x36c) [0x7f8b08e3333c]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::NDArray::~NDArray()+0xca) [0x7f8b06b0c50a]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(MXNDArrayFree+0x1d) [0x7f8b0929586d]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8b2de3be40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f8b2de3b8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f8b2e04b3df]

Aborted (core dumped)
```

## What have you tried to solve it?

I would prefer that the test's author choose the best subset of testing to perform when only one GPU is present. Clearly, knowing the number of GPUs on the system will be part of this. The list_gpus() utility in test_utils.py uses nvidia-smi, which does not reflect any reduction of visible GPUs that may be in effect when CUDA_VISIBLE_DEVICES is set. Anyone submitting a fix for this issue is free to incorporate the following improved-functionality code snippet:

```
import os
import mxnet as mx

# The list_gpus() function is based on nvidia-smi, which ignores CUDA_VISIBLE_DEVICES and
# so can return an optimistic count of available GPUs.
def available_gpu_count():
    visible_devices_str = os.environ.get('CUDA_VISIBLE_DEVICES', None)
    if visible_devices_str is not None:
        try:
            gpu_list = [int(s) for s in visible_devices_str.split(',')]
            return len(gpu_list)
        except ValueError:
            print('Unparsable environment variable CUDA_VISIBLE_DEVICES')
    return len(mx.test_utils.list_gpus())
```
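The CUDA_VISIBLE_DEVICES parsing above can be exercised independently of MXNet. Below is a minimal standalone sketch; the name `effective_gpu_count` and its `unrestricted_count` parameter are illustrative only (the parameter stands in for `len(mx.test_utils.list_gpus())`), and the `min()` cap is an extra refinement to guard against the variable listing more ordinals than devices that exist:

```python
import os

def effective_gpu_count(unrestricted_count):
    """Number of GPUs actually usable, honoring CUDA_VISIBLE_DEVICES.

    `unrestricted_count` is the count from an nvidia-smi-style probe,
    which ignores the environment variable.
    """
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is not None:
        try:
            # CUDA_VISIBLE_DEVICES is a comma-separated list of ordinals;
            # only the listed devices that actually exist are usable.
            ordinals = [int(s) for s in visible.split(',') if s.strip()]
            return min(len(ordinals), unrestricted_count)
        except ValueError:
            # e.g. the variable may contain GPU UUIDs rather than integers
            print('Unparsable environment variable CUDA_VISIBLE_DEVICES')
    return unrestricted_count
```

For example, with `CUDA_VISIBLE_DEVICES=0` on an 8-GPU box this returns 1, matching the restricted view that triggers the failure above.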
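With such a count available, the graceful degradation requested above could look like the following sketch. `select_test_contexts` is a hypothetical helper, not part of the existing test; it assumes the nose runner honors `unittest.SkipTest`:

```python
import unittest

def select_test_contexts(gpu_count):
    """Pick the GPU ordinals a kvstore test can safely exercise."""
    if gpu_count >= 2:
        return [0, 1]   # full cross-device push/pull coverage
    elif gpu_count == 1:
        return [0]      # reduced, single-device coverage
    else:
        raise unittest.SkipTest('test requires at least one GPU')
```

The test would then build its context list as `[mx.gpu(i) for i in select_test_contexts(n)]` instead of hard-coding mx.gpu(1).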
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services