OliverColeman opened a new issue #14484: Odd behaviour with 'device' kvstore and CUDA illegal memory access errors
URL: https://github.com/apache/incubator-mxnet/issues/14484

## Description
Training the FCN model from gluon-cv over 2 GPUs I encounter different but perhaps related issues depending on which kind of kvstore I use ('local' and 'device'). (I don't think this is a gluon-cv issue.) Test script included.

## Environment info (Required)
```
----------Python Info----------
Version      : 3.5.6
Compiler     : GCC 7.3.0
Build        : ('default', 'Aug 26 2018 21:41:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 8.1.2
Directory    : /opt/conda/lib/python3.5/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.1
Directory    : /opt/conda/lib/python3.5/site-packages/mxnet
Commit Hash  : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform     : Linux-4.15.0-46-generic-x86_64-with-debian-stretch-sid
system       : Linux
node         : axl1
release      : 4.15.0-46-generic
version      : #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 17
Model name:            AMD Ryzen 3 2200G with Radeon Vega Graphics
Stepping:              0
CPU MHz:               1458.994
CPU max MHz:           3500.0000
CPU min MHz:           1600.0000
BogoMIPS:              6986.85
Virtualization:        AMD-V
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm
cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx hw_pstate sme ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
----------Network Test----------
Setting timeout: 10
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1691 sec, LOAD: 0.6659 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0070 sec, LOAD: 1.3928 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0083 sec, LOAD: 0.8829 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.8895 sec, LOAD: 0.7720 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0078 sec, LOAD: 0.9719 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0093 sec, LOAD: 0.0760 sec.
```

Package used (Python/R/Scala/Julia): Python

## Error Message:
### If kvstore is 'local':
```
epoch 0
[01:09:18] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while...
(setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
-------- autograd.backward(loss)
---------- trainer.step(batch_size)
Traceback (most recent call last):
  File "train.py", line 131, in <module>
    predTop = predTop.reshape((-1,)).astype('uint8').asnumpy()
  File "/opt/conda/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 1972, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/opt/conda/lib/python3.5/site-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [01:09:26] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x381822) [0x7fbe7f130822]
[bt] (1) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x381e08) [0x7fbe7f130e08]
[bt] (2) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f3e198) [0x7fbe81ced198]
[bt] (3) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2faf1ea) [0x7fbe81d5e1ea]
[bt] (4) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f15123) [0x7fbe81cc4123]
[bt] (5) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f1d334) [0x7fbe81ccc334]
[bt] (6) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f213db) [0x7fbe81cd03db]
[bt] (7) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f215fe) [0x7fbe81cd05fe]
[bt] (8) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f1d9fb) [0x7fbe81ccc9fb]
[bt] (9) /opt/conda/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fbe6a362678]
```

### If kvstore is 'device':
There is no error; the process hangs when trying to push to the kvstore in `gluon.Trainer._allreduce_grads()`. The example script below includes some debug code to narrow down where the process hangs.
``` epoch 0 [01:21:38] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable) -------- autograd.backward(loss) ---------- trainer.step(batch_size) kvs 2 <mxnet.kvstore.KVStore object at 0x7f1e6f85f940> a a2 b 0 fcn0_resnetv1s_conv0_weight c c2 d g h b 1 fcn0_resnetv1s_syncbatchnorm0_gamma c c2 d g h b 2 fcn0_resnetv1s_syncbatchnorm0_beta c c2 d g h b 3 fcn0_resnetv1s_syncbatchnorm0_running_mean h b 4 fcn0_resnetv1s_syncbatchnorm0_running_var h b 5 fcn0_resnetv1s_conv1_weight c c2 d g h b 6 fcn0_resnetv1s_syncbatchnorm1_gamma c c2 d g h b 7 fcn0_resnetv1s_syncbatchnorm1_beta c c2 d g h b 8 fcn0_resnetv1s_syncbatchnorm1_running_mean h b 9 fcn0_resnetv1s_syncbatchnorm1_running_var h b 10 fcn0_resnetv1s_conv2_weight c c2 d g h b 11 fcn0_resnetv1s_syncbatchnorm2_gamma c c2 d g h b 12 fcn0_resnetv1s_syncbatchnorm2_beta c c2 d g h b 13 fcn0_resnetv1s_syncbatchnorm2_running_mean h b 14 fcn0_resnetv1s_syncbatchnorm2_running_var h b 15 fcn0_resnetv1s_layers1_conv0_weight c c2 d g h b 16 fcn0_resnetv1s_layers1_syncbatchnorm0_gamma c c2 d g h b 17 fcn0_resnetv1s_layers1_syncbatchnorm0_beta c c2 d g h b 18 fcn0_resnetv1s_layers1_syncbatchnorm0_running_mean h b 19 fcn0_resnetv1s_layers1_syncbatchnorm0_running_var h b 20 fcn0_resnetv1s_layers1_conv1_weight c c2 d g h b 21 fcn0_resnetv1s_layers1_syncbatchnorm1_gamma c c2 d g h b 22 fcn0_resnetv1s_layers1_syncbatchnorm1_beta c c2 d g h b 23 fcn0_resnetv1s_layers1_syncbatchnorm1_running_mean h b 24 fcn0_resnetv1s_layers1_syncbatchnorm1_running_var h b 25 fcn0_resnetv1s_layers1_conv2_weight c c2 d g h b 26 fcn0_resnetv1s_layers1_syncbatchnorm2_gamma c c2 d g h b 27 fcn0_resnetv1s_layers1_syncbatchnorm2_beta c c2 d g h b 28 fcn0_resnetv1s_layers1_syncbatchnorm2_running_mean h b 29 fcn0_resnetv1s_layers1_syncbatchnorm2_running_var h b 30 fcn0_resnetv1s_down1_conv0_weight c c2 d g h b 
31 fcn0_resnetv1s_down1_syncbatchnorm0_gamma c c2
[...hangs here. The python process then refuses to exit regardless of which kill signal I send to it. The docker container also refuses to stop. I have to restart the machine at this point.]
```
Note: the specific layer it stops on varies.

## Minimum reproducible example
```
import sys, math
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd, metric
import gluoncv
from gluoncv.utils.parallel import DataParallelModel, DataParallelCriterion
from gluoncv.model_zoo import get_model
from gluoncv.loss import *
from gluoncv.model_zoo.segbase import *
from mxnet.gluon.data import dataset
from gluoncv.utils import LRScheduler


class DummyDataSet(dataset.Dataset):
    def __init__(self, crop_size):
        self.data = []
        for i in range(5):
            d = mx.ndarray.ones((3, crop_size, crop_size))
            l = mx.ndarray.ones((crop_size, crop_size))
            r = (d, l)
            self.data.append(r)

    @property
    def num_class(self):
        return 5

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class Trainer(gluon.Trainer):
    def step(self, batch_size, ignore_stale_grad=False):
        if not self._kv_initialized:
            print("kvs %d %s" % (len(self._contexts), str(self._kvstore_params['kvstore'])))
            self._init_kvstore()
        if self._params_to_init:
            self._init_params()

        self._optimizer.rescale_grad = self._scale / batch_size

        self._allreduce_grads()
        self._update(ignore_stale_grad)

    def _allreduce_grads(self):
        print("a")
        if self._kvstore:
            print("a2")
            for i, param in enumerate(self._params):
                print("b %d %s" % (i, param.name))
                if param.grad_req != 'null':
                    print("c")
                    plg = param.list_grad()
                    print("c2")
                    self._kvstore.push(i, plg, priority=-i)
                    print("d")
                    if not self._update_on_kvstore:
                        print("e")
                        self._kvstore.pull(i, param.list_grad(), priority=-i,
                                           ignore_sparse=self._distributed)
                        print("f")
                    print("g")
                print("h")
            print("i")
        print("j")


if __name__ == "__main__":
    input_size = 480
    dataset_train = DummyDataSet(input_size)
    data_loader = gluon.data.DataLoader(dataset_train, 2, shuffle=True,
                                        last_batch='rollover', num_workers=4)
    net = get_segmentation_model(model='fcn', dataset='pascal_aug', backbone='resnet50',
                                 norm_layer=mx.gluon.contrib.nn.basic_layers.SyncBatchNorm,
                                 norm_kwargs={'num_devices': 2},
                                 aux=True, crop_size=input_size)
    net.cast('float32')
    exec_contexts = [ mx.gpu(0), mx.gpu(1) ]
    net = DataParallelModel(net, exec_contexts)
    criterion = MixSoftmaxCrossEntropyLoss(True, aux_weight=0.5)
    criterion = DataParallelCriterion(criterion, exec_contexts, True)
    lr_scheduler = LRScheduler(mode='poly', baselr=0.001, niters=len(dataset_train), nepochs=30)
    optimizer_params = {'lr_scheduler': lr_scheduler, 'wd':0.0001, 'momentum': 0.9}
    kv = mx.kv.create('device')
    trainer = Trainer(net.module.collect_params(), 'sgd', optimizer_params, kvstore = kv)
    batch_size = 4

    for epoch in range(0, 30):
        print ("epoch", epoch)
        for i, (data, label) in enumerate(data_loader):
            lr_scheduler.update(i, epoch)
            with autograd.record(True):
                pred = net(data)
                #pred = upsize_parallel_output(pred)
                loss = criterion(pred, label)
                mx.nd.waitall()
                print ("-------- autograd.backward(loss)")
                autograd.backward(loss)
            print ("---------- trainer.step(batch_size)")
            trainer.step(batch_size)

            # DataParallelModel output is a tuple of tuples of NDArrays.
            pred = [ p[0] for p in pred ]
            pred = mx.ndarray.concat(*pred)
            predTop = mx.nd.argmax(pred, 1)
            predNP = predTop.reshape((-1,)).astype('uint8').asnumpy()
```

## Steps to reproduce
1. Run the above script, setting the kvstore type to either `local` or `device`.

## What have you tried to solve it?
1. Disabling gc at the beginning of each epoch and re-enabling it at the end. This seemed to work in one similar-seeming issue, but made no difference for me.

Note: I still get the same result when not using a sub-classed version of gluon.Trainer.
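
For reference, the gc workaround mentioned under "What have you tried" could be sketched roughly as below. This assumes "gc" refers to Python's built-in `gc` module; the names `gc_paused`, `run_epochs` and `train_one_epoch` are hypothetical and not part of the script above or of MXNet.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable Python's cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state only if the collector was on beforehand.
        if was_enabled:
            gc.enable()


def run_epochs(num_epochs, train_one_epoch):
    """Call train_one_epoch(epoch) for each epoch with the collector paused.

    train_one_epoch is a hypothetical callable standing in for the body of
    the epoch loop in the reproduction script.
    """
    for epoch in range(num_epochs):
        with gc_paused():
            train_one_epoch(epoch)
```

With this in place the epoch loop would become something like `run_epochs(30, train_one_epoch)`. As noted above, this made no difference to either failure mode here.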