[GitHub] [incubator-mxnet] OliverColeman opened a new issue #14484: Odd behaviour with 'device' kvstore and CUDA illegal memory access errors

GitBox Wed, 20 Mar 2019 19:13:31 -0700

OliverColeman opened a new issue #14484: Odd behaviour with 'device' kvstore 
and CUDA illegal memory access errors
URL: https://github.com/apache/incubator-mxnet/issues/14484
 
 
   ## Description
   Training the FCN model from gluon-cv on 2 GPUs, I encounter different but 
perhaps related issues depending on which kvstore type I use ('local' or 
'device'). (I don't think this is a gluon-cv issue.) A test script is included.
   
   ## Environment info (Required)
   ```
   ----------Python Info----------
   Version      : 3.5.6
   Compiler     : GCC 7.3.0
   Build        : ('default', 'Aug 26 2018 21:41:56')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 8.1.2
   Directory    : /opt/conda/lib/python3.5/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.3.1
   Directory    : /opt/conda/lib/python3.5/site-packages/mxnet
   Commit Hash   : 19c501680183237d52a862e6ae1dc4ddc296305b
   ----------System Info----------
   Platform     : Linux-4.15.0-46-generic-x86_64-with-debian-stretch-sid
   system       : Linux
   node         : axl1
   release      : 4.15.0-46-generic
   version      : #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                4
   On-line CPU(s) list:   0-3
   Thread(s) per core:    1
   Core(s) per socket:    4
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             AuthenticAMD
   CPU family:            23
   Model:                 17
   Model name:            AMD Ryzen 3 2200G with Radeon Vega Graphics
   Stepping:              0
   CPU MHz:               1458.994
   CPU max MHz:           3500.0000
   CPU min MHz:           1600.0000
   BogoMIPS:              6986.85
   Virtualization:        AMD-V
   Hypervisor vendor:     vertical
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             64K
   L2 cache:              512K
   L3 cache:              4096K
   NUMA node0 CPU(s):     0-3
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid 
aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes 
xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a 
misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb 
bpext perfctr_llc mwaitx hw_pstate sme ssbd ibpb vmmcall fsgsbase bmi1 avx2 
smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves 
clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean 
flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif 
overflow_recov succor smca
   ----------Network Test----------
   Setting timeout: 10
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1691 sec, LOAD: 
0.6659 sec.
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0070 
sec, LOAD: 1.3928 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0083 sec, LOAD: 
0.8829 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.8895 sec, LOAD: 
0.7720 sec.
   Timing for FashionMNIST: 
https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz,
 DNS: 0.0078 sec, LOAD: 0.9719 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0093 sec, 
LOAD: 0.0760 sec.
   ```
   
   Package used (Python/R/Scala/Julia):
   Python
   
   ## Error Message:
   ### If kvstore is 'local':
   ```
   epoch 0
   [01:09:18] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running 
performance tests to find the best convolution algorithm, this can take a 
while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   -------- autograd.backward(loss)
   ---------- trainer.step(batch_size)
   Traceback (most recent call last):
     File "train.py", line 131, in <module>
       predTop = predTop.reshape((-1,)).astype('uint8').asnumpy()
     File "/opt/conda/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", 
line 1972, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/opt/conda/lib/python3.5/site-packages/mxnet/base.py", line 251, in 
check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [01:09:26] 
/home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62:
 Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered
   
   Stack trace returned 10 entries:
   [bt] (0) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x381822) 
[0x7fbe7f130822]
   [bt] (1) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x381e08) 
[0x7fbe7f130e08]
   [bt] (2) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f3e198) 
[0x7fbe81ced198]
   [bt] (3) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2faf1ea) 
[0x7fbe81d5e1ea]
   [bt] (4) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f15123) 
[0x7fbe81cc4123]
   [bt] (5) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f1d334) 
[0x7fbe81ccc334]
   [bt] (6) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f213db) 
[0x7fbe81cd03db]
   [bt] (7) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f215fe) 
[0x7fbe81cd05fe]
   [bt] (8) 
/opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f1d9fb) 
[0x7fbe81ccc9fb]
   [bt] (9) /opt/conda/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fbe6a362678]
   ```
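
MXNet runs operators asynchronously, so the `asnumpy()` call in the traceback above is probably only the first synchronization point after the failing kernel, not necessarily the culprit. A minimal sketch for getting a more accurate location, assuming the script is saved as `train.py` (the name shown in the traceback): rerun it with `MXNET_ENGINE_TYPE=NaiveEngine` and `CUDA_LAUNCH_BLOCKING=1` so work executes synchronously. The helper names `debug_env` and `run_synchronously` are hypothetical and not part of MXNet.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with synchronous-execution settings.

    Assumption: these variables must be set before MXNet is imported,
    which is why the training script is launched in a child process.
    """
    env = dict(os.environ if base is None else base)
    # Run MXNet operators one at a time instead of through the async engine.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Make CUDA kernel launches block so errors surface at the launching call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Skip cuDNN autotuning, as suggested by the log message above.
    env["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"
    return env


def run_synchronously(script="train.py"):
    """Launch the training script with the debug environment; return its exit code."""
    return subprocess.call([sys.executable, script], env=debug_env())

# Usage (not executed here): exit_code = run_synchronously("train.py")
```

This will be much slower than a normal run, so it is only meant for locating the failing operator.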
   ### If kvstore is 'device':
   No error is raised; the process hangs while pushing gradients to the 
kvstore in `gluon.Trainer._allreduce_grads()`. The example script below 
includes debug print statements to narrow down where the process hangs.
   ```
   epoch 0
   [01:21:38] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running 
performance tests to find the best convolution algorithm, this can take a 
while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   -------- autograd.backward(loss)
   ---------- trainer.step(batch_size)
   kvs 2 <mxnet.kvstore.KVStore object at 0x7f1e6f85f940>
   a
   a2
   b 0 fcn0_resnetv1s_conv0_weight
   c
   c2
   d
   g
   h
   b 1 fcn0_resnetv1s_syncbatchnorm0_gamma
   c
   c2
   d
   g
   h
   b 2 fcn0_resnetv1s_syncbatchnorm0_beta
   c
   c2
   d
   g
   h
   b 3 fcn0_resnetv1s_syncbatchnorm0_running_mean
   h
   b 4 fcn0_resnetv1s_syncbatchnorm0_running_var
   h
   b 5 fcn0_resnetv1s_conv1_weight
   c
   c2
   d
   g
   h
   b 6 fcn0_resnetv1s_syncbatchnorm1_gamma
   c
   c2
   d
   g
   h
   b 7 fcn0_resnetv1s_syncbatchnorm1_beta
   c
   c2
   d
   g
   h
   b 8 fcn0_resnetv1s_syncbatchnorm1_running_mean
   h
   b 9 fcn0_resnetv1s_syncbatchnorm1_running_var
   h
   b 10 fcn0_resnetv1s_conv2_weight
   c
   c2
   d
   g
   h
   b 11 fcn0_resnetv1s_syncbatchnorm2_gamma
   c
   c2
   d
   g
   h
   b 12 fcn0_resnetv1s_syncbatchnorm2_beta
   c
   c2
   d
   g
   h
   b 13 fcn0_resnetv1s_syncbatchnorm2_running_mean
   h
   b 14 fcn0_resnetv1s_syncbatchnorm2_running_var
   h
   b 15 fcn0_resnetv1s_layers1_conv0_weight
   c
   c2
   d
   g
   h
   b 16 fcn0_resnetv1s_layers1_syncbatchnorm0_gamma
   c
   c2
   d
   g
   h
   b 17 fcn0_resnetv1s_layers1_syncbatchnorm0_beta
   c
   c2
   d
   g
   h
   b 18 fcn0_resnetv1s_layers1_syncbatchnorm0_running_mean
   h
   b 19 fcn0_resnetv1s_layers1_syncbatchnorm0_running_var
   h
   b 20 fcn0_resnetv1s_layers1_conv1_weight
   c
   c2
   d
   g
   h
   b 21 fcn0_resnetv1s_layers1_syncbatchnorm1_gamma
   c
   c2
   d
   g
   h
   b 22 fcn0_resnetv1s_layers1_syncbatchnorm1_beta
   c
   c2
   d
   g
   h
   b 23 fcn0_resnetv1s_layers1_syncbatchnorm1_running_mean
   h
   b 24 fcn0_resnetv1s_layers1_syncbatchnorm1_running_var
   h
   b 25 fcn0_resnetv1s_layers1_conv2_weight
   c
   c2
   d
   g
   h
   b 26 fcn0_resnetv1s_layers1_syncbatchnorm2_gamma
   c
   c2
   d
   g
   h
   b 27 fcn0_resnetv1s_layers1_syncbatchnorm2_beta
   c
   c2
   d
   g
   h
   b 28 fcn0_resnetv1s_layers1_syncbatchnorm2_running_mean
   h
   b 29 fcn0_resnetv1s_layers1_syncbatchnorm2_running_var
   h
   b 30 fcn0_resnetv1s_down1_conv0_weight
   c
   c2
   d
   g
   h
   b 31 fcn0_resnetv1s_down1_syncbatchnorm0_gamma
   c
   c2
   [...hangs here. The python process then refuses to exit regardless of which 
kill signal I send to it. The docker container also refuses to stop. I have to 
restart the machine at this point.]
   ```
   Note: the specific layer it stops on varies.
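
As an alternative to the print statements, a hang like this can sometimes be located with Python's standard `faulthandler` module, which can dump every thread's Python stack after a timeout. This is only a sketch: the helper names and the 120 second default are hypothetical, and it may produce nothing useful if the process is stuck inside the driver.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=120, stream=sys.stderr):
    """Dump all Python thread stacks to `stream` every `timeout_s` seconds.

    `stream` must be a real file (it needs a file descriptor).
    """
    faulthandler.enable(file=stream)
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=stream, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once trainer.step() has returned."""
    faulthandler.cancel_dump_traceback_later()
```

One way to use it would be to call `arm_hang_watchdog()` just before `trainer.step(batch_size)` and `disarm_hang_watchdog()` right after it returns.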
   
   ## Minimum reproducible example
   ```
   import sys, math
   import numpy as np
   import mxnet as mx
   from mxnet import gluon, autograd, metric
   import gluoncv
   from gluoncv.utils.parallel import DataParallelModel, DataParallelCriterion
   
   from gluoncv.model_zoo import get_model
   from gluoncv.loss import *
   from gluoncv.model_zoo.segbase import *
   from mxnet.gluon.data import dataset
   from gluoncv.utils import LRScheduler
     
   
   class DummyDataSet(dataset.Dataset):
       def __init__(self, crop_size):
           self.data = []
           for i in range(5):
               d = mx.ndarray.ones((3, crop_size, crop_size))
               l = mx.ndarray.ones((crop_size, crop_size))
               r = (d, l)
               self.data.append(r)
   
       @property
       def num_class(self):
           return 5
   
       def __len__(self):
           return len(self.data)
   
       def __getitem__(self, index):
           return self.data[index]
   
     
     
   class Trainer(gluon.Trainer):
       def step(self, batch_size, ignore_stale_grad=False):
           if not self._kv_initialized:
               print("kvs %d %s" % (len(self._contexts), 
str(self._kvstore_params['kvstore'])))
               self._init_kvstore()
           if self._params_to_init:
               self._init_params()
           self._optimizer.rescale_grad = self._scale / batch_size
           self._allreduce_grads()
           self._update(ignore_stale_grad)
           
        
       def _allreduce_grads(self):
           print("a")
           if self._kvstore:
               print("a2")
               for i, param in enumerate(self._params):
                   print("b %d %s" % (i, param.name))
                   if param.grad_req != 'null':
                       print("c")
                       
                       plg = param.list_grad()
                       
                       print("c2")
                       
                       self._kvstore.push(i, plg, priority=-i)
                       
                       print("d")
                       if not self._update_on_kvstore:
                           print("e")
                           self._kvstore.pull(i, param.list_grad(), 
priority=-i, ignore_sparse=self._distributed)
                           print("f")
                       print("g")
                   print("h")
               print("i")
           print("j")
   
   
   
   if __name__ == "__main__":
       input_size = 480
       
       dataset_train = DummyDataSet(input_size)
       data_loader = gluon.data.DataLoader(dataset_train, 2, shuffle=True, 
last_batch='rollover', num_workers=4)
       
       net = get_segmentation_model(model='fcn', dataset='pascal_aug',
                                   backbone='resnet50', 
norm_layer=mx.gluon.contrib.nn.basic_layers.SyncBatchNorm,
                                   norm_kwargs={'num_devices': 2}, aux=True,
                                   crop_size=input_size)
       net.cast('float32')
       
       exec_contexts = [ mx.gpu(0), mx.gpu(1) ]
       
       net = DataParallelModel(net, exec_contexts)
   
       criterion = MixSoftmaxCrossEntropyLoss(True, aux_weight=0.5)
       criterion = DataParallelCriterion(criterion, exec_contexts, True)
   
       lr_scheduler = LRScheduler(mode='poly', baselr=0.001,
                               niters=len(dataset_train), 
                               nepochs=30)
       optimizer_params = {'lr_scheduler': lr_scheduler,
                           'wd':0.0001,
                           'momentum': 0.9}
       
       kv = mx.kv.create('device')
       
       trainer = Trainer(net.module.collect_params(), 'sgd', optimizer_params, 
kvstore = kv)
       
       batch_size = 4
       
       for epoch in range(0, 30):
           print ("epoch", epoch)
           for i, (data, label) in enumerate(data_loader):
               lr_scheduler.update(i, epoch)
               
               with autograd.record(True):
                   pred = net(data)
                   #pred = upsize_parallel_output(pred)
                   loss = criterion(pred, label)
                   mx.nd.waitall()
                   print ("-------- autograd.backward(loss)")
                   autograd.backward(loss)
               print ("---------- trainer.step(batch_size)")
               trainer.step(batch_size)
               
               # DataParallelModel output is a tuple of tuples of NDArrays.
               pred = [ p[0] for p in pred ]
               pred = mx.ndarray.concat(*pred)
               predTop = mx.nd.argmax(pred, 1)
               
               predNP = predTop.reshape((-1,)).astype('uint8').asnumpy()
   
   ```
   ## Steps to reproduce
   1. Run the above script with the kvstore type passed to `mx.kv.create()` 
set to either `local` or `device`.
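
To switch between the two cases without editing the script, the kvstore type could be read from the command line. A minimal sketch; `kvstore_type_from_argv` and `VALID_KVSTORES` are hypothetical names that do not appear in the script above.

```python
import sys

# The two kvstore types discussed in this issue.
VALID_KVSTORES = ("local", "device")


def kvstore_type_from_argv(argv, default="device"):
    """Return the kvstore type given as the first CLI argument, or the default."""
    kv_type = argv[1] if len(argv) > 1 else default
    if kv_type not in VALID_KVSTORES:
        raise ValueError(
            "kvstore must be one of %s, got %r" % (VALID_KVSTORES, kv_type))
    return kv_type

# In the script, the hard-coded line would then become:
#     kv = mx.kv.create(kvstore_type_from_argv(sys.argv))
```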
   
   ## What have you tried to solve it?
   1. Disabling gc at the beginning of each epoch and re-enabling it at the 
end. This seemed to work in one similar-looking issue, but made no difference 
for me.
   
   Note: I still get the same result when not using a sub-classed version of 
gluon.Trainer.
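
For reference, the gc workaround from step 1 amounts to roughly the following. This is a sketch using only the standard library; `gc_paused` is a hypothetical helper name.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable Python's cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state rather than enabling unconditionally.
        if was_enabled:
            gc.enable()

# Usage inside the training loop:
#     for epoch in range(0, 30):
#         with gc_paused():
#             ...  # iterate over data_loader as in the script above
```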


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
