yangshuo0323 commented on issue #19717:
URL: https://github.com/apache/incubator-mxnet/issues/19717#issuecomment-770141771


   I see you trained your model with MXNet 1.7.0. I want to train BERT on 
multiple GPUs, and I have a question for you: have you run into the following 
segmentation fault?
   ```
   [1,4]<stderr>:===================
   [1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: 
address not mapped to object at address 0x30)
   [1,5]<stderr>:==== backtrace ====
   [1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: 
address not mapped to object at address 0x30)
   [1,6]<stderr>:==== backtrace ====
   [1,5]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
   [1,5]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
   [1,5]<stderr>:    2  
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
   [1,5]<stderr>:    3  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44)
 [0x7f428d022564]
   [1,5]<stderr>:    4  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280)
 [0x7f428d025790]
   [1,5]<stderr>:    5  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131)
 [0x7f428d01ded1]
   [1,5]<stderr>:    6  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4)
 [0x7f428cff89d4]
   [1,5]<stderr>:    7  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f)
 [0x7f410243a18f]
   [1,5]<stderr>:    8  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54)
 [0x7f4102431d84]
   [1,5]<stderr>:    9  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd)
 [0x7f42e9da49dd]
   [1,5]<stderr>:   10  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067)
 [0x7f42e9da4067]
   [1,5]<stderr>:   11  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)
 [0x7f42eafd527e]
   [1,5]<stderr>:   12  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4)
 [0x7f42eafd5cb4]
   [1,5]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) 
[0x564d0453c00b]
   [1,5]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
   [1,5]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
   [1,5]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) 
[0x564d04534497]
   [1,5]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
   [1,5]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
   [1,5]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) 
[0x564d04534497]
   [1,5]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
   [1,5]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) 
[0x564d0453420b]
   [1,5]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
   [1,5]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
   [1,5]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
   [1,5]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
   [1,5]<stderr>:   26  python(+0x22bf44) [0x564d045faf44]
   [1,5]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
   [1,5]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
   [1,5]<stderr>:   29  python(+0x2375d5) [0x564d046065d5]
   [1,5]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x564d046066fc]
   [1,5]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) 
[0x7f42ea9c4840]
   [1,5]<stderr>:   32  python(+0x1dc3c0) [0x564d045ab3c0]
   [1,5]<stderr>:===================
   [1,6]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
   [1,6]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
   [1,6]<stderr>:    2  
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
   [1,6]<stderr>:    3  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44)
 [0x7f1c08cd5564]
   [1,6]<stderr>:    4  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280)
 [0x7f1c08cd8790]
   [1,6]<stderr>:    5  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131)
 [0x7f1c08cd0ed1]
   [1,6]<stderr>:    6  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4)
 [0x7f1c08cab9d4]
   [1,6]<stderr>:    7  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f)
 [0x7f1a7e0e118f]
   [1,6]<stderr>:    8  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54)
 [0x7f1a7e0d8d84]
   [1,6]<stderr>:    9  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd)
 [0x7f1c65a579dd]
   [1,6]<stderr>:   10  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067)
 [0x7f1c65a57067]
   [1,6]<stderr>:   11  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)
 [0x7f1c66c8827e]
   [1,6]<stderr>:   12  
/home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4)
 [0x7f1c66c88cb4]
   [1,6]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) 
[0x562df52e800b]
   [1,6]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
   [1,6]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
   [1,6]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) 
[0x562df52e0497]
   [1,6]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
   [1,6]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
   [1,6]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) 
[0x562df52e0497]
   [1,6]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
   [1,6]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) 
[0x562df52e020b]
   [1,6]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
   [1,6]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
   [1,6]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
   [1,6]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x562df52911fc]
   [1,6]<stderr>:   26  python(+0x22bf44) [0x562df53a6f44]
   [1,6]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
   [1,6]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
   [1,6]<stderr>:   29  python(+0x2375d5) [0x562df53b25d5]
   [1,6]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x562df53b26fc]
   [1,6]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) 
[0x7f1c66677840]
   [1,6]<stderr>:   32  python(+0x1dc3c0) [0x562df53573c0]
   [1,6]<stderr>:===================
   --------------------------------------------------------------------------
   Primary job  terminated normally, but 1 process returned
   a non-zero exit code. Per user-direction, the job has been aborted.
   --------------------------------------------------------------------------
   --------------------------------------------------------------------------
   mpirun noticed that process rank 7 with PID 0 on node node106 exited on 
signal 11 (Segmentation fault).
   ```
   - My environment is:
   ```
   gluonnlp               0.10.0
   horovod                0.19.5
   mxnet-cu102            1.7.0
   ```
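Since the output of ranks 5 and 6 is interleaved in the log above, a small helper that groups lines by rank may make the two backtraces easier to compare. This is only a sketch: it assumes each record starts with a `[job,rank]<stderr>:` or `[job,rank]<stdout>:` tag, as in the log above, and that the raw log has one record per line (not hard-wrapped by a mail client). The function name `split_by_rank` is hypothetical and not part of MXNet, Horovod, or MPI.

```python
import re
from collections import defaultdict

# Assumed tag format, taken from the log above: "[job,rank]<stream>:message"
TAG = re.compile(r"^\s*\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")


def split_by_rank(lines):
    """Group tagged log lines by rank.

    Lines without a tag (for example the mpirun summary at the end)
    are collected under the key None.
    """
    by_rank = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        match = TAG.match(line)
        if match:
            rank = int(match.group(2))
            by_rank[rank].append(match.group(4))
        elif line.strip():
            by_rank[None].append(line.strip())
    return dict(by_rank)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python split_by_rank.py < train.log
    for rank, messages in split_by_rank(sys.stdin).items():
        print(f"--- rank {rank} ---")
        print("\n".join(messages))
```

Comparing the per-rank output this way shows whether every rank fails in the same frame (here `horovod_mxnet_broadcast_async` followed by `CopyFromTo`) or whether only some ranks crash.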

