yangshuo0323 commented on issue #19717: URL: https://github.com/apache/incubator-mxnet/issues/19717#issuecomment-770141771
I see you trained your model with MXNet 1.7.0. I want to train BERT on multiple GPUs, and I have a question for you. Have you run into this problem?

```
[1,4]<stderr>:===================
[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,5]<stderr>:==== backtrace ====
[1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,6]<stderr>:==== backtrace ====
[1,5]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
[1,5]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
[1,5]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
[1,5]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f428d022564]
[1,5]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f428d025790]
[1,5]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f428d01ded1]
[1,5]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f428cff89d4]
[1,5]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f410243a18f]
[1,5]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f4102431d84]
[1,5]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f42e9da49dd]
[1,5]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f42e9da4067]
[1,5]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f42eafd527e]
[1,5]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f42eafd5cb4]
[1,5]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x564d0453c00b]
[1,5]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
[1,5]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x564d0453420b]
[1,5]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
[1,5]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
[1,5]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
[1,5]<stderr>: 26 python(+0x22bf44) [0x564d045faf44]
[1,5]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
[1,5]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
[1,5]<stderr>: 29 python(+0x2375d5) [0x564d046065d5]
[1,5]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x564d046066fc]
[1,5]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f42ea9c4840]
[1,5]<stderr>: 32 python(+0x1dc3c0) [0x564d045ab3c0]
[1,5]<stderr>:===================
[1,6]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
[1,6]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
[1,6]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
[1,6]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f1c08cd5564]
[1,6]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f1c08cd8790]
[1,6]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f1c08cd0ed1]
[1,6]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f1c08cab9d4]
[1,6]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f1a7e0e118f]
[1,6]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f1a7e0d8d84]
[1,6]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1c65a579dd]
[1,6]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1c65a57067]
[1,6]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f1c66c8827e]
[1,6]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f1c66c88cb4]
[1,6]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x562df52e800b]
[1,6]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
[1,6]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x562df52e020b]
[1,6]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
[1,6]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
[1,6]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x562df52911fc]
[1,6]<stderr>: 26 python(+0x22bf44) [0x562df53a6f44]
[1,6]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
[1,6]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
[1,6]<stderr>: 29 python(+0x2375d5) [0x562df53b25d5]
[1,6]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x562df53b26fc]
[1,6]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f1c66677840]
[1,6]<stderr>: 32 python(+0x1dc3c0) [0x562df53573c0]
[1,6]<stderr>:===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node node106 exited on signal 11 (Segmentation fault).
```

- My environment is:
```
gluonnlp    0.10.0
horovod     0.19.5
mxnet-cu102 1.7.0
```
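
Since the output of ranks 5 and 6 is interleaved in the log above, a small helper like the following may make it easier to compare the two backtraces. This is only a sketch: it assumes the `[job,rank]<stream>:` prefix that appears in the log (the form mpirun produces with `--tag-output`), and `group_by_rank` is a name made up for illustration, not part of Horovod or MXNet.

```python
import re
from collections import defaultdict

# Matches the "[job,rank]<stream>:" prefix seen in the log above.
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")


def group_by_rank(lines):
    """Return {rank: [text, ...]} for tagged lines; untagged lines are skipped."""
    grouped = defaultdict(list)
    for line in lines:
        match = TAG.match(line.rstrip("\n"))
        if match:
            rank = int(match.group(2))
            grouped[rank].append(match.group(4).strip())
    return dict(grouped)


if __name__ == "__main__":
    # A few lines copied from the log above, still interleaved.
    sample = [
        "[1,5]<stderr>:==== backtrace ====",
        "[1,6]<stderr>:==== backtrace ====",
        "[1,5]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]",
        "[1,6]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]",
    ]
    for rank, entries in sorted(group_by_rank(sample).items()):
        print(f"rank {rank}:")
        for entry in entries:
            print("   ", entry)
```

Passing the saved log as `group_by_rank(open(path))` would work the same way, with `path` pointing at wherever the mpirun output was captured.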
