larroy commented on issue #16326: Mxnet 1.5.0: Crash while training mask-rcnn with horovod
URL: https://github.com/apache/incubator-mxnet/issues/16326#issuecomment-537631390

Another one:

```
[1,6]<stderr>:INFO:root:[Epoch 0 Iteration 100] Set learning rate to 0.004
[1,4]<stderr>:corrupted size vs. prev_size
[1,4]<stderr>:[ip-172-31-6-74:53673] *** Process received signal ***
[1,4]<stderr>:[ip-172-31-6-74:53673] Signal: Aborted (6)
[1,4]<stderr>:[ip-172-31-6-74:53673] Signal code: (-6)
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 0] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fda17a35f20]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 1] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fda17a35e97]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 2] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fda17a37801]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 3] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x89897)[0x7fda17a80897]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 4] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x9090a)[0x7fda17a8790a]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 5] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(cfree+0x80f)[0x7fda17a8f15f]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 6] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b7930)[0x7fd9428dd930]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 7] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c2692)[0x7fd9428e8692]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 8] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25cc8ba)[0x7fd9428f28ba]
[1,4]<stderr>:[ip-172-31-6-74:53673] [ 9] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b53ce)[0x7fd9428db3ce]
[1,4]<stderr>:[ip-172-31-6-74:53673] [10] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b6314)[0x7fd9428dc314]
[1,4]<stderr>:[ip-172-31-6-74:53673] [11] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZN5mxnet7NDArray5ChunkD1Ev+0x3c2)[0x7fd942adb582]
[1,4]<stderr>:[ip-172-31-6-74:53673] [12] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4b42da)[0x7fd9407da2da]
[1,4]<stderr>:[ip-172-31-6-74:53673] [13] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZNSt6vectorIN5mxnet7NDArrayESaIS1_EED1Ev+0x1af)[0x7fd940b09dbf]
[1,4]<stderr>:[ip-172-31-6-74:53673] [14] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENUlNS_10RunContextEE_D1Ev+0x2c3)[0x7fd942984463]
[1,4]<stderr>:[ip-172-31-6-74:53673] [15] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZNSt14_Function_base13_Base_managerIZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS1_9OpContextERKSt6vectorINS1_5TBlobESaISC_EERKSB_INS1_9OpReqTypeESaISH_EESG_EEPKNS4_2OpES7_RKNS1_7ContextERKSB_IPNS1_6engine3VarESaISY_EES12_RKSB_INS1_8ResourceESaIS13_EERKSB_IPNS1_7NDArrayESaIS19_EES1D_RKSB_IjSaIjEESL_EUlNS1_10RunContextEE_E10_M_managerERSt9_Any_dataRKS1L_St18_Manager_operation+0x3a)[0x7fd94298629a]
[1,4]<stderr>:[ip-172-31-6-74:53673] [16] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b601b)[0x7fd9428dc01b]
[1,4]<stderr>:[ip-172-31-6-74:53673] [17] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bd1e1)[0x7fd9428e31e1]
[1,4]<stderr>:[ip-172-31-6-74:53673] [18] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bd616)[0x7fd9428e3616]
[1,4]<stderr>:[ip-172-31-6-74:53673] [19] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c265d)[0x7fd9428e865d]
[1,4]<stderr>:[ip-172-31-6-74:53673] [20] /home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c5820)[0x7fd9428eb820]
[1,4]<stderr>:[ip-172-31-6-74:53673] [21] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c5ab6)[0x7fd9428ebab6]
[1,4]<stderr>:[ip-172-31-6-74:53673] [22] [1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c0a74)[0x7fd9428e6a74]
[1,4]<stderr>:[ip-172-31-6-74:53673] [23] [1,4]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd9e0)[0x7fda096719e0]
[1,4]<stderr>:[ip-172-31-6-74:53673] [24] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fda177df6db]
[1,4]<stderr>:[ip-172-31-6-74:53673] [25] [1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fda17b1888f]
[1,4]<stderr>:[ip-172-31-6-74:53673] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1,4]<stderr>:Process ForkPoolWorker-3:
[1,4]<stderr>:Traceback (most recent call last):
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/pool.py", line 125, in worker
[1,4]<stderr>:    put((job, i, result))
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
[1,4]<stderr>:    self._writer.send_bytes(obj)
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
[1,4]<stderr>:    self._send_bytes(m[offset:offset + size])
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
[1,4]<stderr>:    self._send(header + buf)
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
[1,4]<stderr>:    n = write(self._handle, buf)
[1,4]<stderr>:BrokenPipeError: [Errno 32] Broken pipe
[1,4]<stderr>:
[1,4]<stderr>:During handling of the above exception, another exception occurred:
[1,4]<stderr>:
[1,4]<stderr>:Traceback (most recent call last):
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
[1,4]<stderr>:    self.run()
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
[1,4]<stderr>:    self._target(*self._args, **self._kwargs)
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/pool.py", line 130, in worker
[1,4]<stderr>:    put((job, i, (False, wrapped)))
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
[1,4]<stderr>:    self._writer.send_bytes(obj)
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
[1,4]<stderr>:    self._send_bytes(m[offset:offset + size])
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
[1,4]<stderr>:    self._send(header + buf)
[1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
[1,4]<stderr>:    n = write(self._handle, buf)
[1,4]<stderr>:BrokenPipeError: [Errno 32] Broken pipe
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node ip-172-31-6-74 exited on signal 6 (Aborted).
```