larroy commented on issue #16326: Mxnet 1.5.0: Crash while training mask-rcnn with horovod
URL: https://github.com/apache/incubator-mxnet/issues/16326#issuecomment-537631390
 
 
   Another one:
   
   ```
   [1,6]<stderr>:INFO:root:[Epoch 0 Iteration 100] Set learning rate to 0.004
   [1,4]<stderr>:corrupted size vs. prev_size
   [1,4]<stderr>:[ip-172-31-6-74:53673] *** Process received signal ***
   [1,4]<stderr>:[ip-172-31-6-74:53673] Signal: Aborted (6)
   [1,4]<stderr>:[ip-172-31-6-74:53673] Signal code:  (-6)
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 0] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fda17a35f20]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 1] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fda17a35e97]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 2] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fda17a37801]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 3] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x89897)[0x7fda17a80897]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 4] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x9090a)[0x7fda17a8790a]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 5] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(cfree+0x80f)[0x7fda17a8f15f]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 6] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b7930)[0x7fd9428dd930]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 7] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c2692)[0x7fd9428e8692]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 8] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25cc8ba)[0x7fd9428f28ba]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [ 9] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b53ce)[0x7fd9428db3ce]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [10] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b6314)[0x7fd9428dc314]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [11] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZN5mxnet7NDArray5ChunkD1Ev+0x3c2)[0x7fd942adb582]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [12] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4b42da)[0x7fd9407da2da]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [13] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZNSt6vectorIN5mxnet7NDArrayESaIS1_EED1Ev+0x1af)[0x7fd940b09dbf]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [14] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENUlNS_10RunContextEE_D1Ev+0x2c3)[0x7fd942984463]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [15] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(_ZNSt14_Function_base13_Base_managerIZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS1_9OpContextERKSt6vectorINS1_5TBlobESaISC_EERKSB_INS1_9OpReqTypeESaISH_EESG_EEPKNS4_2OpES7_RKNS1_7ContextERKSB_IPNS1_6engine3VarESaISY_EES12_RKSB_INS1_8ResourceESaIS13_EERKSB_IPNS1_7NDArrayESaIS19_EES1D_RKSB_IjSaIjEESL_EUlNS1_10RunContextEE_E10_M_managerERSt9_Any_dataRKS1L_St18_Manager_operation+0x3a)[0x7fd94298629a]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [16] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b601b)[0x7fd9428dc01b]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [17] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bd1e1)[0x7fd9428e31e1]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [18] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bd616)[0x7fd9428e3616]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [19] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c265d)[0x7fd9428e865d]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [20] 
/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c5820)[0x7fd9428eb820]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [21] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c5ab6)[0x7fd9428ebab6]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [22] 
[1,4]<stderr>:/home/piotr/py3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25c0a74)[0x7fd9428e6a74]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [23] 
[1,4]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd9e0)[0x7fda096719e0]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [24] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fda177df6db]
   [1,4]<stderr>:[ip-172-31-6-74:53673] [25] 
[1,4]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fda17b1888f]
   [1,4]<stderr>:[ip-172-31-6-74:53673] *** End of error message ***
   --------------------------------------------------------------------------
   Primary job  terminated normally, but 1 process returned
   a non-zero exit code. Per user-direction, the job has been aborted.
   --------------------------------------------------------------------------
   [1,4]<stderr>:Process ForkPoolWorker-3:
   [1,4]<stderr>:Traceback (most recent call last):
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/pool.py", line 125, in worker
   [1,4]<stderr>:    put((job, i, result))
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
   [1,4]<stderr>:    self._writer.send_bytes(obj)
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
   [1,4]<stderr>:    self._send_bytes(m[offset:offset + size])
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
   [1,4]<stderr>:    self._send(header + buf)
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
   [1,4]<stderr>:    n = write(self._handle, buf)
   [1,4]<stderr>:BrokenPipeError: [Errno 32] Broken pipe
   [1,4]<stderr>:
   [1,4]<stderr>:During handling of the above exception, another exception occurred:
   [1,4]<stderr>:
   [1,4]<stderr>:Traceback (most recent call last):
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
   [1,4]<stderr>:    self.run()
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
   [1,4]<stderr>:    self._target(*self._args, **self._kwargs)
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/pool.py", line 130, in worker
   [1,4]<stderr>:    put((job, i, (False, wrapped)))
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
   [1,4]<stderr>:    self._writer.send_bytes(obj)
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
   [1,4]<stderr>:    self._send_bytes(m[offset:offset + size])
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
   [1,4]<stderr>:    self._send(header + buf)
   [1,4]<stderr>:  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
   [1,4]<stderr>:    n = write(self._handle, buf)
   [1,4]<stderr>:BrokenPipeError: [Errno 32] Broken pipe
   --------------------------------------------------------------------------
   mpirun noticed that process rank 4 with PID 0 on node ip-172-31-6-74 exited on signal 6 (Aborted).
   ```
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
