Hi again, and thank you to Florent for answering my questions last time. The 
answers were very helpful!

We are seeing intermittent errors when running MPI jobs. We use Open MPI 4.0.3 
with UCX and GPUDirect RDMA, and run multi-node applications under SLURM on a 
cluster. We only recently got GPUDirect RDMA to work, and performance has 
improved, but since RDMA started working, errors like the one below have begun 
to appear at random.

It looks as if MPI_Init is unable to establish connections between all ranks, 
but I am not sure. Can somebody help me make sense of the backtraces below? 
Where should we start digging to find the cause, and has anyone seen similar 
cases?
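
In case it helps with triage, here is a minimal sketch for tallying which nodes 
and HCA ports report the "Transport retry count exceeded" message in output 
like the one below. The function name and the regular expression are 
assumptions written against the message format in this particular log; they 
are not part of UCX or Open MPI, and other UCX versions may format the line 
differently.

```python
import re
from collections import Counter

# Matches the UCX error header as it appears in this log, e.g.
# [r14g04:78340:0:78340] ib_mlx5_log.c:139  Transport retry count exceeded on mlx5_1:1/IB (...)
# (pattern is an assumption based on this output only)
ERROR_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+:\d+\]\s+\S+\s+"
    r"Transport retry count exceeded on\s+(?P<dev>\w+):(?P<port>\d+)"
)


def summarize_retry_errors(log_text):
    """Count retry-exceeded reports per (host, device, port)."""
    # Collapse all whitespace so that messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in ERROR_RE.finditer(flat):
        key = (m.group("host"), m.group("dev"), int(m.group("port")))
        counts[key] += 1
    return counts
```

Calling summarize_retry_errors() on the text of a job's output file returns a 
Counter keyed by (host, device, port), which makes it easier to see whether the 
failures cluster on particular nodes or on one HCA port.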

Best regards,
Oskar


================================= Begin output =================================


Lmod is automatically replacing "hpcx-mpi/2.5.0-cuda" with
"openmpi/4.0.3-cuda".

[r14g04:78340:0:78340] ib_mlx5_log.c:139  Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r14g04:78340:0:78340] ib_mlx5_log.c:139  RC QP 0x89c2 wqe[0]: SEND --e [inl len 18]
[r14g05:35369:0:35369] ib_mlx5_log.c:139  Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r14g05:35369:0:35369] ib_mlx5_log.c:139  RC QP 0x79df wqe[0]: SEND --e [inl len 18]
==== backtrace (tid:  78340) ====
 0 0x000000000004ec80 ucs_fatal_error_message()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
 1 0x00000000000532b5 ucs_log_default_handler()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
 2 0x00000000000533e4 ucs_log_dispatch()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
 3 0x000000000001c793 uct_ib_mlx5_completion_with_err()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
 4 0x0000000000029f7f uct_rc_mlx5_iface_handle_failure()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:216
 5 0x000000000002b17b uct_ib_mlx5_poll_cq()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
 6 0x000000000002b17b uct_rc_mlx5_iface_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:133
 7 0x000000000001fcf2 ucs_callbackq_dispatch()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
 8 0x000000000001fcf2 uct_worker_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
 9 0x000000000001fcf2 ucp_worker_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
10 0x0000000000004877 mca_pml_ucx_progress()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:515
11 0x0000000000036bdc opal_progress()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/opal/runtime/opal_progress.c:231
12 0x00000000000bf179 wait_completion()  hcoll_collectives.c:0
13 0x000000000001d0f4 comm_allreduce_hcolrte_generic()  common_allreduce.c:0
14 0x000000000001d72b comm_allreduce_hcolrte()  ???:0
15 0x000000000001380b hmca_bcol_ucx_p2p_init_query.part.4()  
bcol_ucx_p2p_component.c:0
16 0x00000000000cb86c hmca_bcol_base_init()  ???:0
17 0x000000000004a328 hmca_coll_ml_init_query()  ???:0
18 0x00000000000bff37 hcoll_init_with_opts()  ???:0
19 0x0000000000005f90 mca_coll_hcoll_comm_query()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
20 0x00000000000837ca query_2_0_0()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:449
21 0x00000000000837ca query()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:432
22 0x00000000000837ca check_one_component()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:394
23 0x00000000000837ca check_components()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:344
24 0x00000000000837ca mca_coll_base_comm_select()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:126
25 0x00000000000beb66 ompi_mpi_init()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
26 0x0000000000074fb1 PMPI_Init()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:69
27 0x000000000040250c main()  ???:0
28 0x0000000000022545 __libc_start_main()  ???:0
29 0x0000000000402e98 _start()  ???:0
=================================
[r14g04:78340] *** Process received signal ***
[r14g04:78340] Signal: Aborted (6)
[r14g04:78340] Signal code:  (-6)
[r14g04:78340] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f96c4788630]
[r14g04:78340] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f96c3d4f377]
[r14g04:78340] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f96c3d50a68]
[r14g04:78340] [ 3] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f9641493c85]
[r14g04:78340] [ 4] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(+0x532b5)[0x7f96414982b5]
[r14g04:78340] [ 5] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x7f96414983e4]
[r14g04:78340] [ 6] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x7f96404f1793]
[r14g04:78340] [ 7] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(+0x29f7f)[0x7f96404fef7f]
[r14g04:78340] [ 8] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x41b)[0x7f964050017b]
[r14g04:78340] [ 9] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x7f9641c04cf2]
[r14g04:78340] [10] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f9641e42877]
[r14g04:78340] [11] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f96c3c40bdc]
[r14g04:78340] [12] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0xbf179)[0x7f963a9e6179]
[r14g04:78340] [13] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0x1d0f4)[0x7f963a9440f4]
[r14g04:78340] [14] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4b)[0x7f963a94472b]
[r14g04:78340] [15] 
/appl/opt/hcoll/4.4.2938/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x1380b)[0x7f962479980b]
[r14g04:78340] [16] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f963a9f286c]
[r14g04:78340] [17] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f963a971328]
[r14g04:78340] [18] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f963a9e6f37]
[r14g04:78340] [19] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x450)[0x7f963ac6ef90]
[r14g04:78340] [20] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(mca_coll_base_comm_select+0x13a)[0x7f96c618a7ca]
[r14g04:78340] [21] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(ompi_mpi_init+0xec6)[0x7f96c61c5b66]
[r14g04:78340] [22] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(MPI_Init+0x81)[0x7f96c617bfb1]
[r14g04:78340] [23] 
/scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x40250c]
[r14g04:78340] [24] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f96c3d3b545]
[r14g04:78340] [25] 
/scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x402e98]
[r14g04:78340] *** End of error message ***
==== backtrace (tid:  35369) ====
 0 0x000000000004ec80 ucs_fatal_error_message()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
 1 0x00000000000532b5 ucs_log_default_handler()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
 2 0x00000000000533e4 ucs_log_dispatch()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
 3 0x000000000001c793 uct_ib_mlx5_completion_with_err()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
 4 0x0000000000029f7f uct_rc_mlx5_iface_handle_failure()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:216
 5 0x000000000002b17b uct_ib_mlx5_poll_cq()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
 6 0x000000000002b17b uct_rc_mlx5_iface_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:133
 7 0x000000000001fcf2 ucs_callbackq_dispatch()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
 8 0x000000000001fcf2 uct_worker_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
 9 0x000000000001fcf2 ucp_worker_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
10 0x0000000000004877 mca_pml_ucx_progress()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:515
11 0x0000000000036bdc opal_progress()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/opal/runtime/opal_progress.c:231
12 0x00000000000bf179 wait_completion()  hcoll_collectives.c:0
13 0x000000000001e52e comm_allgather_hcolrte()  ???:0
14 0x00000000000138b7 hmca_bcol_ucx_p2p_init_query.part.4()  
bcol_ucx_p2p_component.c:0
15 0x00000000000cb86c hmca_bcol_base_init()  ???:0
16 0x000000000004a328 hmca_coll_ml_init_query()  ???:0
17 0x00000000000bff37 hcoll_init_with_opts()  ???:0
18 0x0000000000005f90 mca_coll_hcoll_comm_query()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
19 0x00000000000837ca query_2_0_0()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:449
20 0x00000000000837ca query()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:432
21 0x00000000000837ca check_one_component()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:394
22 0x00000000000837ca check_components()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:344
23 0x00000000000837ca mca_coll_base_comm_select()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:126
24 0x00000000000beb66 ompi_mpi_init()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
25 0x0000000000074fb1 PMPI_Init()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:69
26 0x000000000040250c main()  ???:0
27 0x0000000000022545 __libc_start_main()  ???:0
28 0x0000000000402e98 _start()  ???:0
=================================
[r14g05:35369] *** Process received signal ***
[r14g05:35369] Signal: Aborted (6)
[r14g05:35369] Signal code:  (-6)
[r14g05:35369] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f39274da630]
[r14g05:35369] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f3926aa1377]
[r14g05:35369] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f3926aa2a68]
[r14g05:35369] [ 3] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f38a41e5c85]
[r14g05:35369] [ 4] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(+0x532b5)[0x7f38a41ea2b5]
[r14g05:35369] [ 5] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x7f38a41ea3e4]
[r14g05:35369] [ 6] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x7f389f0de793]
[r14g05:35369] [ 7] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(+0x29f7f)[0x7f389f0ebf7f]
[r14g05:35369] [ 8] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x41b)[0x7f389f0ed17b]
[r14g05:35369] [ 9] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x7f38a4956cf2]
[r14g05:35369] [10] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f38a4b94877]
[r14g05:35369] [11] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f3926992bdc]
[r14g05:35369] [12] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0xbf179)[0x7f389d6fa179]
[r14g05:35369] [13] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(comm_allgather_hcolrte+0xcae)[0x7f389d65952e]
[r14g05:35369] [14] 
/appl/opt/hcoll/4.4.2938/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x138b7)[0x7f38907118b7]
[r14g05:35369] [15] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f389d70686c]
[r14g05:35369] [16] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f389d685328]
[r14g05:35369] [17] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f389d6faf37]
[r14g05:35369] [18] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x450)[0x7f38a4050f90]
[r14g05:35369] [19] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(mca_coll_base_comm_select+0x13a)[0x7f3928edc7ca]
[r14g05:35369] [20] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(ompi_mpi_init+0xec6)[0x7f3928f17b66]
[r14g05:35369] [21] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(MPI_Init+0x81)[0x7f3928ecdfb1]
[r14g05:35369] [22] 
/scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x40250c]
[r14g05:35369] [23] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3926a8d545]
[r14g05:35369] [24] 
/scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x402e98]
[r14g05:35369] *** End of error message ***
[r03g01:128574:0:128574] ib_mlx5_log.c:139  Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r03g01:128574:0:128574] ib_mlx5_log.c:139  RC QP 0xdf69 wqe[0]: SEND --e [inl len 18]
==== backtrace (tid: 128574) ====
 0 0x000000000004ec80 ucs_fatal_error_message()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
 1 0x00000000000532b5 ucs_log_default_handler()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
 2 0x00000000000533e4 ucs_log_dispatch()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
 3 0x000000000001c793 uct_ib_mlx5_completion_with_err()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
 4 0x0000000000029f7f uct_rc_mlx5_iface_handle_failure()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:216
 5 0x000000000002b17b uct_ib_mlx5_poll_cq()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
 6 0x000000000002b17b uct_rc_mlx5_iface_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:133
 7 0x000000000001fcf2 ucs_callbackq_dispatch()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
 8 0x000000000001fcf2 uct_worker_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
 9 0x000000000001fcf2 ucp_worker_progress()  
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
10 0x0000000000004877 mca_pml_ucx_progress()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:515
11 0x0000000000036bdc opal_progress()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/opal/runtime/opal_progress.c:231
12 0x00000000000bf179 wait_completion()  hcoll_collectives.c:0
13 0x000000000001e52e comm_allgather_hcolrte()  ???:0
14 0x00000000000138b7 hmca_bcol_ucx_p2p_init_query.part.4()  
bcol_ucx_p2p_component.c:0
15 0x00000000000cb86c hmca_bcol_base_init()  ???:0
16 0x000000000004a328 hmca_coll_ml_init_query()  ???:0
17 0x00000000000bff37 hcoll_init_with_opts()  ???:0
18 0x0000000000005f90 mca_coll_hcoll_comm_query()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
19 0x00000000000837ca query_2_0_0()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:449
20 0x00000000000837ca query()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:432
21 0x00000000000837ca check_one_component()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:394
22 0x00000000000837ca check_components()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:344
23 0x00000000000837ca mca_coll_base_comm_select()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:126
24 0x00000000000beb66 ompi_mpi_init()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
25 0x0000000000074fb1 PMPI_Init()  
/local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:69
26 0x000000000040250c main()  ???:0
27 0x0000000000022545 __libc_start_main()  ???:0
28 0x0000000000402e98 _start()  ???:0
=================================
[r03g01:128574] *** Process received signal ***
[r03g01:128574] Signal: Aborted (6)
[r03g01:128574] Signal code:  (-6)
[r03g01:128574] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f75e9284630]
[r03g01:128574] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f75e884b377]
[r03g01:128574] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f75e884ca68]
[r03g01:128574] [ 3] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f7565f8fc85]
[r03g01:128574] [ 4] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(+0x532b5)[0x7f7565f942b5]
[r03g01:128574] [ 5] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x7f7565f943e4]
[r03g01:128574] [ 6] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x7f7564fed793]
[r03g01:128574] [ 7] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(+0x29f7f)[0x7f7564ffaf7f]
[r03g01:128574] [ 8] 
/appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x41b)[0x7f7564ffc17b]
[r03g01:128574] [ 9] 
/appl/opt/ucx/1.7.0-mlnx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x7f7566700cf2]
[r03g01:128574] [10] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f756693e877]
[r03g01:128574] [11] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f75e873cbdc]
[r03g01:128574] [12] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0xbf179)[0x7f755f4ef179]
[r03g01:128574] [13] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(comm_allgather_hcolrte+0xcae)[0x7f755f44e52e]
[r03g01:128574] [14] 
/appl/opt/hcoll/4.4.2938/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x138b7)[0x7f755c4bb8b7]
[r03g01:128574] [15] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f755f4fb86c]
[r03g01:128574] [16] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f755f47a328]
[r03g01:128574] [17] 
/appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f755f4eff37]
[r03g01:128574] [18] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x450)[0x7f755f777f90]
[r03g01:128574] [19] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(mca_coll_base_comm_select+0x13a)[0x7f75eac867ca]
[r03g01:128574] [20] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(ompi_mpi_init+0xec6)[0x7f75eacc1b66]
[r03g01:128574] [21] 
/appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(MPI_Init+0x81)[0x7f75eac77fb1]
[r03g01:128574] [22] 
/scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x40250c]
[r03g01:128574] [23] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f75e8837545]
[r03g01:128574] [24] 
/scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x402e98]
[r03g01:128574] *** End of error message ***
srun: error: r14g04: task 19: Aborted (core dumped)
srun: Terminating job step 2991688.0
slurmstepd: error: *** STEP 2991688.0 ON r03g01 CANCELLED AT 2020-07-20T18:24:23 ***
srun: error: r14g05: task 22: Aborted (core dumped)
srun: error: r03g01: tasks 0-2: Terminated
srun: error: r14g06: tasks 24-27: Terminated
srun: error: r03g03: tasks 8-11: Terminated
srun: error: r14g03: tasks 12-15: Terminated
srun: error: r03g02: tasks 4-7: Terminated
srun: error: r14g07: tasks 28-31: Terminated
srun: error: r14g05: tasks 20-21,23: Terminated
srun: error: r14g04: tasks 16-18: Terminated
srun: error: r03g01: task 3: Aborted (core dumped)
srun: Force Terminated job step 2991688.0
