I checked further using a modified mpi_hello_world.c (with an MPI_Barrier
added) and a test program that checks connectivity between all processes.
1. With mpi_hello_world_barrier.c, openmpi5 failed the same way as before;
mpich-ofi completed without error.
2. With connectivity_c.c, openmpi5 failed with the same error and did not
pass the connectivity check; mpich-ofi completed and passed it (see below).
So it boils down to openmpi/ucx being unable to communicate between
processes in my network setup?
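Before digging deeper, my plan is to first check which transports UCX
actually detects on the compute nodes, and then to rerun the connectivity
test over TCP only to see whether the failure is specific to the InfiniBand
path. This is only a rough sketch of what I have in mind (I'm assuming
ucx_info and the UCX_TLS variable are the right knobs here, and that the
binary is the openmpi5-connectivity_c built below):

# on a compute node: list the devices and transports UCX detects
ucx_info -d

# from the allocation: rerun the connectivity test restricted to TCP,
# shared memory, and self, bypassing the InfiniBand transports entirely
mpirun -x UCX_TLS=tcp,sm,self ./openmpi5-connectivity_c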
-------------------------------------------------------------------------------------
[av@sms test]$ cat mpi_hello_world_barrier.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    int i;
    for (i = 0; i < world_size; i++) {
        printf("Hello world from processor %s, rank %d out of %d processors\n",
               processor_name, world_rank, world_size);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    // Finalize the MPI environment.
    MPI_Finalize();
}
-------------------------------------------------------------------------------------
[av@c11 ompi]$ cat connectivity_c.c
/*
 * Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
 */

/*
 * Test the connectivity between all processes.
 */

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <unistd.h>
#include <mpi.h>

int
main(int argc, char **argv)
{
    MPI_Status status;
    int verbose = 0;
    int rank;
    int np;              /* number of processes in job */
    int peer;
    int i;
    int j;
    int length;
    char name[MPI_MAX_PROCESSOR_NAME+1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /*
     * If we cannot get the name for whatever reason, just
     * set it to unknown.
     */
    if (MPI_SUCCESS != MPI_Get_processor_name(name, &length)) {
        strcpy(name, "unknown");
    }

    if (argc>1 && strcmp(argv[1], "-v")==0)
        verbose = 1;

    for (i=0; i<np; i++) {
        if (rank==i) {
            /* rank i sends to and receives from each higher rank */
            for (j=i+1; j<np; j++) {
                if (verbose)
                    printf("checking connection between rank %d on %s and rank %-4d\n",
                           i, name, j);
                MPI_Send(&rank, 1, MPI_INT, j, rank, MPI_COMM_WORLD);
                MPI_Recv(&peer, 1, MPI_INT, j, j, MPI_COMM_WORLD, &status);
            }
        } else if (rank>i) {
            /* receive from and reply to rank i */
            MPI_Recv(&peer, 1, MPI_INT, i, i, MPI_COMM_WORLD, &status);
            MPI_Send(&rank, 1, MPI_INT, i, rank, MPI_COMM_WORLD);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank==0)
        printf("Connectivity test on %d processes PASSED.\n", np);

    MPI_Finalize();
    return 0;
}
------------------------------------------------------------
[av@sms ompi]$ mpicc -o openmpi5-connectivity_c connectivity_c.c
[av@sms ompi]$ which mpicc
/opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.7/bin/mpicc
[av@sms ompi]$ salloc -n 6 -N 3
salloc: Granted job allocation 72
salloc: Nodes c[11-13] are ready for job
[av@c11 ompi]$ mpirun openmpi5-connectivity_c
[c11:1928 :0:1928] ud_ep.c:278 Fatal: UD endpoint 0x12e1c70 to <no debug data>: unhandled timeout error
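Since the fatal error comes from ud_ep.c, my next step will probably be to
rerun this case with UCX debug logging to see what the UD endpoint is timing
out on. Just a sketch of what I intend to try (assuming UCX_LOG_LEVEL is the
right variable, that this UCX build has debug logging compiled in, and that
mpirun's -x forwards the setting to all ranks):

# rerun the failing openmpi5 case with verbose UCX logging on every rank
mpirun -x UCX_LOG_LEVEL=debug ./openmpi5-connectivity_c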
------------------------------------------------------------
[av@sms ompi]$ mpicc -o mpich-ofi-connectivity_c connectivity_c.c
[av@sms ompi]$ salloc -n 6 -N 3
salloc: Granted job allocation 71
salloc: Nodes c[11-13] are ready for job
[av@c11 ompi]$ mpirun ./mpich-ofi-connectivity_c
Connectivity test on 6 processes PASSED.
------------------------------------------------------------
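One more thing I plan to try, since the crash always points at the UD
transport: telling UCX to avoid UD and use RC plus shared memory instead,
and then seeing whether the connectivity test passes under openmpi5. This is
only a sketch to isolate the transport, not a fix (I'm assuming this UCX
build accepts the rc/sm/self aliases in UCX_TLS):

# exclude the UD transport named in the ud_ep.c error; restrict UCX to
# RC verbs, shared memory, and loopback
mpirun -x UCX_TLS=rc,sm,self ./openmpi5-connectivity_c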
Achilles.
On Wednesday, July 2, 2025 at 4:01:30 AM UTC-4 George Bosilca wrote:
> UCX 1.8 or UCX 1.18?
>
> Your application does not exchange any data, so it is possible that MPICH's
> behavior differs from OMPI's (i.e., not creating connections vs. creating them
> during MPI_Init). That's why running a slightly different version of the
> hello_world with a barrier would clarify the connections' status.
>
> George.
>
>
> On Tue, Jul 1, 2025 at 10:30 PM Achilles Vassilicos <[email protected]>
> wrote:
>
>> When I use openmpi5, I get the same behavior even with a very small
>> number of processes per node. However, when I use mpich-ofi it runs fine
>> (see below). That gives me confidence that the network is set up correctly.
>> The nodes are connected via InfiniBand ConnectX-3 adapters, and all IB
>> tests show no problems.
>> I found an older post about ucx1.18 having possible issues with
>> openmpi5. I have assumed that ucx1.18 is now fully compatible with
>> openmpi5. Could this be the cause? Does anyone use ucx1.8 with openmpi5? If
>> not ucx1.18, what version is confirmed to work with openmpi5?
>>
>> My test code:
>> ----------------------------------------------------------------------
>> [av@c12 test]$ cat mpi_hello_world.c
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char** argv) {
>>     // Initialize the MPI environment
>>     MPI_Init(NULL, NULL);
>>
>>     // Get the number of processes
>>     int world_size;
>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>
>>     // Get the rank of the process
>>     int world_rank;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>
>>     // Get the name of the processor
>>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>>     int name_len;
>>     MPI_Get_processor_name(processor_name, &name_len);
>>
>>     // Print off a hello world message
>>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>            processor_name, world_rank, world_size);
>>
>>     // Finalize the MPI environment.
>>     MPI_Finalize();
>> }
>> -------------------------------------------------------------------------
>> [av@c12 test]$ which mpirun
>> /opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.7/bin/mpirun
>> [av@sms test]$ mpicc -o openmpi5_hello_world mpi_hello_world.c
>> [av@sms test]$ salloc -n 4 -N 2
>> salloc: Granted job allocation 63
>> salloc: Nodes c[12-13] are ready for job
>> [av@c12 test]$ mpirun ./openmpi5_hello_world
>> Hello world from processor c12, rank 0 out of 4 processors
>> Hello world from processor c12, rank 1 out of 4 processors
>> Hello world from processor c13, rank 3 out of 4 processors
>> Hello world from processor c13, rank 2 out of 4 processors
>> [c12:1709 :0:1709] ud_ep.c:278 Fatal: UD endpoint 0x117ae80 to <no debug data>: unhandled timeout error
>> ==== backtrace (tid: 1709) ====
>>  0  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f200b4f3ee4]
>> ................
>> -----------------------------------------------------------------------
>> [av@sms test]$ which mpicc
>> /opt/ohpc/pub/mpi/mpich-ofi-gnu14-ohpc/3.4.3/bin/mpicc
>> [av@sms test]$ which mpirun
>> /opt/ohpc/pub/mpi/mpich-ofi-gnu14-ohpc/3.4.3/bin/mpirun
>> [av@sms test]$ mpicc -o mpich-ofi_hello_world mpi_hello_world.c
>> [av@sms test]$ salloc -n 4 -N 2
>> salloc: Granted job allocation 66
>> salloc: Nodes c[12-13] are ready for job
>> [av@c12 test]$ mpirun ./mpich-ofi_hello_world
>> Hello world from processor c13, rank 2 out of 4 processors
>> Hello world from processor c13, rank 3 out of 4 processors
>> Hello world from processor c12, rank 0 out of 4 processors
>> Hello world from processor c12, rank 1 out of 4 processors
>> [av@c12 test]$
>> ------------------------------------------------------------------------
>> Achilles
>> On Tuesday, July 1, 2025 at 7:14:06 AM UTC-4 George Bosilca wrote:
>>
>>> This error message is usually due to a misconfiguration of the network.
>>> However, I don't think this is the case here because the output contains
>>> messages from both odd and even ranks (which according to your binding
>>> policy were placed on different nodes) suggesting at least some of the
>>> processes were able to connect (and thus the network configuration is
>>> correct).
>>>
>>> So I'm thinking about some timing issue during network setup, due to the
>>> fact that you have many processes per node and an application that does
>>> nothing except create and then shut down the network layer. Does this
>>> happen if you have fewer processes per node? Does it happen if you add
>>> anything else to the application (such as an `MPI_Barrier(MPI_COMM_WORLD)`)?
>>>
>>> George.
>>>
>>>
>>> On Mon, Jun 30, 2025 at 10:00 PM Achilles Vassilicos <[email protected]>
>>> wrote:
>>>
>>>> Hello all, new to the list.
>>>> While testing my openmpi5.0.7 installation using the simple
>>>> mpi_hello_world.c code, I am experiencing unexpected behavior where the
>>>> execution on the last processor rank hangs with a "fatal unhandled timeout
>>>> error", which leads to core dumps. I confirmed that it happens regardless
>>>> of the compiler I use, i.e., gnu14 or intel2024.0. Moreover, it does not
>>>> happen when I use mpich3.4.3-ofi. Below I am including the settings I am
>>>> using and the runtime error. You will notice that the error happened on
>>>> node c11, which may suggest that there is something wrong with this node.
>>>> However, it turns out that any other node that happens to execute the last
>>>> processor rank leads to the same error. I must be missing something.
>>>> Any thoughts?
>>>> Sorry about the length of the post.
>>>>
>>>> -----------------------------------------------------
>>>> ]$ module list
>>>> Currently Loaded Modules:
>>>>   1) cmake/4.0.0        6) spack/0.23.1          11) mkl/2024.0          16) ifort/2024.0.0              21) EasyBuild/5.0.0
>>>>   2) autotools          7) oclfpga/2024.0.0      12) intel/2024.0.0      17) inspector/2024.2            22) valgrind/3.24.0
>>>>   3) hwloc/2.12.0       8) tbb/2021.11           13) debugger/2024.0.0   18) intel_ipp_intel64/2021.10   23) openmpi5/5.0.7
>>>>   4) libfabric/1.18.0   9) compiler-rt/2024.0.0  14) dpl/2022.3          19) intel_ippcp_intel64/2021.9  24) ucx/1.18.0
>>>>   5) prun/2.2          10) compiler/2024.0.0     15) icc/2023.2.1        20) vtune/2025.3
>>>> ----------------------------------------------------------
>>>> $ sinfo
>>>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>>>> normal* up infinite 10 idle* c[2-10,12]
>>>> normal* up infinite 3 idle c[1,11,13]
>>>> [av@sms test]$ salloc -n 24 -N 2
>>>> salloc: Granted job allocation 61
>>>> salloc: Nodes c[1,11] are ready for job
>>>> [av@c1 test]$ mpirun --display-map --map-by node -x
>>>> MXM_RDMA_PORTS=mlx4_0:1 -mca btl_openib_if_include mlx4_0:1
>>>> mpi_hello_world
>>>>
>>>> ======================== JOB MAP ========================
>>>> Data for JOB prterun-c1-1575@1 offset 0 Total slots allocated 24
>>>> Mapping policy: BYNODE:NOOVERSUBSCRIBE Ranking policy: NODE
>>>> Binding policy: NUMA:IF-SUPPORTED
>>>> Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
>>>>
>>>>
>>>> Data for node: c1 Num slots: 12 Max slots: 0 Num procs: 12
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 0 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 2 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 4 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 6 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 8 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 10 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 12 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 14 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 16 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 18 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 20 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 22 Bound:
>>>> package[0][core:0-17]
>>>>
>>>> Data for node: c11 Num slots: 12 Max slots: 0 Num procs: 12
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 1 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 3 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 5 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 7 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 9 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 11 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 13 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 15 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 17 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 19 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 21 Bound:
>>>> package[0][core:0-17]
>>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 23 Bound:
>>>> package[0][core:0-17]
>>>>
>>>> =============================================================
>>>> Hello world from processor c1, rank 6 out of 24 processors
>>>> Hello world from processor c1, rank 20 out of 24 processors
>>>> Hello world from processor c1, rank 16 out of 24 processors
>>>> Hello world from processor c1, rank 12 out of 24 processors
>>>> Hello world from processor c1, rank 0 out of 24 processors
>>>> Hello world from processor c1, rank 2 out of 24 processors
>>>> Hello world from processor c1, rank 14 out of 24 processors
>>>> Hello world from processor c1, rank 10 out of 24 processors
>>>> Hello world from processor c1, rank 4 out of 24 processors
>>>> Hello world from processor c1, rank 22 out of 24 processors
>>>> Hello world from processor c1, rank 18 out of 24 processors
>>>> Hello world from processor c1, rank 8 out of 24 processors
>>>> Hello world from processor c11, rank 11 out of 24 processors
>>>> Hello world from processor c11, rank 1 out of 24 processors
>>>> Hello world from processor c11, rank 3 out of 24 processors
>>>> Hello world from processor c11, rank 13 out of 24 processors
>>>> Hello world from processor c11, rank 19 out of 24 processors
>>>> Hello world from processor c11, rank 7 out of 24 processors
>>>> Hello world from processor c11, rank 17 out of 24 processors
>>>> Hello world from processor c11, rank 21 out of 24 processors
>>>> Hello world from processor c11, rank 15 out of 24 processors
>>>> Hello world from processor c11, rank 23 out of 24 processors
>>>> Hello world from processor c11, rank 9 out of 24 processors
>>>> Hello world from processor c11, rank 5 out of 24 processors
>>>> [c11:2028 :0:2028] ud_ep.c:278 Fatal: UD endpoint 0x1c8da90 to <no debug data>: unhandled timeout error
>>>> [c11:2035 :0:2035] ud_ep.c:278 Fatal: UD endpoint 0x722a90 to <no debug data>: unhandled timeout error
>>>> [c11:2025 :0:2025] ud_ep.c:278 Fatal: UD endpoint 0xc52a90 to <no debug data>: unhandled timeout error
>>>> ==== backtrace (tid: 2028) ====
>>>>  0  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fade4326ee4]
>>>>  1  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fade4324292]
>>>>  2  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369) [0x7fade4324369]
>>>>  3  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0) [0x7fade110d3f0]
>>>>  4  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987) [0x7fade4319987]
>>>>  5  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fade43abc9a]
>>>>  6  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc) [0x7fade471b9bc]
>>>>  7  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a) [0x7fade471b79a]
>>>>  8  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20) [0x7fade471baf0]
>>>>  9  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(mca_pml_ucx_del_procs+0x140) [0x7fade4d1cd70]
>>>> 10  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xac837) [0x7fade4b27837]
>>>> 11  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize_cleanup_domain+0x53) [0x7fade46aebd3]
>>>> 12  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize+0x2e) [0x7fade46a22be]
>>>> 13  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_rte_finalize+0x1f9) [0x7fade4b21909]
>>>> 14  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xab304) [0x7fade4b26304]
>>>> 15  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_instance_finalize+0xe5) [0x7fade4b26935]
>>>> 16  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_finalize+0x3d1) [0x7fade4b1e091]
>>>> 17  mpi_hello_world() [0x40258f]
>>>> 18  /lib64/libc.so.6(+0x295d0) [0x7fade47b95d0]
>>>> 19  /lib64/libc.so.6(__libc_start_main+0x80) [0x7fade47b9680]
>>>> 20  mpi_hello_world() [0x402455]
>>>> =================================
>>>> [c11:02028] *** Process received signal ***
>>>> [c11:02028] Signal: Aborted (6)
>>>> [c11:02028] Signal code: (-6)
>>>> [c11:02028] [ 0] /lib64/libc.so.6(+0x3ebf0)[0x7fade47cebf0]
>>>> [c11:02028] [ 1] /lib64/libc.so.6(+0x8bedc)[0x7fade481bedc]
>>>> [c11:02028] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fade47ceb46]
>>>> [c11:02028] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fade47b8833]
>>>> [c11:02028] [ 4] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f297)[0x7fade4324297]
>>>> [c11:02028] [ 5] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369)[0x7fade4324369]
>>>> [c11:02028] [ 6] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0)[0x7fade110d3f0]
>>>> [c11:02028] [ 7] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987)[0x7fade4319987]
>>>> [c11:02028] [ 8] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fade43abc9a]
>>>> [c11:02028] [ 9] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc)[0x7fade471b9bc]
>>>> [c11:02028] [10] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a)[0x7fade471b79a]
>>>> [c11:02028] [11] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20)[0x7fade471baf0]
>>>> c11:02028] [12] ==== backtrace (tid: 2035) ====
>>>> ..................
>>>>
>>>> --------------------------------------------------------------------------------
>>>> Achilles
>>>>