We are upgrading a cluster from RHEL6 to RHEL8, and have migrated some nodes to a new partition and reimaged them with RHEL8. I am having some issues getting OpenMPI to work with InfiniBand on the nodes upgraded to RHEL8.
For testing purposes, I am trying to run a simple MPI "hello world" code on the local RHEL8 host (obviously, I am also having issues on multiple nodes, but I am trying to simplify). If I run with the BTL set to vader,self or tcp,self on the command line (sample invocations after the log below), the MPI code runs as expected. If I set it to openib,self (or leave it unset), the job just hangs indefinitely, e.g.:

bash> mpirun -H localhost -v --mca mpi_cuda_support 0 --mca btl_openib_verbose 1 --mca btl openib,self -n 1 --show-progress -d --debug-daemons ./hello-world-mpi
[compute-a20-3.XXX.YYY.ZZZ:30383] procdir: /tmp/ompi.compute-a20-3.34676/pid.30383/0/0
[compute-a20-3.XXX.YYY.ZZZ:30383] jobdir: /tmp/ompi.compute-a20-3.34676/pid.30383/0
[compute-a20-3.XXX.YYY.ZZZ:30383] top: /tmp/ompi.compute-a20-3.34676/pid.30383
[compute-a20-3.XXX.YYY.ZZZ:30383] top: /tmp/ompi.compute-a20-3.34676
[compute-a20-3.XXX.YYY.ZZZ:30383] tmp: /tmp
[compute-a20-3.XXX.YYY.ZZZ:30383] sess_dir_cleanup: job session dir does not exist
[compute-a20-3.XXX.YYY.ZZZ:30383] sess_dir_cleanup: top session dir not empty - leaving
[compute-a20-3.XXX.YYY.ZZZ:30383] procdir: /tmp/ompi.compute-a20-3.34676/pid.30383/0/0
[compute-a20-3.XXX.YYY.ZZZ:30383] jobdir: /tmp/ompi.compute-a20-3.34676/pid.30383/0
[compute-a20-3.XXX.YYY.ZZZ:30383] top: /tmp/ompi.compute-a20-3.34676/pid.30383
[compute-a20-3.XXX.YYY.ZZZ:30383] top: /tmp/ompi.compute-a20-3.34676
[compute-a20-3.XXX.YYY.ZZZ:30383] tmp: /tmp
[compute-a20-3.XXX.YYY.ZZZ:30383] [[29315,0],0] orted_cmd: received add_local_procs
[compute-a20-3.XXX.YYY.ZZZ:30383] [[29315,0],0] Releasing job data for [INVALID]
App launch reported: 1 (out of 1) daemons - 0 (out of 1) procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 1
MPIR_proctable:
  (i, host, exe, pid) = (0, compute-a20-3, /software/hello-world/1.0/gcc/8.4.0/openmpi/3.1.5/linux-rhel8-x86_64/bin/./hello-world-mpi, 30387)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[compute-a20-3.XXX.YYY.ZZZ:30387] procdir: /tmp/ompi.compute-a20-3.34676/pid.30383/1/0
[compute-a20-3.XXX.YYY.ZZZ:30387] jobdir: /tmp/ompi.compute-a20-3.34676/pid.30383/1
[compute-a20-3.XXX.YYY.ZZZ:30387] top: /tmp/ompi.compute-a20-3.34676/pid.30383
[compute-a20-3.XXX.YYY.ZZZ:30387] top: /tmp/ompi.compute-a20-3.34676
[compute-a20-3.XXX.YYY.ZZZ:30387] tmp: /tmp
[compute-a20-3][[29315,1],0][btl_openib_ini.c:172:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4099
[compute-a20-3][[29315,1],0][btl_openib_ini.c:188:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon
[compute-a20-3][[29315,1],0][btl_openib_ini.c:172:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[compute-a20-3][[29315,1],0][btl_openib_ini.c:188:opal_btl_openib_ini_query] Found corresponding INI values: default

At this point the code just hangs indefinitely. I see a PID 30387 named hello-world-mpi with 3 threads, consuming ~100% of a CPU core, but strace just shows it making epoll_wait calls. The "Releasing job data for [INVALID]" looks suspicious, but from the source code I believe that is just because I am running outside of a scheduler, so there is no job number. I suspect the problem is the 0 in the line

App launch reported: 1 (out of 1) daemons - 0 (out of 1) procs

but I am at a loss as to why, or how to fix it. I can run the same example on one of the nodes still at RHEL6 (compiled against the OpenMPI we have on that system) and it works as expected.
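For reference, the vader/tcp runs mentioned above that do complete are of this form (a sketch rather than the exact command history: the -n count and binary path match the hanging invocation, and the extra verbosity/debug flags are omitted):

bash> mpirun -H localhost --mca btl vader,self -n 1 ./hello-world-mpi
bash> mpirun -H localhost --mca btl tcp,self -n 1 ./hello-world-mpi

Both print the expected hello message and exit cleanly.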
I am able to run ibv_rc_pingpong between nodes (between a pair of RHEL8 nodes, between a pair of RHEL6 nodes, mixed, i.e. one RHEL6 node and one RHEL8 node, and of course within the same node; invocations sketched below), so I do not see any obvious InfiniBand issues.

If anyone could give suggestions/tips/ideas on how to proceed to diagnose or fix this issue, I would be grateful. Thanks in advance for any suggestions.

================================================
System/etc details
================================================

The issue is occurring on a RHEL8 system, specifically 8.1 with kernel 4.18.0-147.5.1.el8_1.x86_64, running OpenMPI 3.1.5 (built with gcc 8.4.0 using spack).

The issue is in the openib BTL (the vader and tcp BTLs seem to be working). We are using OpenFabrics from Mellanox (libibverbs-41mlnx1-OFED.5.0.0.0.9.50100.0.src.rpm), with a subnet manager running on a Mellanox FDR IB switch (SX_PPC_M460EX).

The "working" RHEL6 systems are running 6.10, kernel 2.6.32-754.25.1.el6.x86_64, with OpenMPI 1.10.2 built with gcc 6.1.0.

The memorylocked limit on both the RHEL8 and RHEL6 nodes is unlimited.
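For completeness, the pingpong and limit checks mentioned above were done roughly as follows (the hostname is illustrative, and ibv_rc_pingpong was run with its defaults):

# on one node, start the pingpong server (no arguments):
bash> ibv_rc_pingpong
# on the other node, point the client at the server:
bash> ibv_rc_pingpong compute-a20-3
# memory-locked limit, checked on both RHEL6 and RHEL8 nodes:
bash> ulimit -l
unlimited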
On the RHEL8 node, ibv_devinfo returns:

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.32.5100
        node_guid:                      f452:1403:0070:1c80
        sys_image_guid:                 f452:1403:0070:1c83
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        board_id:                       DEL0A30000019
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 532
                        port_lid:               536
                        port_lmc:               0x00
                        link_layer:             InfiniBand

(The "working" RHEL6 systems give essentially identical ibv_devinfo output, except for the node_guid, sys_image_guid, and port_lid values.)

The output of ompi_info --all on the RHEL8 node is attached. As indicated earlier, I am running on the same node where the mpirun command is issued.

The output of ifconfig -a on the RHEL8 node is:

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.103.132.13  netmask 255.255.224.0  broadcast 10.103.159.255
        inet6 fe80::3617:ebff:fee6:6a31  prefixlen 64  scopeid 0x20<link>
        ether 34:17:eb:e6:6a:31  txqueuelen 1000  (Ethernet)
        RX packets 1599943  bytes 345477382 (329.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2147871  bytes 3010964444 (2.8 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device memory 0x91120000-9113ffff

eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 34:17:eb:e6:6a:32  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device memory 0x91100000-9111ffff

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 192.168.68.13  netmask 255.255.224.0  broadcast 192.168.95.255
        inet6 fe80::f652:1403:70:1c81  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 49701  bytes 45121502 (43.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 25427  bytes 5740480 (5.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 476287  bytes 23889166 (22.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 476287  bytes 23889166 (22.7 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads       paye...@umd.edu
5825 University Research Park           (301) 405-6135
University of Maryland
College Park, MD 20740-3831
Attachment: ompi-info.all.a20-3.bz2