Greetings Slurm gurus --

I've been having an issue where, very occasionally, an OpenMPI job launched 
with srun dies during startup within MPI_Init(), e.g. srun -N 8 
--ntasks-per-node=1 ./hello_world_mpi.  The same binary launched with mpirun 
does not experience the issue, e.g. mpirun -n 64 -H cn01,... ./hello_world_mpi.  
The failure rate seems to be in the 0.5% - 1.0% range when using srun for 
launch.
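
For anyone wanting to reproduce a failure-rate measurement like this, here is 
a minimal sketch of a repeat-and-count harness.  It is not the script used for 
the numbers above.  The names run_trials and failure_rate are hypothetical, and 
it assumes the launcher exits non-zero whenever a rank dies, which should be 
confirmed for the srun configuration in use.

```python
import subprocess
import sys


def run_trials(cmd, trials):
    """Run cmd `trials` times and return how many runs exited non-zero.

    Assumption: a crashed rank makes the launcher itself exit non-zero.
    """
    failures = 0
    for _ in range(trials):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


def failure_rate(failures, trials):
    """Fraction of failed runs; 0.0 when no runs were made."""
    return failures / trials if trials else 0.0


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: script.py <trials> <command> [args...]
    trials = int(sys.argv[1])
    cmd = sys.argv[2:]
    failed = run_trials(cmd, trials)
    rate = 100 * failure_rate(failed, trials)
    print(f"{failed}/{trials} runs failed ({rate:.2f}%)")
```

It would be invoked with the trial count followed by the launch command, for 
example 1000 followed by the srun command line shown above.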

SW stack is self-built with:

* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3

The MPI code is a simple hello_world_mpi.c, but the program itself does not 
seem to matter: anything that goes through startup via srun can hit this.  The 
application core dump looks like the following regardless of which test is 
running:

[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***

More than one rank on a node can die with the same stack trace when this 
happens - I've seen as many as 6.  One other interesting note is that if I 
change my srun command line to include strace (e.g. srun -N 8 
--ntasks-per-node=8 strace <strace-options> ./hello_world_mpi), the issue 
appears to go away: 0 failures in ~2500 runs.  Another thing that seems to 
help is disabling cgroups in slurm.conf.  After that change, I saw 0 failures 
in >6100 hello_world_mpi runs.
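
As a rough sanity check that these zero-failure counts are unlikely to be 
luck, the arithmetic can be sketched as below.  It assumes every run fails 
independently with a fixed probability, which may not hold on a real cluster, 
and the function names are hypothetical.  The run counts and the 0.5% - 1.0% 
baseline rates are the ones quoted above.

```python
def prob_zero_failures(p, runs):
    """Chance of seeing no failures in `runs` independent runs,
    each failing with probability p."""
    return (1.0 - p) ** runs


def upper_bound_rate(runs, confidence=0.95):
    """One-sided upper bound on the true failure rate after `runs`
    runs with zero observed failures (solves (1 - p)**runs = 1 - confidence)."""
    return 1.0 - (1.0 - confidence) ** (1.0 / runs)


# Run counts quoted in the text: ~2500 with strace, >6100 without cgroups.
for label, runs in (("strace", 2500), ("cgroups disabled", 6100)):
    for p in (0.005, 0.010):
        chance = prob_zero_failures(p, runs)
        print(f"{label}: P(0 failures in {runs} runs | p={p}) = {chance:.3g}")
    bound = upper_bound_rate(runs)
    print(f"{label}: 95% upper bound on failure rate = {bound:.3%}")
```

Under those assumptions, both zero-failure results would be very improbable if 
the original failure rate still applied, so the two changes do appear to have 
a real effect.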

The changes in slurm.conf were as follows.  Original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup

Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux

My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95

Curious if anyone has any thoughts on next steps to help figure out what might 
be going on and how to resolve it.  Currently, I'm planning to back down to the 
23.02.7 release and see how that goes, but I'm open to other suggestions.

Thanks,

Brent


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
