[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-10 Thread Henderson, Brent via slurm-users
Thanks for the suggestion Ole - I tried this out yesterday on RHEL 9.4 with two 
slightly different setups.

1) Using the stock ice driver that ships with RHEL 9.4 for the card, I still saw 
the issue.

2) There was no pre-built version of the ice driver on the Intel download site, 
so I built it myself, rebooted, and re-ran the test.  This greatly reduced the 
number of occurrences of the issue, but didn't eliminate them.

This is similar to what I saw on the RHEL 9.3 setup (adding the Intel ice 
driver reduced occurrences but did not eliminate them entirely).

I can also report that the 23.02.7 tree had similar results on the 9.3 node 
setup.  Going backwards on the slurm bits did not seem to change the number of 
occurrences.

Unfortunately I think I'm out of time for experiments on these nodes, but maybe 
this thread will be useful to others down the road.

Brent

PS - sorry for my last post getting tagged as a new issue.  Hopefully this one 
will thread correctly.


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-08 Thread Henderson, Brent via slurm-users
Thanks for the suggestion Ole - I'll see if I can get that into the mix to try 
over the next few days.

I can report that the 23.02.7 tree had the same issues, so going backwards on the 
slurm bits did not have any impact.

Brent


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users

On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
Over the past few days I grabbed some time on the nodes and ran for a few 
hours.  Looks like I **can** still hit the issue with cgroups disabled.  
Incident rate was 8 out of >11k jobs so dropped an order of magnitude or 
so.  Guessing that exonerates cgroups as the cause, but possibly just a 
good way to tickle the real issue.  Over the next few days, I’ll try to 
roll everything back to RHEL 8.9 and see how that goes.


My 2 cents: RHEL/AlmaLinux/RockyLinux 9.4 is out now, maybe it's worth a 
try to update to 9.4?


/Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Henderson, Brent via slurm-users
Over the past few days I grabbed some time on the nodes and ran for a few 
hours.  Looks like I *can* still hit the issue with cgroups disabled.  Incident 
rate was 8 out of >11k jobs so dropped an order of magnitude or so.  Guessing 
that exonerates cgroups as the cause, but possibly just a good way to tickle 
the real issue.  Over the next few days, I'll try to roll everything back to 
RHEL 8.9 and see how that goes.

Brent


From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Thursday, May 2, 2024 11:32 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: srun launched mpi job occasionally core dumps

Re-tested with slurm 23.02.7 (had to also disable slurmdbd and run the 
controller with the '-i' option) but still reproduced the issue fairly quickly.  
Feels like the issue might be some interaction between RHEL 9.3 cgroups and 
slurm.  Not sure what to try next - hoping for some suggestions.

Thanks,

Brent


From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Wednesday, May 1, 2024 11:21 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] srun launched mpi job occasionally core dumps

Greetings Slurm gurus --

I've been having an issue where, very occasionally, an srun-launched OpenMPI job 
will die during startup within MPI_Init().  E.g. srun -N 8 
--ntasks-per-node=1 ./hello_world_mpi.  The same binary launched with mpirun does 
not experience the issue.  E.g. mpirun -n 64 -H cn01,... ./hello_world_mpi.  
The failure rate seems to be in the 0.5% - 1.0% range when launching with srun.
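
For reference, hello_world_mpi.c is just the canonical MPI init/rank-report/finalize 
test.  The exact source isn't included in this thread, so the following is only an 
illustrative sketch of that kind of test, not the code that produced the traces below:

/* hello_world_mpi.c - minimal MPI startup test (illustrative sketch only;
 * the actual test source was not posted in this thread). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* reported segfault occurs inside MPI_Init() */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Anything along these lines, built against the Open MPI 5.0.3 install and launched via 
srun as above, goes through the same startup path that is crashing.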

SW stack is self-built with:

* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3

The MPI code is a simple hello_world_mpi.c - anything that goes through MPI 
startup via srun will do, so the specific test does not seem to matter.  The 
application core dump looks like the following regardless of the test being run:

[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] 
/share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] 
/share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] 
/lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] 
/share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] 
/share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] 
/share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] 
/share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***

More than one rank can die with the same stack trace on a node when this happens 
- I've seen as many as 6.  One other interesting note: if I change my 
srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace 
 ./hello_world_mpi), the issue appears to go away - 0 failures 
in ~2500 runs.  Another thing that seems to help is disabling cgroups in the 
slurm.conf.  After that change, I saw 0 failures in >6100 hello_world_mpi runs.

The changes in the slurm.conf were - original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup

Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux

My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95

Curious if anyone has any thoughts on next steps to help figure out what might 
be going on and how to resolve it.  Currently, I'm planning to back down to the 
23.02.7 release and see how that goes, but I'm open to other suggestions.

Thanks,

Brent



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

