[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-10 Thread Henderson, Brent via slurm-users
Thanks for the suggestion Ole - I tried this out yesterday with RHEL 9.4 with two slightly different setups. 1) With the stock ice driver that comes with RHEL 9.4 for the card, I still saw the issue. 2) There was no pre-built version of the ice driver on the Intel download site, so I

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-08 Thread Henderson, Brent via slurm-users
Thanks for the suggestion Ole - I'll see if I can get that in the mix to try over the next few days. I can report that 23.02.7 tree had the same issues, so going backwards on the slurm bits did not have any impact. Brent

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
> Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I **can** still hit the issue with cgroups disabled. Incident rate was 8 out of >11k jobs so dropped an order of magnitude or so. Guessing

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Henderson, Brent via slurm-users
Re-tested with slurm 23.02.7 (had to also disable slurmdbd and run the controller with the '-i' option) but still reproduced the issue fairly quickly. Feels like the issue might be some interaction with RHEL 9.3 cgroups and sl

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-02 Thread Henderson, Brent via slurm-users
Re-tested with slurm 23.02.7 (had to also disable slurmdbd and run the controller with the '-i' option) but still reproduced the issue fairly quickly. Feels like the issue might be some interaction with RHEL 9.3 cgroups and slurm. Not sure what to try next - hoping for some suggestions.