Thanks for the suggestion, Ole - I tried this out yesterday on RHEL 9.4 with
two slightly different setups.
1) Using the stock ice driver that comes with RHEL 9.4 for the card, I still saw
the issue.
2) There was no pre-built version of the ice driver on the Intel download
site, so I buil
Thanks for the suggestion Ole - I'll see if I can get that in the mix to try
over the next few days.
I can report that the 23.02.7 tree had the same issues, so going backwards on the
Slurm version did not have any impact.
Brent
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubs
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
Over the past few days I grabbed some time on the nodes and ran for a few
hours. Looks like I **can** still hit the issue with cgroups disabled.
The incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude
or so. Guessing that
Subject: [slurm-users] Re: srun launched mpi job occasionally core dumps
Re-tested with Slurm 23.02.7 (I also had to disable slurmdbd and run the
controller with the '-i' option) but still reproduced the issue fairly quickly.
Feels like the issue might be some interaction between RHEL 9.3 cgroups and
Slurm. Not sure what to try next - hoping for some suggestions.
Thank