Hello all:

We recently upgraded from 20.11.8 to 21.08.5 (CentOS 7.9, Slurm built without pmix support). After that, we found that in many cases 'mpirun' was causing all MPI ranks within a node to run on the same core in multi-node MPI jobs. We've since moved to 'srun'.
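(In case anyone wants to check for the same binding problem: one quick way, not specific to our setup, is just printing each rank's CPU affinity. The snippet below is illustrative; 'taskset' is from util-linux, and the srun wrapper line is an example, not a paste from our job scripts.)

```shell
# Print this process's allowed CPUs (util-linux 'taskset').
# Output looks like: pid 1234's current affinity list: 0-9
#
# Under Slurm, run the same thing once per rank, e.g.:
#   srun --ntasks-per-node=10 bash -c 'echo "$(hostname) rank $SLURM_PROCID: $(taskset -cp $$)"'
# With the broken mpirun binding, every rank on a node reports
# the same single CPU instead of distinct ones.
taskset -cp $$
```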
Now we see a problem in which the OOM killer is, in some cases, predictably killing job steps that don't seem to deserve it. Some of these are job scripts and input files which ran fine before our Slurm upgrade. More details follow, but that's the issue in a nutshell.

Other than the version, our one Slurm config change was to remove the deprecated 'TaskPluginParam=Sched' from slurm.conf, leaving it at its default '(null)' value. Our TaskPlugin remains 'task/affinity,task/cgroup'. We've had apparently correct cgroup-based memory limit enforcement in place for a long time, so the OOM-killing of the jobs I'm referencing is a change in behavior.

Below are some of our support team's findings. I haven't finished correlating the anomalous job events with specific OOM complaints, or recording job resource usage at those times. I'm just throwing out this message in case what we've seen so far (or the painfully obvious thing I'm missing) looks familiar to anyone. Thanks!

Application: VASP 6.1.2, launched with srun
MPI libraries: intel/2019b
Observations:

Test 1. QDR-fabric Intel nodes (20 nodes x 10 cores/node)
  outcome: job failed right away, no output generated
  error text: 20 occurrences resembling "[13:ra8-10] unexpected reject event from 9:ra8-9"

Test 2. EDR-fabric Intel nodes (20 nodes x 10 cores/node)
  outcome: job ran for 12 minutes, generated some output data that look fine
  error text: none; the job simply failed

Test 3.
  AMD Rome nodes (20 nodes x 10 cores/node)
  outcome: job completed successfully after 31 minutes; user confirmed the results are fine

Application: Quantum Espresso 6.5, launched with srun
MPI libraries: intel/2019b
Observations:
- Works correctly when using: 1 node x 64 cores (64 MPI processes), 1 node x 128 cores (128 MPI processes)
  (other QE parameters: -nk 1 -nt 4; --mem-per-cpu=1500mb)
- A few processes get OOM-killed after a while when using: 4 nodes x 32 cores (128 MPI processes), 4 nodes x 64 cores (256 MPI processes)
- Job fails within seconds when using: 16 nodes x 8 cores

--
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
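P.S. For concreteness, the task-related config as I described it above, paraphrased rather than pasted verbatim (the ConstrainRAMSpace line is the usual cgroup memory-enforcement knob and is my assumption of what's relevant here, not something I've singled out as changed):

    # slurm.conf (excerpt)
    TaskPlugin=task/affinity,task/cgroup
    # TaskPluginParam=Sched   <-- removed at the 21.08.5 upgrade; now defaults to (null)

    # cgroup.conf (excerpt)
    ConstrainRAMSpace=yes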