Hi Jeff & list,

We've encountered the same problem after upgrading to 21.08.8-2.  All jobs failed
with "Slurmd could not execve job".
I traced this down to the slurmstepd process failing to modify the cgroup
setting "memory.memsw.limit_in_bytes", which happens because we have
"ConstrainSwapSpace=yes" in Slurm's cgroup.conf.
The error shows up on Debian/Ubuntu systems because they don't enable cgroup
swap accounting by default.
The fix is to boot with the kernel option "swapaccount=1" (i.e. add it to the
grub/pxelinux/... boot config), or to set "ConstrainSwapSpace=no" if swap
accounting is not needed.
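For reference, a rough sketch of how to check and apply both options on a
Debian/Ubuntu node (this assumes cgroup v1 and GRUB; pxelinux or other boot
loaders need the equivalent change, and the paths may differ on your setup):

  # Check whether swap accounting is currently enabled; if the memsw file
  # is missing under the memory controller, it is off:
  grep -o 'swapaccount=[01]' /proc/cmdline
  ls /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes

  # Option 1: enable swap accounting at boot, e.g. in /etc/default/grub:
  GRUB_CMDLINE_LINUX_DEFAULT="... swapaccount=1"
  # then regenerate the boot config and reboot the node:
  update-grub && reboot

  # Option 2: stop constraining swap in Slurm's cgroup.conf:
  ConstrainSwapSpace=no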

As I understand it, with our config this error should also have shown up in
previous versions of Slurm.  Perhaps it did happen, but wasn't caught properly.

BTW, what is the correct way to see the debug messages from slurmstepd?  Even
when running slurmd with "-D -vvvvv", they didn't show up.  I resorted to
running slurmd under strace to see where the error happens, which revealed
that slurmstepd was printing some messages, but strace adds a lot of overhead.
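My current impression (an assumption on my part, not something I've verified)
is that slurmstepd follows the SlurmdDebug/SlurmdLogFile settings from
slurm.conf rather than the terminal of a foreground slurmd, so something like
the following might be the intended way; corrections welcome:

  # slurm.conf on the node (the log file path is just an example):
  SlurmdDebug=debug5
  SlurmdLogFile=/var/log/slurm/slurmd.log

  # watch the log while reproducing the failing step:
  tail -f /var/log/slurm/slurmd.log

  # or raise the level temporarily without restarting the daemons:
  scontrol setdebug debug5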

Regards,
Frank

(apologies for breaking threading, I wasn't subscribed to slurm-users at the 
time and can't reply properly)


> My site recently updated to Slurm 21.08.6 and for the most part everything
> went fine.  Two Ubuntu nodes, however, are having issues.  Slurmd cannot
> execve the jobs on the nodes.  As an example:
> 
> [jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell
> --nodelist=mdgx01 --partition=dgx /bin/bash
> salloc: Granted job allocation 2328489
> [jrlang@tmgt1 ~]$ srun hostname
> srun: error: task 0 launch failed: Slurmd could not execve job
> srun: error: task 1 launch failed: Slurmd could not execve job
> [...]
> srun: error: task 19 launch failed: Slurmd could not execve job
> 
> Looking in slurmd-mdgx01.log we only see
> 
> [2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task 
> plugins: one of task_p_pre_setuid functions returned error
> [2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting 
> abnormally: Slurmd could not execve job
> [2022-03-24T14:44:02.411] [2328501.interactive] done with job
> 
> 
> Note that this issue didn't occur with Slurm 20.11.8.
> 
> Any ideas what could be causing the issue, because I'm stumped?
> 
> Jeff
