[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
The end goal is to see the following two things: jobs under the slurmstepd cgroup path, and at least cpu, cpuset, and memory listed in the cgroup.controllers file within each job's cgroup. The pattern you have would be the processes left after boot, first failed slurmd service start which
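For reference, one way to check both of those on a node (a sketch assuming the usual cgroup v2 layout where Slurm places job cgroups under the slurmstepd scope in system.slice; the exact path can differ between setups):

    # list the job cgroups created by slurmstepd
    ls /sys/fs/cgroup/system.slice/slurmstepd.scope/
    # show the controllers visible inside a job cgroup; should include cpu cpuset memory
    cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*/cgroup.controllers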

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
Thanks for the hint. So you end up with two "slurmstepd infinity" processes, like me, when I tried this workaround?

[root@node ~]# ps aux | grep slurm
root      1833  0.0  0.0  33716  2188 ?  Ss  21:02  0:00 /usr/sbin/slurmstepd infinity
root      2259  0.0  0.0 236796 12108 ?  Ss

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.schedmd.com/cgroup_v2.html There is probably a better way to do this, but this is what we do to deal with that:

:: files/slurm-cgrepair.service ::
[Unit]
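The unit file is truncated above; purely as an illustration (not the poster's actual file), a minimal service that keeps slurmstepd infinity running before slurmd could look roughly like this:

    [Unit]
    # start the placeholder slurmstepd before slurmd so its cgroup hierarchy exists first
    Description=Run slurmstepd infinity ahead of slurmd (cgroup v2 workaround)
    Before=slurmd.service

    [Service]
    Type=simple
    ExecStart=/usr/sbin/slurmstepd infinity

    [Install]
    WantedBy=multi-user.target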

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
I observe the same behavior on Slurm 23.11.5, Rocky Linux 8.9.

> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat
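For context, cgroup.subtree_control at a given level lists the controllers delegated to child cgroups; if cpu and cpuset are missing there, cgroups further down (including Slurm's job cgroups) cannot use them. Purely to illustrate the cgroup v2 mechanism (systemd normally manages this delegation, so writing here by hand may not be the right fix for this setup):

    # controllers available at this level
    cat /sys/fs/cgroup/cgroup.controllers
    # delegate cpu and cpuset to child cgroups of the root
    echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control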

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-11 Thread Christopher Samuel via slurm-users
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
> In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for some other work). What can we do in such a case?
Mark the node as "DOWN" in Slurm; this is what we do when we get
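For example (node name and reason here are placeholders):

    scontrol update NodeName=node01 State=DOWN Reason="node removed from cluster"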