[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
The end goal is to see the following two things: jobs under the slurmstepd cgroup path, and at least cpu, cpuset, and memory listed in the cgroup.controllers file within each job's cgroup. The pattern you have would be the processes left after boot, first failed slurmd service start which
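For reference, one way to check both of those on a node (a sketch assuming the usual cgroup v2 layout where Slurm places job cgroups under the slurmstepd scope in system.slice; the exact path can differ between setups):

    # list the job cgroups created by slurmstepd
    ls /sys/fs/cgroup/system.slice/slurmstepd.scope/
    # show the controllers visible inside a job cgroup; should include cpu cpuset memory
    cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*/cgroup.controllers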

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
Thanks for the hint. So you end up with two "slurmstepd infinity" processes, like me, when I tried this workaround?

[root@node ~]# ps aux | grep slurm
root      1833  0.0  0.0  33716  2188 ?  Ss  21:02  0:00 /usr/sbin/slurmstepd infinity
root      2259  0.0  0.0 236796 12108 ?  Ss

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.schedmd.com/cgroup_v2.html There is probably a better way to do this, but this is what we do to deal with that:

:: files/slurm-cgrepair.service ::
[Unit]
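The unit file is truncated above; purely as an illustration (not the poster's actual file), a minimal service that keeps slurmstepd infinity running before slurmd could look roughly like this:

    [Unit]
    # start the placeholder slurmstepd before slurmd so its cgroup hierarchy exists first
    Description=Run slurmstepd infinity ahead of slurmd (cgroup v2 workaround)
    Before=slurmd.service

    [Service]
    Type=simple
    ExecStart=/usr/sbin/slurmstepd infinity

    [Install]
    WantedBy=multi-user.target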

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
I observe the same behavior on Slurm 23.11.5, Rocky Linux 8.9.

> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat
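For context, cgroup.subtree_control at a given level lists the controllers delegated to child cgroups; if cpu and cpuset are missing there, cgroups further down (including Slurm's job cgroups) cannot use them. Purely to illustrate the cgroup v2 mechanism (systemd normally manages this delegation, so writing here by hand may not be the right fix for this setup):

    # controllers available at this level
    cat /sys/fs/cgroup/cgroup.controllers
    # delegate cpu and cpuset to child cgroups of the root
    echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control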

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-11 Thread Christopher Samuel via slurm-users
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
> In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for some other work). What can we do in such a case?
Mark the node as "DOWN" in Slurm; this is what we do when we get
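For example (node name and reason here are placeholders):

    scontrol update NodeName=node01 State=DOWN Reason="node removed from cluster"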