[slurm-users] Re: [EXTERN] Re: scheduling according to time requirements
Hi Loris,

On 4/30/24 4:26 PM, Loris Bennett via slurm-users wrote:

>> Diego suggested a possible way which seems to work after a quick test.
>
> Yes, I wasn't aware of that, but it might also be useful for us, too.
>
>> We would like to have only a subset of the nodes in a partition for long running jobs, so that there are enough nodes available for short jobs. The nodes for the long partition, however, are also part of the short partition, so they can also be utilized when no long jobs are running. That's our idea.
>
> If you have plenty of short running jobs, that is probably a reasonable approach. On our system, the number of short running jobs would probably tend to dip significantly over the weekend and public holidays, so resources would potentially be blocked for the long running jobs. On the other hand, long-running jobs on our system often run for days, so one day here or there might not be so significant. And if the long-running jobs were able to start in the short partition, they could block short jobs.
>
> The other thing to think about with regard to short jobs is backfilling. With our mix of jobs, unless a job needs a large amount of memory or a large number of cores, those with a run-time of only a few hours should be backfilled fairly efficiently.
>
> Regards
> Loris

You are absolutely right, and I guess we will need to optimize using QoS. Thanks for your input and thoughts.

Dietmar
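For batch jobs, the same multi-partition workaround from earlier in the thread can be expressed directly in the job script. A minimal sketch, assuming the partition names from the original post (the script itself is illustrative and not something posted on the list):

#!/bin/bash
#SBATCH --job-name=multi-partition-test
#SBATCH --partition=standard,medium,long
#SBATCH --time=05:00:00
#SBATCH --cpus-per-task=1

srun hostname

With a comma-separated partition list, the job is eligible for every listed partition and starts in whichever of them can run it first; here the 5-hour time limit rules out "standard" but still fits "medium" and "long".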
[slurm-users] Re: [EXTERN] Re: scheduling according to time requirements
Dear Thomas,

the QoS approach seems really helpful, we'll look into it. As a starting point for us, could you perhaps translate my simple example into a QoS config/setting?

Thanks so much

Dietmar

On 4/30/24 4:26 PM, Thomas Hartmann via slurm-users wrote:
> Hi Dietmar,
>
> I was facing quite similar requirements to yours. We ended up using QoS instead of partitions because this approach provides higher flexibility and more features. The basic distinction between the two approaches is that partitions are node-based while QoS are (essentially) resource-based. So, instead of saying "Long jobs can only run on nodes 9 and 10" you would be able to say "Long jobs can only use X CPU cores in total".
>
> However, yes, your partition-based approach is going to do the job, as long as you do not need any QoS-based preemption.
>
> Cheers,
> Thomas
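A rough sketch of what a QoS-based version of the three-partition example could look like. The QoS names, the GrpTRES core counts, the placeholder user name and the AccountingStorageEnforce setting are assumptions for illustration, not settings from the thread, and they require a working slurmdbd/accounting setup:

# slurm.conf: enforce QoS and association limits
AccountingStorageEnforce=limits,qos

# one QoS per time class; GrpTRES caps how many cores all jobs in that QoS may use at once
$ sacctmgr add qos short
$ sacctmgr modify qos short set MaxWall=04:00:00 Priority=100
$ sacctmgr add qos medium
$ sacctmgr modify qos medium set MaxWall=24:00:00 GrpTRES=cpu=480
$ sacctmgr add qos long
$ sacctmgr modify qos long set MaxWall=14-00:00:00 GrpTRES=cpu=192
$ sacctmgr modify user someuser set qos+=short,medium,long defaultqos=short

# jobs then pick a QoS instead of a partition
$ sbatch --qos=long -t 200:00:00 job.sh

With something like this, all nodes can stay in a single partition and the "long" capacity is limited by the GrpTRES cap rather than by pinning jobs to specific nodes, which matches the node-based vs. resource-based distinction Thomas describes.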
[slurm-users] Re: [EXTERN] Re: scheduling according to time requirements
Hi Loris,

On 4/30/24 3:43 PM, Loris Bennett via slurm-users wrote:

>> Thanks for the hint. This works nicely, but it would be nice if I did not need to specify the partition at all. Any thoughts?
>
> I am not aware that you can set multiple partitions as a default.

Diego suggested a possible way which seems to work after a quick test.

> The question is why you actually need partitions with different maximum runtimes.

We would like to have only a subset of the nodes in a partition for long running jobs, so that there are enough nodes available for short jobs. The nodes for the long partition, however, are also part of the short partition, so they can also be utilized when no long jobs are running. That's our idea.

> In our case, a university cluster with a very wide range of codes and usage patterns, multiple partitions would probably lead to fragmentation and wastage of resources due to the job mix not always fitting well to the various partitions. Therefore, I am a member of the "as few partitions as possible" camp, and so in our set-up we have essentially only one partition with a DefaultTime of 14 days. We do, however, let users set a QOS to gain a priority boost in return for accepting a shorter run-time and a reduced maximum number of cores.

We didn't look into QOS yet, but this might also be a way to go, thanks.

> Occasionally people complain about short jobs having to wait in the queue for too long, but I have generally been successful in solving the problem by having them estimate their resource requirements better or bundling their work in order to increase the run-time-to-wait-time ratio.

Dietmar
[slurm-users] Re: [EXTERN] Re: scheduling according to time requirements
Hi Diego,

thanks a lot, it seems to work as far as I was able to test now.

Dietmar

On 4/30/24 3:24 PM, Diego Zuccato via slurm-users wrote:
> Try adding to the config:
>
> EnforcePartLimits=ANY
> JobSubmitPlugins=all_partitions
>
> Diego
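For reference, a sketch of how Diego's two settings would sit next to the partition definitions from the original post; the reconfigure and verification commands below are illustrative assumptions, not taken from the thread:

# slurm.conf (excerpt)
EnforcePartLimits=ANY
JobSubmitPlugins=all_partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 DefaultTime=24:00:00 State=UP OverSubscribe=NO

$ scontrol reconfigure
$ sbatch -t 05:00:00 --wrap 'sleep 60'
$ squeue -o '%i %P %l %T'

The job_submit/all_partitions plugin puts a job that does not name a partition into all available partitions, and EnforcePartLimits=ANY accepts it as long as at least one of those partitions can satisfy its limits, so a 5-hour job submitted without -p becomes eligible for medium and long.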
[slurm-users] Re: [EXTERN] Re: scheduling according to time requirements
Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:
> You can specify multiple partitions, e.g.
>
> $ salloc --cpus-per-task=1 --time=01:00:01 --partition=standard,medium,long
>
> Notice that rather than using 'srun ... --pty bash', as far as I understand, the preferred method is to use 'salloc' as above, and to use 'srun' for starting MPI processes.

Thanks for the hint. This works nicely, but it would be nice if I did not need to specify the partition at all. Any thoughts?

Dietmar
[slurm-users] scheduling according to time requirements
Hi,

is it possible to have Slurm schedule jobs automatically to a fitting partition according to the "-t" time requirement?

e.g. 3 partitions:

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 DefaultTime=24:00:00 State=UP OverSubscribe=NO

So in the standard partition, which is the default, we have all nodes and a max time of 4h; in the medium partition we have 4 nodes with a max time of 24h; and in the long partition we have 2 nodes with a max time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on any node (standard partition), whereas when specifying -t 05:00:00 or -t 48:00:00 the job will run on the nodes of the medium or long partition, respectively. However, my job will not get scheduled at all when -t is greater than 01:00:00, i.e.

]$ srun --cpus-per-task 1 -t 01:00:01 --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

It will wait forever because the standard partition is selected. I was thinking that Slurm would automatically switch to the medium partition. Do I misunderstand something here? Or can this be configured somehow?

Thanks so much and sorry for the naive question

Dietmar
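For anyone reproducing this, a quick way to see why such a job stays pending; the commands are illustrative and simply reuse the job id from the example above:

$ scontrol show job 42095 | grep -E 'JobState|Reason|Partition|TimeLimit'
$ squeue -j 42095 -o '%i %P %l %T %r'

The pending reason (typically something like PartitionTimeLimit) confirms that the requested time limit exceeds the MaxTime of the only partition the job was actually submitted to.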
[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2
Hi list,

I finally got it working. I completely overlooked that I had set OverSubscribe=EXCLUSIVE for the partition that I used for testing, stupid me.

Sorry for the noise and thanks again for your answers

Best
Dietmar
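For anyone hitting the same symptom, one way to see whether the core constraint is really applied is to look at the job's own cgroup from inside the job. A rough sketch, assuming a pure cgroup v2 (unified) hierarchy; the exact cgroup path layout depends on the local slurmd setup:

$ sbatch --cpus-per-task=4 --wrap 'cat /sys/fs/cgroup$(cut -d: -f3 /proc/self/cgroup)/cpuset.cpus.effective; grep Cpus_allowed_list /proc/self/status'

With ConstrainCores=yes in effect, both lines in the job's output should list only the allocated CPUs rather than the full 0-95 range.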
[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2
Hi Josef, hi list,

I have now rebuilt the RPMs from OpenHPC, but using the original sources from version 23.11.4. The configure command that is generated from the spec is the following:

./configure --build=x86_64-redhat-linux-gnu \
  --host=x86_64-redhat-linux-gnu \
  --program-prefix= \
  --disable-dependency-tracking \
  --prefix=/usr \
  --exec-prefix=/usr \
  --bindir=/usr/bin \
  --sbindir=/usr/sbin \
  --sysconfdir=/etc/slurm \
  --datadir=/usr/share \
  --includedir=/usr/include \
  --libdir=/usr/lib64 \
  --libexecdir=/usr/libexec \
  --localstatedir=/var \
  --sharedstatedir=/var/lib \
  --mandir=/usr/share/man \
  --infodir=/usr/share/info \
  --enable-multiple-slurmd \
  --with-pmix=/opt/ohpc/admin/pmix \
  --with-hwloc=/opt/ohpc/pub/libs/hwloc

(Am I missing something here?)

The configure output shows:

[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]

config.log:

dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1

confdefs.h:

#define WITH_CGROUP 1
#define HAVE_BPF 1

However, I still can't see any CPU limits when I use sbatch to run a batch job:

$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
$ cat slurm-72.out
Cpus_allowed: ,,
Cpus_allowed_list: 0-95

The logs from the head node (leto) and the compute node (apollo-01) are showing:

Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: _start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 done

Best
Dietmar

On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:
>> I'm running slurm 22.05.11 which is available with OpenHPC 3.x
>> Do you think an upgrade is needed?
>
> I feel that a lot of slurm operators tend to not use 3rd-party sources of slurm binaries, as you do not have the build environment fully in your hands.
>
> But before making such a complex decision, perhaps look for the build logs of the slurm you use (somewhere in the OpenHPC build system?) and check if it was built with the libraries needed to have cgroups v2 working. Not having the cgroups v2 dependencies at build time is only one of the possible causes.
>
> josef
[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2
Hi,

I'm running slurm 22.05.11 which is available with OpenHPC 3.x. Do you think an upgrade is needed?

Best
Dietmar

On 2/28/24 14:55, Josef Dvoracek via slurm-users wrote:
> Hi Dietmar;
>
> I tried this on ${my cluster}, as I switched to cgroups v2 quite recently. I must say that on my setup it looks like it works as expected, see the grepped stdout from your reproducer below. I use recent slurm 23.11.4.
>
> Wild guess: does your build machine have the bpf and dbus devel packages installed? (Both packages are fine to be absent when doing a build for cgroups v1 slurm.)
>
> cheers
>
> josef
>
> [jose@koios1 test_cgroups]$ cat slurm-7177217.out | grep eli
> ValueError: CPU number 7 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 4 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 5 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 11 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 9 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 10 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 14 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 8 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 12 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 6 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 13 is not eligible; choose between [0, 1, 2, 3]
> ValueError: CPU number 15 is not eligible; choose between [0, 1, 2, 3]
>
> On 28. 02. 24 14:28, Dietmar Rieder via slurm-users wrote:
> ...
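If it helps, a quick way to check the build-time dependencies on the build host; the package names are an assumption for EL-based systems (the bpf headers usually come with kernel-headers) and may differ on other distributions:

$ rpm -q dbus-devel kernel-headers
$ grep -E 'checking for bpf|checking for dbus|HAVE_BPF|WITH_CGROUP' config.log

If the dbus and bpf checks fail in config.log, the resulting binaries will silently fall back to builds without working cgroup v2 support.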
[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2
Hi Hermann,

I get:

Cpus_allowed: ,,
Cpus_allowed_list: 0-95

Best
Dietmar

p.s.: greetings from the CCB

On 2/28/24 15:01, Hermann Schwärzler via slurm-users wrote:
> Hi Dietmar,
>
> what do you find in the output file of this job?
>
> sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
>
> On our 64-core machines with hyperthreading enabled I see e.g.
>
> Cpus_allowed: 0400,,0400,
> Cpus_allowed_list: 58,122
>
> Greetings
> Hermann
[slurm-users] sbatch and cgroup v2
Hi,

I'm new to Slurm, but maybe someone can help me: I'm trying to restrict the CPU usage to the actually requested/allocated resources using cgroup v2. For this I made the following settings in slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf:

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98

cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids

Now, when I use sbatch to submit the following test script, the python script which is started from the batch script is utilizing all CPUs (96) at 100% on the allocated node, although I only ask for 4 cpus (--cpus-per-task=4). I'd expect that the task cannot use more than these 4.

#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin

source .bashrc
conda activate test

python test.py

The python code in test.py is the following, using the cpu_load_generator package from [1]:

#!/usr/bin/env python
import sys
from cpu_load_generator import load_single_core, load_all_cores, from_profile

load_all_cores(duration_s=120, target_load=1)  # generates load on all cores

Interestingly, when I use srun to launch an interactive job and run the python script manually, I see with top that only 4 cpus are running at 100%. And I also see python errors thrown when the script tries to start the 5th process (which makes sense):

  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core
    process.cpu_affinity([core_num])
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity
    self._proc.cpu_affinity_set(list(set(cpus)))
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set
    cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument

What am I missing, why are the CPU resources not restricted when I use sbatch?

Thanks for any input or hint

Dietmar

[1]: https://pypi.org/project/cpu-load-generator/
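As a quick sanity check, and purely as a sketch rather than something from the thread, one can confirm that the running daemons have picked up the relevant settings and that the node really is on the cgroup v2 unified hierarchy:

$ scontrol show config | grep -Ei 'ProctrackType|TaskPlugin'
$ stat -fc %T /sys/fs/cgroup
$ srun --cpus-per-task=4 grep Cpus_allowed_list /proc/self/status

stat should report cgroup2fs for the unified hierarchy, and once ConstrainCores is in effect the srun check should list only the 4 allocated CPUs instead of 0-95.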