[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Dietmar Rieder via slurm-users

Hi Loris,

On 4/30/24 4:26 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,

Dietmar Rieder via slurm-users  writes:


Hi Loris,

On 4/30/24 3:43 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,
Dietmar Rieder via slurm-users 
writes:


Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,
Dietmar Rieder via slurm-users 
writes:


Hi,

is it possible to have Slurm schedule jobs automatically, according to
the "-t" time requirement, to a fitting partition?

e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00
DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00
DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00
DefaultTime=24:00:00 State=UP OverSubscribe=NO


So in the standard partition, which is the default, we have all nodes
and a max time of 4h; in the medium partition we have 4 nodes with a
max time of 24h; and in the long partition we have 2 nodes with a max
time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on
any node (standard partition), whereas when specifying -t 05:00:00 or
-t 48:00:00 the job will run on the nodes of the medium or long
partition respectively.

However, my job will not get scheduled at all when -t is greater than
01:00:00

i.e.

]$ srun --cpus-per-task 1 -t 01:00:01  --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

it will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium
partition.

Do I misunderstand something there? Or can this be somehow configured?

You can specify multiple partitions, e.g.
   $ salloc --cpus-per-task=1 --time=01:00:01
--partition=standard,medium,long
Notice that rather than using 'srun ... --pty bash', as far as I
understand, the preferred method is to use 'salloc' as above, and to use
'srun' for starting MPI processes.


Thanks for the hint. This works nicely, but it would be nice if I
did not need to specify the partition at all. Any thoughts?

I am not aware that you can set multiple partitions as a default.


Diego suggested a possible way which seems to work after a quick test.


Yes, I wasn't aware of that, but it might be useful for us, too.


The question is why you actually need partitions with different
maximum
runtimes.


we would like to have only a subset of the nodes in a partition for
long-running jobs, so that there are enough nodes available for short
jobs.

The nodes of the long partition, however, are also part of the short
partition, so they can also be utilized when no long jobs are running.

That's our idea


If you have plenty of short running jobs, that is probably a reasonable
approach.  On our system, the number of short running jobs would
probably tend to dip significantly over the weekend and public holidays,
so resources would potentially be blocked for the long running jobs.  On
the other hand, long-running jobs on our system often run for days, so
one day here or there might not be so significant.  And if the
long-running jobs were able to start in the short partition, they could
block short jobs.

The other thing to think about with regard to short jobs is backfilling.
With our mix of jobs, unless a job needs a large amount of memory or
number of cores, those with a run-time of only a few hours should be
backfilled fairly efficiently.
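
(For reference, backfill behaviour is controlled by the backfill scheduler
settings in slurm.conf; a minimal sketch with illustrative values, not an
actual config from this thread:)

SchedulerType=sched/backfill
# look up to 3 days ahead, keep scanning the queue, test up to 1000 jobs per cycle
SchedulerParameters=bf_window=4320,bf_continue,bf_max_job_test=1000

Short jobs with accurate -t values are exactly what the backfill scheduler
can slot into such a window.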



you are absolutely right, and I guess we will need to optimize using QoS.
Thanks for your input and thoughts.




Regards

Loris


In our case, a university cluster with a very wide range of codes
and
usage patterns, multiple partitions would probably lead to fragmentation
and wastage of resources due to the job mix not always fitting well to
the various partitions.  Therefore, I am a member of the "as few
partitions as possible" camp and so in our set-up we have as essentially
only one partition with a DefaultTime of 14 days.  We do however let
users set a QOS to gain a priority boost in return for accepting a
shorter run-time and a reduced maximum number of cores.


we didn't look into QOS yet, but this might also be a way to go, thanks.


Occasionally people complain about short jobs having to wait in the
queue for too long, but I have generally been successful in solving the
problem by having them estimate their resource requirements better or
bundling their work in order to increase the run-time-to-wait-time
ratio.




Dietmar





[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Dietmar Rieder via slurm-users

Dear Thomas,

the QoS approach seems really helpful; we'll look into it.
Perhaps, as a starting point for us, could you translate my
simple example into a QoS config/setting?


Thanks so much
  Dietmar

On 4/30/24 4:26 PM, Thomas Hartmann via slurm-users wrote:

Hi Dietmar,

I was facing quite similar requirements to yours. We ended up using QoS 
instead of partitions because this approach provides higher flexibility 
and more features. The basic distinction between the two approaches is 
that partitions are node-based while QoS are (essentially)
resource-based. So, instead of saying "Long jobs can only run on nodes 9
and 10" you would be able to say "Long jobs can only use X CPU cores in
total".
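
(For illustration, a minimal sacctmgr sketch of such a setup; the QoS names,
the cpu=64 limit and the user name are placeholders, not a tested config:)

# one QoS per time class; MaxWall caps the run time,
# GrpTRES=cpu=... caps the cores all jobs in that QoS may use in total
sacctmgr add qos short
sacctmgr modify qos short set MaxWall=04:00:00
sacctmgr add qos long
sacctmgr modify qos long set MaxWall=336:00:00 GrpTRES=cpu=64

# make the QoS available, then select it at submit time instead of a partition
sacctmgr modify user where name=someuser set qos+=short,long
sbatch --qos=long -t 48:00:00 job.sh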


However, yes, your partition-based approach is going to do the job, as
long as you do not need any QoS-based preemption.


Cheers,

Thomas

On 30.04.24 at 16:00, Dietmar Rieder via slurm-users wrote:

Hi Loris,

On 4/30/24 3:43 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,

Dietmar Rieder via slurm-users  writes:


Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,
Dietmar Rieder via slurm-users 
writes:


Hi,

is it possible to have Slurm schedule jobs automatically, according to
the "-t" time requirement, to a fitting partition?

e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00
DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00
DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00
DefaultTime=24:00:00 State=UP OverSubscribe=NO


So in the standard partition, which is the default, we have all nodes
and a max time of 4h; in the medium partition we have 4 nodes with a
max time of 24h; and in the long partition we have 2 nodes with a max
time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on
any node (standard partition), whereas when specifying -t 05:00:00 or
-t 48:00:00 the job will run on the nodes of the medium or long
partition respectively.

However, my job will not get scheduled at all when -t is greater than
01:00:00

i.e.

]$ srun --cpus-per-task 1 -t 01:00:01  --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

it will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium
partition.

Do I misunderstand something there? Or can this be somehow configured?

You can specify multiple partitions, e.g.
  $ salloc --cpus-per-task=1 --time=01:00:01
--partition=standard,medium,long
Notice that rather than using 'srun ... --pty bash', as far as I
understand, the preferred method is to use 'salloc' as above, and 
to use

'srun' for starting MPI processes.


Thanks for the hint. This works nicely, but it would be nice if I
did not need to specify the partition at all. Any thoughts?


I am not aware that you can set multiple partitions as a default.


Diego suggested a possible way which seems to work after a quick test.



The question is why you actually need partitions with different maximum
runtimes.


we would like to have only a subset of the nodes in a partition for
long-running jobs, so that there are enough nodes available for short
jobs.


The nodes of the long partition, however, are also part of the short
partition, so they can also be utilized when no long jobs are running.


That's our idea




In our case, a university cluster with a very wide range of codes and
usage patterns, multiple partitions would probably lead to fragmentation
and wastage of resources due to the job mix not always fitting well to
the various partitions.  Therefore, I am a member of the "as few
partitions as possible" camp and so in our set-up we have as essentially
only one partition with a DefaultTime of 14 days.  We do however let
users set a QOS to gain a priority boost in return for accepting a
shorter run-time and a reduced maximum number of cores.


we didn't look into QOS yet, but this might also be a way to go, thanks.


Occasionally people complain about short jobs having to wait in the
queue for too long, but I have generally been successful in solving the
problem by having them estimate their resource requirements better or
bundling their work in order to increase the run-time-to-wait-time
ratio.







[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Dietmar Rieder via slurm-users

Hi Loris,

On 4/30/24 3:43 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,

Dietmar Rieder via slurm-users  writes:


Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,
Dietmar Rieder via slurm-users 
writes:


Hi,

is it possible to have Slurm schedule jobs automatically, according to
the "-t" time requirement, to a fitting partition?

e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00
DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00
DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00
DefaultTime=24:00:00 State=UP OverSubscribe=NO


So in the standard partition, which is the default, we have all nodes
and a max time of 4h; in the medium partition we have 4 nodes with a
max time of 24h; and in the long partition we have 2 nodes with a max
time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on
any node (standard partition), whereas when specifying -t 05:00:00 or
-t 48:00:00 the job will run on the nodes of the medium or long
partition respectively.

However, my job will not get scheduled at all when -t is greater than
01:00:00

i.e.

]$ srun --cpus-per-task 1 -t 01:00:01  --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

it will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium
partition.

Do I misunderstand something there? Or can this be somehow configured?

You can specify multiple partitions, e.g.
  $ salloc --cpus-per-task=1 --time=01:00:01
--partition=standard,medium,long
Notice that rather than using 'srun ... --pty bash', as far as I
understand, the preferred method is to use 'salloc' as above, and to use
'srun' for starting MPI processes.


Thanks for the hint. This works nicely, but it would be nice if I
did not need to specify the partition at all. Any thoughts?


I am not aware that you can set multiple partitions as a default.


Diego suggested a possible way which seems to work after a quick test.



The question is why you actually need partitions with different maximum
runtimes.


we would like to have only a subset of the nodes in a partition for
long-running jobs, so that there are enough nodes available for short jobs.


The nodes of the long partition, however, are also part of the short
partition, so they can also be utilized when no long jobs are running.


That's our idea




In our case, a university cluster with a very wide range of codes and
usage patterns, multiple partitions would probably lead to fragmentation
and wastage of resources due to the job mix not always fitting well to
the various partitions.  Therefore, I am a member of the "as few
partitions as possible" camp and so in our set-up we have as essentially
only one partition with a DefaultTime of 14 days.  We do however let
users set a QOS to gain a priority boost in return for accepting a
shorter run-time and a reduced maximum number of cores.


we didn't look into QOS yet, but this might also be a way to go, thanks.


Occasionally people complain about short jobs having to wait in the
queue for too long, but I have generally been successful in solving the
problem by having them estimate their resource requirements better or
bundling their work in order to increase the run-time-to-wait-time
ratio.



Dietmar





[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Dietmar Rieder via slurm-users

Hi Diego,

thanks a lot, it seems to work as far as I have been able to test so far.
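
(A quick way to double-check, once slurmctld has been restarted with those
two lines, that they are active and to see which partitions a job submitted
without -p ends up in; just a sketch, not output from this thread:)

scontrol show config | grep -E 'EnforcePartLimits|JobSubmitPlugins'
sbatch -t 05:00:00 --wrap 'sleep 60'
squeue -o '%.10i %.20P %.10l'   # the partition column should now list the matching partition(s)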

Dietmar

On 4/30/24 3:24 PM, Diego Zuccato via slurm-users wrote:

Try adding to the config:
EnforcePartLimits=ANY
JobSubmitPlugins=all_partitions

Diego

On 30/04/2024 15:11, Dietmar Rieder via slurm-users wrote:

Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,

Dietmar Rieder via slurm-users  writes:


Hi,

is it possible to have Slurm schedule jobs automatically, according to
the "-t" time requirement, to a fitting partition?

e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00
DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00
DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00
DefaultTime=24:00:00 State=UP OverSubscribe=NO


So in the standard partition, which is the default, we have all nodes
and a max time of 4h; in the medium partition we have 4 nodes with a
max time of 24h; and in the long partition we have 2 nodes with a max
time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on
any node (standard partition), whereas when specifying -t 05:00:00 or
-t 48:00:00 the job will run on the nodes of the medium or long
partition respectively.

However, my job will not get scheduled at all when -t is greater than
01:00:00

i.e.

]$ srun --cpus-per-task 1 -t 01:00:01  --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

it will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium
partition.

Do I misunderstand something there? Or can this be somehow configured?


You can specify multiple partitions, e.g.
   $ salloc --cpus-per-task=1 --time=01:00:01 
--partition=standard,medium,long


Notice that rather than using 'srun ... --pty bash', as far as I
understand, the preferred method is to use 'salloc' as above, and to use
'srun' for starting MPI processes.


Thanks for the hint. This works nicely, but it would be nice if I
did not need to specify the partition at all. Any thoughts?



Dietmar











[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Dietmar Rieder via slurm-users

Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,

Dietmar Rieder via slurm-users  writes:


Hi,

is it possible to have Slurm schedule jobs automatically, according to
the "-t" time requirement, to a fitting partition?

e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00
DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00
DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00
DefaultTime=24:00:00 State=UP OverSubscribe=NO


So in the standard partition, which is the default, we have all nodes
and a max time of 4h; in the medium partition we have 4 nodes with a
max time of 24h; and in the long partition we have 2 nodes with a max
time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on
any node (standard partition), whereas when specifying -t 05:00:00 or
-t 48:00:00 the job will run on the nodes of the medium or long
partition respectively.

However, my job will not get scheduled at all when -t is greater than
01:00:00

i.e.

]$ srun --cpus-per-task 1 -t 01:00:01  --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

it will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium
partition.

Do I misunderstand something there? Or can this be somehow configured?


You can specify multiple partitions, e.g.
  
   $ salloc --cpus-per-task=1 --time=01:00:01 --partition=standard,medium,long


Notice that rather than using 'srun ... --pty bash', as far as I
understand, the preferred method is to use 'salloc' as above, and to use
'srun' for starting MPI processes.


Thanks for the hint. This works nicely, but it would be nice if I
did not need to specify the partition at all. Any thoughts?
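
(The same multi-partition request also works as a batch directive; a minimal
sketch, with my_program.sh as a placeholder:)

#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --cpus-per-task=1
# list several partitions; Slurm uses whichever can start the job earliest
#SBATCH --partition=standard,medium,long

srun ./my_program.sh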



Dietmar





[slurm-users] scheduling according time requirements

2024-04-30 Thread Dietmar Rieder via slurm-users

Hi,

is it possible to have Slurm schedule jobs automatically, according to
the "-t" time requirement, to a fitting partition?


e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 
DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 
DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 
DefaultTime=24:00:00 State=UP OverSubscribe=NO



So in the standard partition, which is the default, we have all nodes and
a max time of 4h; in the medium partition we have 4 nodes with a max
time of 24h; and in the long partition we have 2 nodes with a max time of
336h.


I was hoping that if I submit a job with -t 01:00:00 it can be run on 
any node (standard partition), whereas when specifying -t 05:00:00 or -t 
48:00:00 the job will run on the nodes of the medium or long partition 
respectively.


However, my job will not get scheduled at all when -t is greater than 
01:00:00


i.e.

]$ srun --cpus-per-task 1 -t 01:00:01  --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

it will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium partition.


Do I misunderstand something there? Or can this be somehow configured?

Thanks so much and sorry for the naive question
   Dietmar




[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-29 Thread Dietmar Rieder via slurm-users

Hi list,

I finally got it working. I had completely overlooked that I set
OverSubscribe=EXCLUSIVE for the partition that I used for testing;
stupid me.
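
(For anyone hitting the same thing: with OverSubscribe=EXCLUSIVE every job
gets whole nodes, so the cgroup CPU limit never has anything to constrain.
A sketch of the change, assuming the partition is called "standard":)

# slurm.conf
PartitionName=standard Nodes=... OverSubscribe=NO
# or adjust the running configuration:
scontrol update PartitionName=standard OverSubscribe=NO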


sorry for the noise and thanks again for your answers

Best
   Dietmar

On 2/29/24 13:19, Dietmar Rieder via slurm-users wrote:

Hi Josef, hi list,

I have now rebuilt the RPMs from OpenHPC, but using the original sources
from version 23.11.4.


The configure command that is generated from the spec is the following:

./configure --build=x86_64-redhat-linux-gnu \
--host=x86_64-redhat-linux-gnu \
--program-prefix= \
--disable-dependency-tracking \
--prefix=/usr \
--exec-prefix=/usr \
--bindir=/usr/bin \
--sbindir=/usr/sbin \
--sysconfdir=/etc/slurm \
--datadir=/usr/share \
--includedir=/usr/include \
--libdir=/usr/lib64 \
--libexecdir=/usr/libexec \
--localstatedir=/var \
--sharedstatedir=/var/lib \
--mandir=/usr/share/man \
--infodir=/usr/share/info \
--enable-multiple-slurmd \
--with-pmix=/opt/ohpc/admin/pmix \
--with-hwloc=/opt/ohpc/pub/libs/hwloc

(Am I missing something here?)

the configure output shows:

[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]

config.log

dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1


confdefs.h.
#define WITH_CGROUP 1
#define HAVE_BPF 1
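
(A further quick check, assuming the --libdir=/usr/lib64 from the configure
call above: the v2 plugin should be among the installed plugin objects on
the compute node:)

ls /usr/lib64/slurm/ | grep cgroup
# expected to include cgroup_v2.so (and usually cgroup_v1.so)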

However, I still can't see any CPU limits when I use sbatch to run a
batch job.



$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 
'grep Cpus /proc/$$/status'


$ cat slurm-72.out
Cpus_allowed:   ,,
Cpus_allowed_list:  0-95


The logs from the head node (leto) and the compute node (apollo-01) are 
showing:


Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: 
_slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 
for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 
for UID 50001

Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: 
_start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: 
JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: 
JobId=72 done



Best
   Dietmar

On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:

 > I'm running slurm 22.05.11 which is available with OpenHPC 3.x
 > Do you think an upgrade is needed?

I feel that a lot of Slurm operators tend not to use 3rd-party sources
of Slurm binaries, as you do not have the build environment fully in
your hands.


But before making such a complex decision, perhaps look for the build
logs of the Slurm you use (somewhere in the OpenHPC build system?) and
check if it was built with the libraries needed to have cgroups v2 working.


Not having the cgroups v2 dependencies at build time is only one of the
possible causes.


josef










[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-29 Thread Dietmar Rieder via slurm-users

Hi Josef, hi list,

I have now rebuilt the RPMs from OpenHPC, but using the original sources
from version 23.11.4.


The configure command that is generated from the spec is the following:

./configure --build=x86_64-redhat-linux-gnu \
--host=x86_64-redhat-linux-gnu \
--program-prefix= \
--disable-dependency-tracking \
--prefix=/usr \
--exec-prefix=/usr \
--bindir=/usr/bin \
--sbindir=/usr/sbin \
--sysconfdir=/etc/slurm \
--datadir=/usr/share \
--includedir=/usr/include \
--libdir=/usr/lib64 \
--libexecdir=/usr/libexec \
--localstatedir=/var \
--sharedstatedir=/var/lib \
--mandir=/usr/share/man \
--infodir=/usr/share/info \
--enable-multiple-slurmd \
--with-pmix=/opt/ohpc/admin/pmix \
--with-hwloc=/opt/ohpc/pub/libs/hwloc

(Am I missing something here?)

the configure output shows:

[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]

config.log

dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1


confdefs.h.
#define WITH_CGROUP 1
#define HAVE_BPF 1

However, I still can't see any CPU limits when I use sbatch to run a
batch job.



$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 
'grep Cpus /proc/$$/status'


$ cat slurm-72.out
Cpus_allowed:   ,,
Cpus_allowed_list:  0-95


The logs from the head node (leto) and the compute node (apollo-01) are 
showing:


Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: 
_slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 
for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 
for UID 50001

Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: 
_start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: 
JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: 
JobId=72 done



Best
  Dietmar

On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:

 > I'm running slurm 22.05.11 which is available with OpenHPC 3.x
 > Do you think an upgrade is needed?

I feel that a lot of Slurm operators tend not to use 3rd-party sources of
Slurm binaries, as you do not have the build environment fully in your
hands.


But before making such a complex decision, perhaps look for the build
logs of the Slurm you use (somewhere in the OpenHPC build system?) and
check if it was built with the libraries needed to have cgroups v2 working.


Not having the cgroups v2 dependencies at build time is only one of the
possible causes.


josef






[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users

Hi,

I'm running Slurm 22.05.11, which is available with OpenHPC 3.x.
Do you think an upgrade is needed?

Best
  Dietmar

On 2/28/24 14:55, Josef Dvoracek via slurm-users wrote:

Hi Dietmar;

I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently..

I must say that on my setup it looks like it works as expected; see the
grepped stdout from your reproducer below.


I use a recent Slurm, 23.11.4.

Wild guess: does your build machine have the bpf and dbus devel packages installed?
(Both packages are fine to be absent when doing a build for cgroups v1
Slurm.)
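
(On an EL-family build host a quick check would be something like the
following; package names are the RHEL/Rocky ones and may differ on other
distros: dbus-devel provides the dbus-1 development files, kernel-headers
provides linux/bpf.h:)

rpm -q dbus-devel kernel-headers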


cheers

josef

[jose@koios1 test_cgroups]$ cat slurm-7177217.out | grep eli
ValueError: CPU number 7 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 4 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 5 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 11 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 9 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 10 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 14 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 8 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 12 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 6 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 13 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 15 is not eligible; choose between [0, 1, 2, 3]
[jose@koios1 test_cgroups]$

On 28. 02. 24 14:28, Dietmar Rieder via slurm-users wrote:
...









[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users

Hi Hermann,

I get:

Cpus_allowed:   ,,
Cpus_allowed_list:  0-95

Best
   Dietmar

P.S.: best regards from the CCB

On 2/28/24 15:01, Hermann Schwärzler via slurm-users wrote:

Hi Dietmar,

what do you find in the output file of this job:

sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'

On our 64-core machines with hyperthreading enabled I see e.g.

Cpus_allowed:   0400,,0400,
Cpus_allowed_list:  58,122

Greetings
Hermann


On 2/28/24 14:28, Dietmar Rieder via slurm-users wrote:

Hi,

I'm new to Slurm, but maybe someone can help me:

I'm trying to restrict the CPU usage to the actually 
requested/allocated resources using cgroup v2.


For this I made the following settings in slurm.conf:


ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98


cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)


# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids


Now, when I use sbatch to submit the following test script, the Python 
script that is started from the batch script utilizes all CPUs 
(96) at 100% on the allocated node, although I only ask for 4 CPUs 
(--cpus-per-task=4). I'd expect that the task cannot use more than 
these 4.


#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin


source .bashrc
conda activate test
python test.py


The Python code in test.py is the following, using the 
cpu_load_generator package from [1]:


#!/usr/bin/env python

import sys
from cpu_load_generator import load_single_core, load_all_cores, from_profile


load_all_cores(duration_s=120, target_load=1)  # generates load on all cores



Interestingly, when I use srun to launch an interactive job and run 
the Python script manually, I see with top that only 4 CPUs are 
running at 100%. And I also see Python errors thrown when the script 
tries to start the 5th process (which makes sense):


  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core
    process.cpu_affinity([core_num])
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity
    self._proc.cpu_affinity_set(list(set(cpus)))
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set
    cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument


What am I missing? Why are the CPU resources not restricted when I use 
sbatch?



Thanks for any input or hint
    Dietmar

[1]: https://pypi.org/project/cpu-load-generator/






--
_
D i e t m a r  R i e d e r
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402 | Mobile: +43 676 8716 72402
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






[slurm-users] sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users

Hi,

I'm new to Slurm, but maybe someone can help me:


I'm trying to restrict the CPU usage to the actually requested/allocated 
resources using cgroup v2.


For this I made the following settings in slurm.conf:


ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98


cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)


# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids
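
(One more thing worth checking from inside an allocation is what the job's
own cgroup actually allows; a sketch, assuming a pure cgroup v2 setup where
/proc/self/cgroup contains a single "0::" line:)

srun --cpus-per-task=4 sh -c \
  'cat /proc/self/cgroup; cat /sys/fs/cgroup$(cut -d: -f3 /proc/self/cgroup)/cpuset.cpus.effective'

With ConstrainCores=yes working, the second line should list only the
allocated CPUs.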


Now, when I use sbatch to submit the following test script, the Python 
script that is started from the batch script utilizes all CPUs (96) 
at 100% on the allocated node, although I only ask for 4 CPUs 
(--cpus-per-task=4). I'd expect that the task cannot use more than these 4.


#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin


source .bashrc
conda activate test
python test.py


The Python code in test.py is the following, using the cpu_load_generator 
package from [1]:


#!/usr/bin/env python

import sys
from cpu_load_generator import load_single_core, load_all_cores, from_profile


load_all_cores(duration_s=120, target_load=1)  # generates load on all cores


Interestingly, when I use srun to launch an interactive job and run the 
Python script manually, I see with top that only 4 CPUs are running at 
100%. And I also see Python errors thrown when the script tries to start 
the 5th process (which makes sense):


  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core
    process.cpu_affinity([core_num])
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity
    self._proc.cpu_affinity_set(list(set(cpus)))
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set
    cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument


What am I missing? Why are the CPU resources not restricted when I use 
sbatch?



Thanks for any input or hint
   Dietmar

[1]: https://pypi.org/project/cpu-load-generator/



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com