Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Davide DelVento
I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0

The new version has (among many things) a really nice view of individual
jobs' resource utilization (GPUs, memory, CPU, temperature, etc.). I did not
pay attention to the overall statistics, so I am not sure how CV fares
there -- I care only about individual jobs (I work with individual
users and don't deal with overall utilization, which is info for the
upper management). At the moment only admins can see the info, but my
understanding is that they are considering making it available to users,
which will be really slick.

Several years ago I used XDMOD and Supremm; they were more confusing to use
and had trouble collecting all the data we needed (which the team blamed
on some BIOS settings), so the view was incomplete. Also, the tool seemed
to be more focused on the overall stats rather than per-job info (both were
available, but the focus seemed to be on the former). I am sure these tools have
improved since then, so I'm not dismissing them, just giving my opinion
based on old facts. Comparing that old version of XDMOD to current CV
(unfair, I know, but that's the comparison I've got), the latter wins hands
down for per-job information. Also probably unfair is that XDMOD and
Supremm are free and open source whereas CV is proprietary.


On Mon, Jul 24, 2023 at 2:57 PM Magnus Jonsson 
wrote:

> We are feeding job usage information into a Prometheus database for our
> users (and us) to look at (via Grafana).
>
> It is also possible to get a list of jobs that are under-using memory, GPU
> or whatever metric you feed into the database.
>
>
>
> It’s a live feed with ~30s resolution from both compute jobs and Lustre
> file system.
>
> It’s easy to extend with more metrics.
>
>
>
> If you want more information on what we are doing just send me an email
> and I can give you more information.
>
>
>
> /Magnus
>
>
>
> --
>
> Magnus Jonsson, Developer, HPC2N, Umeå Universitet
>
> By sending an email to Umeå University, the University will need to
>
> process your personal data. For more information, please read
> www.umu.se/en/gdpr
>
> *From:* slurm-users  *On behalf of *Will
> Furnell - STFC UKRI
> *Sent:* Monday, 24 July 2023 16:38
> *To:* slurm-us...@schedmd.com
> *Subject:* [slurm-users] Tracking efficiency of all jobs on the cluster
> (dashboard etc.)
>
>
>
> Hello,
>
>
>
> I am aware of ‘seff’, which allows you to check the efficiency of a single
> job, which is good for users, but as a cluster administrator I would like
> to be able to track the efficiency of all jobs from all users on the
> cluster, so I am able to ‘re-educate’ users that may be running jobs that
> have terrible resource usage efficiency.
>
>
>
> What do other cluster administrators use for this task? Is there anything
> you use and recommend (or don’t recommend) or have heard of that is able to
> do this? Even if it’s something like a Grafana dashboard that hooks up to
> the SLURM database,
>
>
>
> Thank you,
>
>
>
> Will.
>


Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-07-24 Thread Cristóbal Navarro
Hello Angel and Community,
I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on
Ubuntu 22.04 LTS) and Slurm 23.02.
When I start the `slurmd` service, its status shows failed with the
information below.
As of today, what is the best solution to this problem? I am really not
sure whether the DGX A100 could break if I disable cgroups v1.
Any suggestions are welcome.

➜  slurm-23.02.3 systemctl status slurmd.service

× slurmd.service - Slurm node daemon
 Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
preset: enabled)
 Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04;
7s ago
Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
(code=exited, status=1/FAILURE)
   Main PID: 3680019 (code=exited, status=1/FAILURE)
CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file
re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2:
hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2:
hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2:
hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128
Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured
socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is
not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited,
code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result
'exit-code'.
➜  slurm-23.02.3



On Wed, May 3, 2023 at 6:32 PM Angel de Vicente 
wrote:

> Hello,
>
> Angel de Vicente  writes:
>
> > ,
> > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> > | 5:freezer:/
> > | 3:cpuacct:/
> > `
>
> in the end I learnt that, despite Ubuntu 22.04 reporting that it used
> only cgroup v2, it was also using v1 and creating those mount points,
> and then Slurm 23.02.01 was complaining that it could not work with
> cgroups in hybrid mode.
>
> So, the "solution" (as long as you don't need v1 for some reason) was to
> add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
> mount points, and Slurm was happy with that.
>
> [In case somebody is interested in the future: I needed this so that I
> could limit the resources given to users not going through Slurm. We have
> some shared workstations with many cores, and users were oversubscribing
> the CPUs, so I installed Slurm to bring some order to the runs there.
> But these machines are not an actual cluster with a login node:
> the login node is the same as the executing node! So with cgroups I
> ensure that users connecting via ssh get only the equivalent of
> 3/4 of a core (enough to edit files, etc.) until they submit their
> jobs via Slurm, at which point they get the full allocation they requested.]
>
> Cheers,
> --
> Ángel de Vicente
>  Research Software Engineer (Supercomputing and BigData)
>  Tel.: +34 922-605-747
>  Web.: http://research.iac.es/proyecto/polmag/
>
>  GPG: 0x8BDC390B69033F52
>
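
For reference, a minimal sketch of how the quoted kernel-parameter change is
usually applied on a stock Ubuntu 22.04 install (this assumes GRUB manages the
kernel command line; check how DGX OS handles boot parameters before trying it
on the A100 box):

  # append cgroup_no_v1=all to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
  $ sudo update-grub
  $ sudo reboot
  # after the reboot, confirm that no v1 controllers are mounted any more:
  $ mount | grep cgroup
  $ cat /sys/fs/cgroup/cgroup.controllers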


-- 
Cristóbal A. Navarro


Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Angel de Vicente
Hello,

Matthew Brown  writes:

> Minimum  memory required per allocated CPU. ... Note that if the job's
> --mem-per-cpu value exceeds the configured MaxMemPerCPU, then  the
> user's  limit  will be treated as a memory limit per task

Ah, thanks, I should've read the documentation more carefully.

From my limited tests today, somehow in the interactive queue all seems
OK now, but not so in the 'batch' queue. For example, I just submitted
three jobs with different numbers of CPUs per job (4, 8 and 16 processes
respectively). MaxMemPerCPU is set to 2GB, and these jobs run the
'stress' command, consuming 3GB per process.

,
| [user@xxx test]$ squeue
|  JOBID PARTITION  NAME  USER ST  TIME  TIME_LIMIT  CPUS     QOS  ACCOUNT  NODELIST(REASON)
| 127564     batch  test  user  R  9:25       15:00    16  normal  ddgroup  xxx
| 127562     batch  test  user  R  9:25       15:00     4  normal  ddgroup  xxx
| 127563     batch  test  user  R  9:25       15:00     8  normal  ddgroup  xxx
`


It looks like Slurm is trying to kill the jobs, but somehow not all the
processes die (as you can see below, 2 out of the 4 processes in job
127562 are still there after 9 minutes, 3 of the 8 processes in job
127563 and 6 of the 16 processes in job 127564):

,
| [user@xxx test]$ ps -fea | grep stress
| user   1853317 1853314  0 22:35 ?      00:00:00 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user   1853319 1853317 66 22:35 ?      00:06:17 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user   1853320 1853317 65 22:35 ?      00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user   1853321 1853317 65 22:35 ?      00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user   1853328 1853317 65 22:35 ?      00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user   1853329 1853317 65 22:35 ?      00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user   1853338 1853337  0 22:35 ?      00:00:00 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user   1853340 1853338 68 22:35 ?      00:06:32 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user   1853341 1853338 69 22:35 ?      00:06:34 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user   1853347 1853316  0 22:35 ?      00:00:00 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
| user   1853350 1853347 68 22:35 ?      00:06:29 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
| user   1854560 1511070  0 22:45 pts/2  00:00:00 grep stress
`

And these processes are truly using 3GB:

,
| [user@xxx test]$ ps -v 1853319
| PID TTY  STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
| 1853319 ?  R    6:25   864211 3149428 3146040  1.1 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
`

Any idea how to solve/debug this?

Many thanks,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




[slurm-users] Partition not allowing subaccount use

2023-07-24 Thread Groner, Rob
I've set up a partition THING with AllowAccounts=stuff.  I then use sacctmgr to
create the stuff account and a mystuff account whose parent is stuff.  My
understanding is that this makes mystuff a subaccount of stuff.

The description of AllowAccounts in the partition definition section of
slurm.conf says:

Comma-separated list of accounts which may execute jobs in the partition. The 
default value is "ALL". This list is also hierarchical, meaning subaccounts are 
included automatically.

However, when I try to submit a job using --account=mystuff, it gets
rejected, and the reason given in slurmctld.log is "Job's account not
permitted to use this partition (THING allows stuff not mystuff)".

Am I not understanding what constitutes a subaccount?  When I run "sacctmgr
show assoc tree", I see that mystuff is under stuff.

Or am I misreading the documentation that says that subaccounts are included in 
who is allowed to use the partition?
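
For reference, the relevant state can be double-checked with something like
this (account and partition names as above):

  scontrol show partition THING | grep -i allowaccounts   # what slurmctld currently allows
  sacctmgr show assoc tree format=account,user            # the account hierarchy in the database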

Thanks,

Rob



Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Magnus Jonsson
We are feeding job usage information into a Prometheus database for our users 
(and us) to look at (via Grafana).
It is also possible to get a list of jobs that are under-using memory, GPU or
whatever metric you feed into the database.

It’s a live feed with ~30s resolution from both compute jobs and Lustre file 
system.
It’s easy to extend with more metrics.
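
As a toy illustration of the idea (not our actual exporter: it assumes the
node_exporter textfile collector, K-suffixed MaxRSS values, and needs to run
as root or the SlurmUser):

  #!/bin/bash
  # publish the current MaxRSS of each running job step on this node as Prometheus gauges
  out=/var/lib/node_exporter/textfile_collector/slurm_jobs.prom
  {
      echo '# TYPE slurm_step_max_rss_kbytes gauge'
      for jobid in $(squeue -h -w "$(hostname -s)" -t R -o '%A'); do
          sstat -a -n -P -j "${jobid}" -o JobID,MaxRSS | while IFS='|' read -r stepid rss; do
              [ -n "${rss}" ] && echo "slurm_step_max_rss_kbytes{step=\"${stepid}\"} ${rss%K}"
          done
      done
  } > "${out}.tmp" && mv "${out}.tmp" "${out}"

Grafana can then compare such gauges against the jobs' requested memory to
flag the under-using ones.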

If you want more information on what we are doing just send me an email and I 
can give you more information.

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
By sending an email to Umeå University, the University will need to
process your personal data. For more information, please read 
www.umu.se/en/gdpr
From: slurm-users  On behalf of Will Furnell -
STFC UKRI
Sent: Monday, 24 July 2023 16:38
To: slurm-us...@schedmd.com
Subject: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard
etc.)

Hello,

I am aware of ‘seff’, which allows you to check the efficiency of a single job, 
which is good for users, but as a cluster administrator I would like to be able 
to track the efficiency of all jobs from all users on the cluster, so I am able 
to ‘re-educate’ users that may be running jobs that have terrible resource 
usage efficiency.

What do other cluster administrators use for this task? Is there anything you 
use and recommend (or don’t recommend) or have heard of that is able to do 
this? Even if it’s something like a Grafana dashboard that hooks up to the 
SLURM database,

Thank you,

Will.


Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Matthew Brown
I use seff all the time as a first order approximation. It's a good hint at
what's going on with a job but doesn't give much detail.
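
For a quick and dirty sweep, seff can also be looped over recent jobs straight
from sacct -- a rough sketch (assumes GNU date; sacct -a needs suitable
privileges):

  # run seff over every job that completed since yesterday
  for jobid in $(sacct -a -X -n -S "$(date -d yesterday +%F)" -E now --state=COMPLETED -o JobID --parsable2); do
      seff "${jobid}"
  done

That still only gives per-job summary numbers, though, not a time series.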

We are in the process of integrating the Supremm node-utilization capture
tool with our clusters and with our local XDMOD installation. Plain old
XDMOD can ingest the Slurm logs and give you some great information on
utilization, but it generally has more of a high-level or summary perspective
on stats. To help users see their personal job efficiency, you really need to
give them time-series data, and we're expecting to get that with the
Supremm components.

The other angle, which I've recently asked our eng/admin team to try to
implement on our newest cluster (yet to be released), is to turn on the
bits that Slurm has built in for job profiling. With this properly
configured, users can turn on job profiling with a Slurm job option and
it will produce that time-series data. Look for the AcctGatherProfileType
configuration for slurm.conf.
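
A rough sketch of what that looks like (parameter names are from the
acct_gather documentation; the directory and job id are placeholders):

  # slurm.conf:        AcctGatherProfileType=acct_gather_profile/hdf5
  # acct_gather.conf:  ProfileHDF5Dir=/shared/slurm_profile
  #                    ProfileHDF5Default=None    # profiling off unless a job opts in
  sbatch --profile=task my_job.sh                 # user opts in per job
  sh5util -j <jobid> -o job_profile.h5            # merge the per-node HDF5 files afterwards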

Best,

Matt

Matthew Brown
Computational Scientist
Advanced Research Computing
Virginia Tech


On Mon, Jul 24, 2023 at 10:39 AM Will Furnell - STFC UKRI <
will.furn...@stfc.ac.uk> wrote:

> Hello,
>
>
>
> I am aware of ‘seff’, which allows you to check the efficiency of a single
> job, which is good for users, but as a cluster administrator I would like
> to be able to track the efficiency of all jobs from all users on the
> cluster, so I am able to ‘re-educate’ users that may be running jobs that
> have terrible resource usage efficiency.
>
>
>
> What do other cluster administrators use for this task? Is there anything
> you use and recommend (or don’t recommend) or have heard of that is able to
> do this? Even if it’s something like a Grafana dashboard that hooks up to
> the SLURM database,
>
>
>
> Thank you,
>
>
>
> Will.
>


Re: [slurm-users] MPI_Init_thread error

2023-07-24 Thread Fatih Ertinaz
Hi Aziz,

This seems like an MPI environment issue rather than a Slurm problem.

Make sure that the MPI modules are loaded as well. You can see the list of
loaded modules via `module list`. This should tell you whether SU2's dependencies
are available in your runtime. If they are not loaded implicitly, you need
to load them before you load SU2. You can then check with commands like
`which mpirun` or `mpirun -V` that you have a proper MPI runtime environment.

By the way, even if your case runs fine, you won't be able to benefit from
MPI because you're allocating a single task (--ntasks-per-node=1).
Instead, request the whole node and use all physical cores (or run a
scalability analysis first and make a decision based on that).
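
Something along these lines, for example (the module names and core count are
placeholders for whatever your site provides):

  $ srun -p defq --nodes=1 --ntasks-per-node=16 --time=01:00:00 --pty bash -i
  $ module load gcc/8.5.0 openmpi/4.0.3 su2/7.5.1   # placeholder module names
  $ module list                                     # confirm the MPI module really is loaded
  $ which mpirun && mpirun -V                       # confirm the runtime SU2 was built against
  $ mpirun -np ${SLURM_NTASKS} SU2_CFD config.cfg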

Hope this helps

Fatih

On Mon, Jul 24, 2023 at 10:44 AM Aziz Ogutlu 
wrote:

> Hi there all,
> We're using Slurm 21.08 on Redhat 7.9 HPC cluster with OpenMPI 4.0.3 + gcc
> 8.5.0.
> When we run command below for call SU2, we get an error message:
>
> $ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
> $ module load su2/7.5.1
> $ SU2_CFD config.cfg
>
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [cnode003.hpc:17534] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
>
> --
> Best regards,
> Aziz Öğütlü
>
> Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  www.eduline.com.tr
> Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
> Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
> Tel : +90 212 324 60 61 Cep: +90 541 350 40 72
>
>


[slurm-users] MPI_Init_thread error

2023-07-24 Thread Aziz Ogutlu

Hi there all,
We're using Slurm 21.08 on a RedHat 7.9 HPC cluster with OpenMPI 4.0.3 +
gcc 8.5.0.

When we run the commands below to call SU2, we get an error message:

$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cnode003.hpc:17534] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not able
to guarantee that all other processes were killed!


--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Will Furnell - STFC UKRI
Hello,

I am aware of 'seff', which allows you to check the efficiency of a single job, 
which is good for users, but as a cluster administrator I would like to be able 
to track the efficiency of all jobs from all users on the cluster, so I am able 
to 're-educate' users that may be running jobs that have terrible resource 
usage efficiency.

What do other cluster administrators use for this task? Is there anything you 
use and recommend (or don't recommend) or have heard of that is able to do 
this? Even if it's something like a Grafana dashboard that hooks up to the 
SLURM database,

Thank you,

Will.


Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Matthew Brown
Slurm will allocate more CPUs to cover the memory requirement. Use
sacct's query fields to compare requested vs. allocated resources:

$ scontrol show part normal_q | grep MaxMem
   DefMemPerCPU=1920 MaxMemPerCPU=1920

$ srun -n 1 --mem-per-cpu=4000 --partition=normal_q --account=arcadm
hostname
srun: job 1577313 queued and waiting for resources
srun: job 1577313 has been allocated resources
tc095

$ sacct -j 1577313 -o jobid,reqtres%35,alloctres%35
       JobID                             ReqTRES                           AllocTRES
------------ ----------------------------------- -----------------------------------
1577313         billing=1,cpu=1,mem=4000M,node=1    billing=3,cpu=3,mem=4002M,node=1
1577313.ext+                                        billing=3,cpu=3,mem=4002M,node=1
1577313.0                                                     cpu=3,mem=4002M,node=1

From the Slurm manuals (e.g. man srun):
 --mem-per-cpu=<size>[units]
Minimum memory required per allocated CPU. ... Note that if the job's
--mem-per-cpu value exceeds the configured MaxMemPerCPU, then the user's
limit will be treated as a memory limit per task

On Mon, Jul 24, 2023 at 9:32 AM Groner, Rob  wrote:

> I'm not sure I can help with the rest, but the EnforcePartLimits setting
> will only reject a job at submission time that exceeds partition
> limits, not overall cluster limits.  I don't see anything, offhand, in the
> interactive partition definition that is exceeded by your request for 4
> GB/CPU.
>
> Rob
>
>
> --
> *From:* slurm-users on behalf of Angel de Vicente
> *Sent:* Monday, July 24, 2023 7:20 AM
> *To:* Slurm User Community List
> *Subject:* [slurm-users] MaxMemPerCPU not enforced?
>
> Hello,
>
> I'm trying to get Slurm to control the memory used per CPU, but it does
> not seem to enforce the MaxMemPerCPU option in slurm.conf
>
> This is running in Ubuntu 22.04 (cgroups v2), Slurm 23.02.3.
>
> Relevant configuration options:
>
> ,cgroup.conf
> | AllowedRAMSpace=100
> | ConstrainCores=yes
> | ConstrainRAMSpace=yes
> | ConstrainSwapSpace=yes
> | AllowedSwapSpace=0
> `
>
> ,slurm.conf
> | TaskPlugin=task/affinity,task/cgroup
> | PrologFlags=X11
> |
> | SelectType=select/cons_res
> | SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
> | MaxMemPerCPU=500
> | DefMemPerCPU=200
> |
> | JobAcctGatherType=jobacct_gather/linux
> |
> | EnforcePartLimits=ALL
> |
> | NodeName=xxx RealMemory=257756 Sockets=4 CoresPerSocket=8
> ThreadsPerCore=1 Weight=1
> |
> | PartitionName=batch   Nodes=duna State=UP Default=YES
> MaxTime=2-00:00:00 MaxCPUsPerNode=32 OverSubscribe=FORCE:1
> | PartitionName=interactive Nodes=duna State=UP Default=NO
> MaxTime=08:00:00   MaxCPUsPerNode=32 OverSubscribe=FORCE:2
> `
>
>
> I can ask for an interactive session with 4GB/CPU (I would have thought
> that "EnforcePartLimits=ALL" would stop me from doing that), and once
> I'm in the interactive session I can execute a 3GB test code without any
> issues (I can see with htop that the process does indeed use a RES size
> of 3GB at 100% CPU use). Any idea what could be the problem or how to
> start debugging this?
>
> ,
> | [angelv@xxx test]$ sinter -n 1 --mem-per-cpu=4000
> | salloc: Granted job allocation 127544
> | salloc: Nodes xxx are ready for job
> |
> | (sinter) [angelv@xxx test]$ stress -m 1 -t 600 --vm-keep --vm-bytes 3G
> | stress -m 1 -t 600 --vm-keep --vm-bytes 3G
> | stress: info: [1772392] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
> `
>
> Many thanks,
> --
> Ángel de Vicente
>  Research Software Engineer (Supercomputing and BigData)
>  Tel.: +34 922-605-747
>  Web.: http://research.iac.es/proyecto/polmag/
>
>  GPG: 0x8BDC390B69033F52
>


Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Groner, Rob
I'm not sure I can help with the rest, but the EnforcePartLimits setting will 
only reject a job at submission time that exceeds partition limits, not
overall cluster limits.  I don't see anything, offhand, in the interactive 
partition definition that is exceeded by your request for 4 GB/CPU.

Rob



From: slurm-users on behalf of Angel de Vicente
Sent: Monday, July 24, 2023 7:20 AM
To: Slurm User Community List
Subject: [slurm-users] MaxMemPerCPU not enforced?

Hello,

I'm trying to get Slurm to control the memory used per CPU, but it does
not seem to enforce the MaxMemPerCPU option in slurm.conf

This is running in Ubuntu 22.04 (cgroups v2), Slurm 23.02.3.

Relevant configuration options:

,cgroup.conf
| AllowedRAMSpace=100
| ConstrainCores=yes
| ConstrainRAMSpace=yes
| ConstrainSwapSpace=yes
| AllowedSwapSpace=0
`

,slurm.conf
| TaskPlugin=task/affinity,task/cgroup
| PrologFlags=X11
|
| SelectType=select/cons_res
| SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
| MaxMemPerCPU=500
| DefMemPerCPU=200
|
| JobAcctGatherType=jobacct_gather/linux
|
| EnforcePartLimits=ALL
|
| NodeName=xxx RealMemory=257756 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 
Weight=1
|
| PartitionName=batch   Nodes=duna State=UP Default=YES MaxTime=2-00:00:00 
MaxCPUsPerNode=32 OverSubscribe=FORCE:1
| PartitionName=interactive Nodes=duna State=UP Default=NO  MaxTime=08:00:00   
MaxCPUsPerNode=32 OverSubscribe=FORCE:2
`


I can ask for an interactive session with 4GB/CPU (I would have thought
that "EnforcePartLimits=ALL" would stop me from doing that), and once
I'm in the interactive session I can execute a 3GB test code without any
issues (I can see with htop that the process does indeed use a RES size
of 3GB at 100% CPU use). Any idea what could be the problem or how to
start debugging this?

,
| [angelv@xxx test]$ sinter -n 1 --mem-per-cpu=4000
| salloc: Granted job allocation 127544
| salloc: Nodes xxx are ready for job
|
| (sinter) [angelv@xxx test]$ stress -m 1 -t 600 --vm-keep --vm-bytes 3G
| stress -m 1 -t 600 --vm-keep --vm-bytes 3G
| stress: info: [1772392] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
`

Many thanks,
--
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52


[slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Angel de Vicente
Hello,

I'm trying to get Slurm to control the memory used per CPU, but it does
not seem to enforce the MaxMemPerCPU option in slurm.conf

This is running in Ubuntu 22.04 (cgroups v2), Slurm 23.02.3.

Relevant configuration options:

,cgroup.conf
| AllowedRAMSpace=100
| ConstrainCores=yes
| ConstrainRAMSpace=yes
| ConstrainSwapSpace=yes
| AllowedSwapSpace=0
`

,slurm.conf
| TaskPlugin=task/affinity,task/cgroup
| PrologFlags=X11
| 
| SelectType=select/cons_res
| SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
| MaxMemPerCPU=500
| DefMemPerCPU=200
| 
| JobAcctGatherType=jobacct_gather/linux
| 
| EnforcePartLimits=ALL
| 
| NodeName=xxx RealMemory=257756 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 
Weight=1
| 
| PartitionName=batch   Nodes=duna State=UP Default=YES MaxTime=2-00:00:00 
MaxCPUsPerNode=32 OverSubscribe=FORCE:1
| PartitionName=interactive Nodes=duna State=UP Default=NO  MaxTime=08:00:00   
MaxCPUsPerNode=32 OverSubscribe=FORCE:2
`


I can ask for an interactive session with 4GB/CPU (I would have thought
that "EnforcePartLimits=ALL" would stop me from doing that), and once
I'm in the interactive session I can execute a 3GB test code without any
issues (I can see with htop that the process does indeed use a RES size
of 3GB at 100% CPU use). Any idea what could be the problem or how to
start debugging this?

,
| [angelv@xxx test]$ sinter -n 1 --mem-per-cpu=4000
| salloc: Granted job allocation 127544
| salloc: Nodes xxx are ready for job
| 
| (sinter) [angelv@xxx test]$ stress -m 1 -t 600 --vm-keep --vm-bytes 3G
| stress -m 1 -t 600 --vm-keep --vm-bytes 3G
| stress: info: [1772392] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
`
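
A couple of low-level checks that can show whether any limit actually reached
the process -- a sketch for a pure cgroup v2 node like this one, with the
cgroup path taken from /proc rather than assumed (the limit may also sit on a
parent directory):

  $ scontrol show job 127544 | grep -iE 'mem|tres'   # what limit did Slurm record?
  $ pid=$(pgrep -n -x stress)                        # pick one of the stress workers
  $ cat /proc/${pid}/cgroup                          # which cgroup is it actually running in?
  $ cat /sys/fs/cgroup$(cut -d: -f3 /proc/${pid}/cgroup)/memory.max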

Many thanks,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




Re: [slurm-users] Custom Gres for SSD

2023-07-24 Thread Shunran Zhang

Hi Matthias,

Thank you for your info. The prolog/epilog way of managing it does look 
quite promising.


Indeed, in my setup I only want one job per node per SSD set. Our tasks
that require the scratch space are more IO-bound -- we are more worried
about IO usage than about actual disk space usage, and that is the
reason why we only define the ssd gres with a count of 1 per 2-disk RAID 0.
For those IO-bound operations, even if each job uses only 5% of the
available disk space, the IO on the disk becomes the bottleneck, resulting in
both jobs running 2x slower and processes stuck in D state, which is what I am
trying to prevent. Also, as those IO-bound jobs are usually submitted by
a single user in a batch, a user-based approach might not be
adequate either.


I am considering modifying your script so that, by default, the scratch
space is world-writable but everyone except root has a quota of 0, and
the prolog lifts that quota. This way, when a user forgets to specify
--gres=ssd:1, the job will fail with an IO error and they will
immediately know what went wrong.


I am also thinking of a GPU-like, cgroup-based solution. Maybe if I limit
device access to, say, /dev/sda, it would also stop the user from
accessing the mount point of /dev/sda -- I am not sure, so I will also
test this approach out...


I will investigate it a bit more.

Sincerely,

S. Zhang

On 2023/07/24 17:06, Matthias Loose wrote:

On 2023-07-24 09:50, Matthias Loose wrote:

Hi Shunran,

just read your question again. If you don't want users to share the
SSD, like at all, even if both have requested it, you can basically skip
the quota part of my answer.


If you really only want one user per SSD per node, you should set the
gres variable in the node configuration to 1 just like you did, and
then implement the prolog/epilog solution (without quotas). If the
mounted SSD can only be written to by root, no one else can use it, and
the job that requested it gets a folder created by the prolog.


What we also do is export the folder name in the user/task prolog to
the environment so the user can easily use it.


Our task prolog:

  #!/bin/bash
#PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"

  SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

  # check for /local job dir
  if [[ -d ${SLURM_TMPDIR} ]]; then
    # set tempdir env vars
    echo "export SLURM_TMPDIR=${SLURM_TMPDIR}"
    echo "export TMPDIR=${SLURM_TMPDIR}"
    echo "export JAVA_TOOL_OPTIONS=\"-Djava.io.tmpdir=${SLURM_TMPDIR}\""
  fi

Kind regards, Matt


Hi Shunran,

we do something very similar. I have nodes with 2 SSDs in a RAID 1
mounted on /local. We defined a gres resource just like you and
called it local. We define the resource in the gres.conf like this:

  # LOCAL
  NodeName=hpc-node[01-10] Name=local

and add the resource in counts of GB to the slurm.nodes.conf:

  NodeName=hpc-node01  CPUs=256 RealMemory=... Gres=local:3370

So in this case node01 has 3370 counts or GB of the gres "local"
available for reservation. Now Slurm tracks that resource for you and
users can reserve counts of /local space. But there is still one big
problem: Slurm has no idea what local is and, as you correctly noted,
others can just use it. I solved this the following way:

- /local is owned by root, so no user can just write to it
- the node prolog creates a folder in /local in this form:
/local/job_<jobid> and makes the job owner of it
- the node epilog deletes that folder

This way you have already solved the problem of people/jobs that have not
reserved any local space using it. But there is still no enforcement of
limits. For that I use quotas.
My /local is XFS formatted and XFS has a nifty feature called project
quotas, where you can set a quota for a folder.

This is my node prolog script for this purpose:

  #!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"
  local_job=0

  ## DETERMINE GRES:LOCAL
  # get job gres
  JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" | cut -d '=' -f 2 | tr ',' ' ')

  # parse for local
  for gres in ${JOB_TRES}; do
    key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
    if [[ ${key} == "local" ]]; then
  local_job=$(echo ${gres} | cut -d ':' -f 3)
  break
    fi
  done

  # make job local-dir if requested
  if [[ ${local_job} -ne 0 ]]; then
    # make local-dir for job
    SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
    mkdir ${SLURM_TMPDIR}

    # conversion
    local_job=$((local_job * 1024 * 1024))

    # set hard limit to requested size + 5%
    hard_limit=$((local_job * 105 / 100))

    # create project quota and set limits
    xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" 
${local_dir}

    xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k
${SLURM_JOBID}" ${local_dir}

    chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
    chmod 750 ${SLURM_TMPDIR}
  fi

  exit 0

This is my 

Re: [slurm-users] Custom Gres for SSD

2023-07-24 Thread Matthias Loose

On 2023-07-24 09:50, Matthias Loose wrote:

Hi Shunran,

just read your question again. If you don't want users to share the
SSD, like at all, even if both have requested it, you can basically skip
the quota part of my answer.


If you really only want one user per SSD per node, you should set the
gres variable in the node configuration to 1 just like you did, and
then implement the prolog/epilog solution (without quotas). If the
mounted SSD can only be written to by root, no one else can use it, and
the job that requested it gets a folder created by the prolog.


What we also do is export the folder name in the user/task prolog to
the environment so the user can easily use it.


Our task prolog:

  #!/bin/bash
  #PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"

  SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

  # check for /local job dir
  if [[ -d ${SLURM_TMPDIR} ]]; then
# set tempdir env vars
echo "export SLURM_TMPDIR=${SLURM_TMPDIR}"
echo "export TMPDIR=${SLURM_TMPDIR}"
echo "export JAVA_TOOL_OPTIONS=\"-Djava.io.tmpdir=${SLURM_TMPDIR}\""
  fi

Kind regards, Matt


Hi Shunran,

we do something very similar. I have nodes with 2 SSDs in a RAID 1
mounted on /local. We defined a gres resource just like you and
called it local. We define the resource in the gres.conf like this:

  # LOCAL
  NodeName=hpc-node[01-10] Name=local

and add the resource in counts of GB to the slurm.nodes.conf:

  NodeName=hpc-node01  CPUs=256 RealMemory=... Gres=local:3370

So in this case node01 has 3370 counts or GB of the gres "local"
available for reservation. Now Slurm tracks that resource for you and
users can reserve counts of /local space. But there is still one big
problem: Slurm has no idea what local is and, as you correctly noted,
others can just use it. I solved this the following way:

- /local is owned by root, so no user can just write to it
- the node prolog creates a folder in /local in this form:
/local/job_<jobid> and makes the job owner of it
- the node epilog deletes that folder

This way you have already solved the problem of people/jobs that have not
reserved any local space using it. But there is still no enforcement of
limits. For that I use quotas.
My /local is XFS formatted and XFS has a nifty feature called project
quotas, where you can set a quota for a folder.

This is my node prolog script for this purpose:

  #!/bin/bash
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"
  local_job=0

  ## DETERMINE GRES:LOCAL
  # get job gres
  JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" | cut -d '=' -f 2 | tr ',' ' ')

  # parse for local
  for gres in ${JOB_TRES}; do
key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
if [[ ${key} == "local" ]]; then
  local_job=$(echo ${gres} | cut -d ':' -f 3)
  break
fi
  done

  # make job local-dir if requested
  if [[ ${local_job} -ne 0 ]]; then
# make local-dir for job
SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
mkdir ${SLURM_TMPDIR}

# conversion
local_job=$((local_job * 1024 * 1024))

# set hard limit to requested size + 5%
hard_limit=$((local_job * 105 / 100))

# create project quota and set limits
xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" 
${local_dir}

xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k
${SLURM_JOBID}" ${local_dir}

chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
chmod 750 ${SLURM_TMPDIR}
  fi

  exit 0

This is my epilog:

  #!/bin/bash
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"
  SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

  # remove the quota
  xfs_quota -x -c "limit -p bsoft=0m bhard=0m ${SLURM_JOBID}" 
${local_dir}


  # remove the folder
  if [[ -d ${SLURM_TMPDIR} ]]; then
rm -rf --one-file-system ${SLURM_TMPDIR}
  fi

  exit 0

In order to use project quotas you need to enable them with the pquota
mount flag in the fstab.
I give the user 5% more than requested, so you just have to make sure
that you configure (available space - 5%) in the nodes.conf.

This is what we do and it works great.

Kind regards, Matt


On 2023-07-24 05:48, Shunran Zhang wrote:

Hi all,

I am attempting to set up a gres to manage jobs that need a
scratch space, but only a few of our computational nodes are
equipped with an SSD for such scratch space. Originally I set up a new
partition for those IO-bound jobs, but it turned out that those jobs
might be allocated to the same node, fighting each other for
IO.

Looking over the other settings, the gres approach appears
promising. However, I am having some difficulty figuring
out how to limit access to that space to the jobs that requested
--gres=ssd:1.

For now I am using Flags=CountOnly and trusting users who use the SSD
to request it, but apparently any job submitted to a node with an
SSD can just use the space. Our scratch space 

Re: [slurm-users] Custom Gres for SSD

2023-07-24 Thread Matthias Loose

Hi Shunran,

we do something very similar. I have nodes with 2 SSDs in a RAID 1
mounted on /local. We defined a gres resource just like you and called
it local. We define the resource in the gres.conf like this:


  # LOCAL
  NodeName=hpc-node[01-10] Name=local

and add the resource in counts of GB to the slurm.nodes.conf:

  NodeName=hpc-node01  CPUs=256 RealMemory=... Gres=local:3370

So in this case node01 has 3370 counts or GB of the gres "local"
available for reservation. Now Slurm tracks that resource for you and
users can reserve counts of /local space. But there is still one big
problem: Slurm has no idea what local is and, as you correctly noted,
others can just use it. I solved this the following way:


- /local is owned by root, so no user can just write to it
- the node prolog creates a folder in /local in this form:
/local/job_<jobid> and makes the job owner of it

- the node epilog deletes that folder

This way you have already solved the problem of people/jobs that have not
reserved any local space using it. But there is still no enforcement of
limits. For that I use quotas.
My /local is XFS formatted, and XFS has a nifty feature called project
quotas, where you can set a quota for a folder.


This is my node prolog script for this purpose:

  #!/bin/bash
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"
  local_job=0

  ## DETERMINE GRES:LOCAL
  # get job gres
  JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" | cut -d '=' -f 2 | tr ',' ' ')


  # parse for local
  for gres in ${JOB_TRES}; do
key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
if [[ ${key} == "local" ]]; then
  local_job=$(echo ${gres} | cut -d ':' -f 3)
  break
fi
  done

  # make job local-dir if requested
  if [[ ${local_job} -ne 0 ]]; then
# make local-dir for job
SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
mkdir ${SLURM_TMPDIR}

# conversion
local_job=$((local_job * 1024 * 1024))

# set hard limit to requested size + 5%
hard_limit=$((local_job * 105 / 100))

# create project quota and set limits
xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" 
${local_dir}
xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k 
${SLURM_JOBID}" ${local_dir}


chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
chmod 750 ${SLURM_TMPDIR}
  fi

  exit 0

This is my epilog:

  #!/bin/bash
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  local_dir="/local"
  SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

  # remove the quota
  xfs_quota -x -c "limit -p bsoft=0m bhard=0m ${SLURM_JOBID}" 
${local_dir}


  # remove the folder
  if [[ -d ${SLURM_TMPDIR} ]]; then
rm -rf --one-file-system ${SLURM_TMPDIR}
  fi

  exit 0

In order to use project quotas you need to enable them with the pquota
mount flag in the fstab.
I give the user 5% more than requested, so you just have to make sure
that you configure (available space - 5%) in the nodes.conf.
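
As a usage illustration, a job script against this setup could look as follows
(the gres name and the /local/job_<jobid> folder come from the configuration
and node prolog above; the application line is a placeholder):

  #!/bin/bash
  #SBATCH --gres=local:100               # request 100 GB of node-local scratch
  #SBATCH --ntasks=1

  scratch="/local/job_${SLURM_JOB_ID}"   # folder created (and quota'd) by the node prolog
  cp input.dat "${scratch}/"             # stage input into the quota-limited job folder
  cd "${scratch}"
  ./my_io_heavy_tool input.dat           # placeholder application
  cp results.dat "${SLURM_SUBMIT_DIR}/"  # copy results back before the epilog removes the folder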


This is what we do and it works great.

Kind regards, Matt


On 2023-07-24 05:48, Shunran Zhang wrote:

Hi all,

I am attempting to set up a gres to manage jobs that need a
scratch space, but only a few of our computational nodes are
equipped with an SSD for such scratch space. Originally I set up a new
partition for those IO-bound jobs, but it turned out that those jobs
might be allocated to the same node, fighting each other for
IO.

Looking over the other settings, the gres approach appears
promising. However, I am having some difficulty figuring
out how to limit access to that space to the jobs that requested
--gres=ssd:1.

For now I am using Flags=CountOnly and trusting users who use the SSD
to request it, but apparently any job submitted to a node with an
SSD can just use the space. Our scratch space implementation is 2
disks (sda and sdb) formatted as btrfs in RAID 0. What should I
do to enforce a limit on which jobs can use this space?

Related configurations for reference:

gres.conf:   NodeName=scratch-1 Name=ssd Flags=CountOnly
cgroup.conf: ConstrainDevices=yes
slurm.conf:  GresTypes=gpu,ssd
             NodeName=scratch-1 CPUs=88 Sockets=2 CoresPerSocket=22 ThreadsPerCore=2 RealMemory=18 Gres=ssd:1 State=UNKNOWN
Sincerely,
S. Zhang