[slurm-users] Multi Regional computing

2023-02-02 Thread Eunsong Goh
Hi,

I just finished setting up a cluster that consists of multi-regional cloud and
on-premise servers.

My Slurm cluster environment is as follows, and I want to run jobs on a
combination of worker nodes from multiple regions.

The Slurm master server was created in the GCP KR region.
Worker node #1 was created in the same region as the Slurm master server and
has 2 NVIDIA T4 GPUs.
Worker node #2 was created in the GCP US region and has 2 NVIDIA T4 GPUs.
Worker node #3 is one of the on-premise servers and has 8 NVIDIA T4 GPUs.

In this environment, can I run a single Slurm job that combines server #1's 2
GPUs with server #2's 2 GPUs, or server #1's 2 GPUs with on-premise server #3?

In my tests so far, multi-regional GPU combinations failed: the jobs ran only
on a single region's worker node.
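
For reference, this is roughly the kind of submission I tried (a sketch only;
the job name, script, and GRES counts are illustrative):

#!/bin/bash
#SBATCH --job-name=multi-region-test    # illustrative name
#SBATCH --nodes=2                       # e.g. worker #1 + worker #2
#SBATCH --gres=gpu:2                    # 2 GPUs per node
#SBATCH --ntasks-per-node=1
srun ./my_gpu_job.sh                    # placeholder for the actual GPU workload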

Are there any mechanisms or rules governing how multiple worker nodes are
combined, and how priority is determined when selecting among them?

Thanks


Re: [slurm-users] [ext] Enforce gpu usage limits (with GRES?)

2023-02-02 Thread Analabha Roy
Hi,

Thanks for the reply. Yes, your advice helped! Much obliged. Not only was the
cgroups config necessary, but the option

ConstrainDevices=yes

in cgroup.conf was needed to enforce the GPU GRES. Now, omitting a gres
parameter from srun causes GPU jobs to fail. An improvement!
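
For reference, my cgroup.conf now looks roughly like this (a minimal sketch,
following Manuel's example below; exact paths may differ on other setups):

==> /etc/slurm/cgroup.conf <==
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
ConstrainDevices=yes            # restrict device access to the GRES allocated to the job
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"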

However, I still can't keep GPU jobs out of the "CPU" partition. Is there a way
to link a partition to a GRES, or something similar?

Alternatively, can I define two node names in slurm.conf that point to the same
physical node, where only one of them has the GPU GRES? That way, I could link
the GPU partition to the GRES-configured node name only.
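
Something like the following is what I have in mind (purely an untested sketch;
the alias node names are hypothetical, and I don't know whether Slurm even
allows two node records to point at the same slurmd):

==> /etc/slurm/slurm.conf (sketch, untested) <==
NodeName=shavak-cpu NodeHostname=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=95311
NodeName=shavak-gpu NodeHostname=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
PartitionName=CPU Nodes=shavak-cpu Default=YES MaxTime=INFINITE State=UP
PartitionName=GPU Nodes=shavak-gpu Default=NO MaxTime=INFINITE State=UP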

Thanks in advance,
AR

*PS*: If the slurm devs are reading this, may I suggest adding a reference to
cgroups to the gres documentation page?

On Thu, 2 Feb 2023 at 16:52, Holtgrewe, Manuel <manuel.holtgr...@bih-charite.de> wrote:

> Hi,
>
>
> if by "share the GPU" you mean exclusive allocation to a single job then,
> I believe, you are missing cgroup configuration for isolating access to the
> GPU.
>
>
> Below the relevant parts (I believe) of our configuration.
>
>
> There also is a way of time- and space-slice GPUs but I guess you should
> get things setup without slicing.
>
>
> I hope this helps.
>
>
> Manuel
>
>
> ==> /etc/slurm/cgroup.conf <==
> # https://bugs.schedmd.com/show_bug.cgi?id=3701
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=yes
> AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
>
> ==> /etc/slurm/cgroup_allowed_devices_file.conf <==
> /dev/null
> /dev/urandom
> /dev/zero
> /dev/sda*
> /dev/cpu/*/*
> /dev/pts/*
> /dev/nvidia*
>
> ==> /etc/slurm/slurm.conf <==
>
> ProctrackType=proctrack/cgroup
>
> # Memory is enforced via cgroups, so we should not do this here by [*]
> #
> # /etc/slurm/cgroup.conf: ConstrainRAMSpace=yes
> #
> # [*] https://bugs.schedmd.com/show_bug.cgi?id=5262
> JobAcctGatherParams=NoOverMemoryKill
>
> TaskPlugin=task/cgroup
>
> JobAcctGatherType=jobacct_gather/cgroup
>
>
> --
> Dr. Manuel Holtgrewe, Dipl.-Inform.
> Bioinformatician
> Core Unit Bioinformatics – CUBI
> Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in
> the Helmholtz Association / Charité – Universitätsmedizin Berlin
>
> Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
> Postal Address: Chariteplatz 1, 10117 Berlin
>
> E-Mail: manuel.holtgr...@bihealth.de
> Phone: +49 30 450 543 607
> Fax: +49 30 450 7 543 901
> Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de
> www.charite.de
> --
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Wednesday, February 1, 2023 6:12:40 PM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [ext] [slurm-users] Enforce gpu usage limits (with GRES?)
>
> Hi,
>
> I'm new to slurm, so I apologize in advance if my question seems basic.
>
> I just purchased a single node 'cluster' consisting of one 64-core cpu and
> an nvidia rtx5k gpu (Turing architecture, I think). The vendor supplied it
> with ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the
> config to suit the needs of my department.
>
> I'm trying to bone up on GRES scheduling by reading this manual page, but am
> confused about some things.
>
> My slurm.conf file has the following lines put in it by the vendor:
>
> ###
> # COMPUTE NODES
> GresTypes=gpu
> NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32
> ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
> #PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE  State=UP
>
> PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE  State=UP
> #
>
> So they created two partitions that are essentially identical. Secondly,
> they put just the following line in gres.conf:
>
> ###
> NodeName=shavak-DIT400TR-55L  Name=gpu File=/dev/nvidia0
> ###
>
> That's all. However, this configuration does not appear to constrain
> anyone in any manner. As a regular user, I can still use srun or sbatch to
> start GPU jobs from the "CPU partition," and nvidia-smi says that a simple
> cupy script that multiplies matrices and starts as an
> sbatch job in the CPU partition can access the gpu just fine. Note that the
> environment variable "CUDA_VISIBLE_DEVICES" does not appear to be set in
> any job step. I tested this by starting an interactive srun shell in both
> CPU and GPU partition and running ''echo $CUDA_VISIBLE_DEVICES" and got
> bupkis for both.
>
>
> What I need to do is constrain jobs to using chunks of GPU Cores/RAM so
> that multiple jobs can share the GPU.
>
> As I understand from the gres manpage, simply adding "AutoDetect=nvml"
> (NVML should be installed with the NVIDIA HPC SDK, right? I installed it
> with apt-get...) in gres.conf should allow Slurm to detect the GPU's

Re: [slurm-users] [ext] Enforce gpu usage limits (with GRES?)

2023-02-02 Thread Holtgrewe, Manuel
Hi,


if by "share the GPU" you mean exclusive allocation to a single job then, I 
believe, you are missing cgroup configuration for isolating access to the GPU.


Below the relevant parts (I believe) of our configuration.


There is also a way to time- and space-slice GPUs, but I would suggest getting
things set up without slicing first.
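
In case you do want to slice later, MPS sharing looks roughly like this (a
sketch based on the Slurm gres/MPS documentation, not taken from our own setup):

==> gres.conf (MPS sketch) <==
Name=gpu File=/dev/nvidia0
Name=mps Count=100              # expose the GPU as 100 shareable MPS units

==> slurm.conf (MPS sketch) <==
GresTypes=gpu,mps
# the node definition then carries Gres=gpu:1,mps:100,
# and a job requests a share with e.g.:  srun --gres=mps:25 ...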


I hope this helps.


Manuel


==> /etc/slurm/cgroup.conf <==
# https://bugs.schedmd.com/show_bug.cgi?id=3701
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"

==> /etc/slurm/cgroup_allowed_devices_file.conf <==
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*


==> /etc/slurm/slurm.conf <==

ProctrackType=proctrack/cgroup

# Memory is enforced via cgroups, so we should not do this here by [*]
#
# /etc/slurm/cgroup.conf: ConstrainRAMSpace=yes
#
# [*] https://bugs.schedmd.com/show_bug.cgi?id=5262
JobAcctGatherParams=NoOverMemoryKill

TaskPlugin=task/cgroup

JobAcctGatherType=jobacct_gather/cgroup


--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de

From: slurm-users  on behalf of Analabha 
Roy 
Sent: Wednesday, February 1, 2023 6:12:40 PM
To: slurm-users@lists.schedmd.com
Subject: [ext] [slurm-users] Enforce gpu usage limits (with GRES?)

Hi,

I'm new to slurm, so I apologize in advance if my question seems basic.

I just purchased a single node 'cluster' consisting of one 64-core cpu and an 
nvidia rtx5k gpu (Turing architecture, I think). The vendor supplied it with 
ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the config to suit 
the needs of my department.

I'm trying to bone up on GRES scheduling by reading this manual 
page, but am confused about some things.

My slurm.conf file has the following lines put in it by the vendor:

###
# COMPUTE NODES
GresTypes=gpu
NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32 
ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
#PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE  State=UP

PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE  State=UP
#

So they created two partitions that are essentially identical. Secondly, they 
put just the following line in gres.conf:

###
NodeName=shavak-DIT400TR-55L  Name=gpu File=/dev/nvidia0
###

That's all. However, this configuration does not appear to constrain anyone in
any manner. As a regular user, I can still use srun or sbatch to start GPU jobs
from the "CPU" partition, and nvidia-smi confirms that a simple cupy script that
multiplies matrices, run as an sbatch job in the CPU partition, can access the
GPU just fine. Note that the environment variable CUDA_VISIBLE_DEVICES does not
appear to be set in any job step. I tested this by starting an interactive srun
shell in both the CPU and GPU partitions and running "echo
$CUDA_VISIBLE_DEVICES", and got bupkis for both.


What I need to do is constrain jobs to using chunks of GPU Cores/RAM so that 
multiple jobs can share the GPU.

As I understand from the gres manpage, simply adding "AutoDetect=nvml" (NVML
should come with the NVIDIA HPC SDK, right? I installed it with apt-get...) to
gres.conf should allow Slurm to detect the GPU's internal specifications
automatically. Is that all, or do I need to configure an mps GRES as well? Will
that succeed in walling the GPU off from jobs that don't request any gres
parameters (perhaps by setting CUDA_VISIBLE_DEVICES), or is additional config
needed for that? Do I really need that extra "GPU" partition that the vendor
put in for any of this, or is there a way to bind GRES resources to a
particular partition in such a way that simply launching jobs in that partition
is enough?
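
Concretely, what I'm thinking of trying in gres.conf is just the following (an
untested sketch, assuming slurmd can find the NVML library on the node):

==> gres.conf (sketch, untested) <==
AutoDetect=nvml                 # let slurmd query the GPU via NVML
# or keep the explicit definition alongside it:
# NodeName=shavak-DIT400TR-55L Name=gpu File=/dev/nvidia0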

Thanks for your attention.
Regards
AR

--
Analabha Roy
Assistant Professor
Department of Physics
The University of Burdwan
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, 
a...@phys.buruniv.ac.in, 
hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/