Re: [slurm-users] Core reserved/bound to a GPU

2020-09-04 Thread Manuel Bertrand

On 01/09/2020 06:36, Chris Samuel wrote:

On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:


Everything works great so far, but now I would like to bind a specific
core to each GPU on each node. By "bind" I mean to make a particular
core not assignable to a CPU-only job, so that the GPU stays available
whatever the CPU workload on the node.

What I've done in the past (waves to Swinburne folks on the list) was to have
overlapping partitions on GPU nodes, where the GPU job partition had access to
all the cores and the CPU-only job partition had access to only a subset
(limited by the MaxCPUsPerNode parameter on the partition).



Thanks for this suggestion, but it leads to another problem:
the total number of cores differs quite a bit between nodes, ranging from
12 to 20.
Since the MaxCPUsPerNode parameter is enforced on every node in the
partition, I would have to set it for the GPU node with the fewest cores
(here 12 cores with 2 GPUs, so 2 cores to reserve: MaxCPUsPerNode=10),
and so I would lose up to 10 cores on the 20-core nodes :(


What do you think of the idea of enforcing this only on the "Default"
partition (GPU + CPU nodes), so that a user who needs the full set of cores
must explicitly specify a partition, i.e. "cpu" or "gpu"?

Here is my current partitions declaration:
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP


So instead of enforcing the limit directly on the CPU partition and adding
all the GPU nodes to it, I would do it on the "Default" one (here named
"all") like this:
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP MaxCPUsPerNode=10
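
With that in place, jobs in the default "all" partition are capped at 10 cores
per node, and a user who really needs more cores on a node has to name a
partition explicitly. As an illustrative sketch only (the job script name and
the core/GPU counts here are made up):

sbatch --partition=cpu --ntasks-per-node=16 job.sh     # CPU-only job needing more than 10 cores on a node
sbatch --partition=gpu --gres=gpu:1 --ntasks=4 job.sh  # GPU job, no per-node core cap in that partition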


It seems quite hackish...




Re: [slurm-users] Core reserved/bound to a GPU

2020-08-31 Thread Chris Samuel
On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:

> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean to make a particular
> core not assignable to a CPU-only job, so that the GPU stays available
> whatever the CPU workload on the node.

What I've done in the past (waves to Swinburne folks on the list) was to have
overlapping partitions on GPU nodes, where the GPU job partition had access to
all the cores and the CPU-only job partition had access to only a subset
(limited by the MaxCPUsPerNode parameter on the partition).
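
As a minimal sketch of that layout (the node name, core count and GPU count
below are invented for illustration, not taken from your cluster):

# 20-core node with 4 GPUs; keep 4 cores out of reach of CPU-only jobs
NodeName=gpunodeX CPUs=20 Gres=gpu:4 State=UNKNOWN
# GPU partition: jobs here can use every core on the node
PartitionName=gpu Nodes=gpunodeX Default=NO State=UP
# CPU-only partition: capped at 16 cores per node, leaving 4 for GPU jobs
PartitionName=cpu Nodes=gpunodeX MaxCPUsPerNode=16 Default=NO State=UP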

The problem you run into there, though, is that there's no way to reserve cores
on a particular socket, which causes problems for folks who care about locality
for GPU codes: their jobs can wait in the queue with GPUs free and cores free, but
not the right cores on the right socket to be able to use the GPUs. :-(

Here's my bug from when I was in Australia for this issue where I suggested a 
MaxCPUsPerSocket parameter for partitions:

https://bugs.schedmd.com/show_bug.cgi?id=4717
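
The idea is a per-socket analogue of MaxCPUsPerNode, something along these
purely hypothetical lines (not a real Slurm parameter, as far as I know):

# hypothetical: cap CPU-only jobs at 8 cores on each socket of a 2 x 10-core node
PartitionName=cpu Nodes=gpunodeX MaxCPUsPerSocket=8 Default=NO State=UP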

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Core reserved/bound to a GPU

2020-08-31 Thread Stephan Schott
Hi,
I'm also very interested in how this could be done properly. At the moment
what we are doing is setting up partitions with MaxCPUsPerNode set to
CPUs minus GPUs. Maybe this can help you in the meantime, but it is a
suboptimal solution (in fact we have nodes with different numbers of CPUs,
so we had to make a partition per "node type"). Someone else may have a
better idea.
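To make that concrete, here is a minimal sketch of the per-node-type approach
(the partition names, node names, core counts and GPU counts are invented, not
our real configuration):

# 20-core nodes with 4 GPUs: 20 - 4 = 16 cores left for CPU-only jobs
PartitionName=cpu_20c Nodes=nodeA[1-4] MaxCPUsPerNode=16 Default=NO State=UP
# 12-core nodes with 2 GPUs: 12 - 2 = 10 cores left for CPU-only jobs
PartitionName=cpu_12c Nodes=nodeB[1-4] MaxCPUsPerNode=10 Default=NO State=UP
# GPU jobs go through a partition with no per-node core cap
PartitionName=gpu Nodes=nodeA[1-4],nodeB[1-4] Default=NO State=UP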
Cheers,

On Mon, 31 Aug 2020 at 16:45, Manuel BERTRAND (<manuel.bertr...@lis-lab.fr>) wrote:

> Hi list,
>
> I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU
> cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks,
> OpenHPC team, for making those!)
> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean to make a particular
> core not assignable to a CPU-only job, so that the GPU stays available
> whatever the CPU workload on the node. I'm asking this because, as things
> stand, a CPU-only user can monopolize the whole node, preventing a GPU
> user from getting in, as there is no CPU available even though the GPU is
> free. I'm not sure what the best way to enforce this is. Hope this is
> clear :)
>
> Any help greatly appreciated !
>

[slurm-users] Core reserved/bound to a GPU

2020-08-31 Thread Manuel BERTRAND

Hi list,

I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU
cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks,
OpenHPC team, for making those!)
Everything works great so far, but now I would like to bind a specific
core to each GPU on each node. By "bind" I mean to make a particular
core not assignable to a CPU-only job, so that the GPU stays available
whatever the CPU workload on the node. I'm asking this because, as things
stand, a CPU-only user can monopolize the whole node, preventing a GPU
user from getting in, as there is no CPU available even though the GPU is
free. I'm not sure what the best way to enforce this is. Hope this is
clear :)


Any help greatly appreciated !

Here are my gres.conf, cgroup.conf, and partition configuration, followed by
the output of 'scontrol show config':


### gres.conf ###
NodeName=gpunode1 Name=gpu  File=/dev/nvidia0
NodeName=gpunode1 Name=gpu  File=/dev/nvidia1
NodeName=gpunode1 Name=gpu  File=/dev/nvidia2
NodeName=gpunode1 Name=gpu  File=/dev/nvidia3
NodeName=gpunode2 Name=gpu  File=/dev/nvidia0
NodeName=gpunode2 Name=gpu  File=/dev/nvidia1
NodeName=gpunode2 Name=gpu  File=/dev/nvidia2
NodeName=gpunode3 Name=gpu  File=/dev/nvidia0
NodeName=gpunode3 Name=gpu  File=/dev/nvidia1
NodeName=gpunode3 Name=gpu  File=/dev/nvidia2
NodeName=gpunode3 Name=gpu  File=/dev/nvidia3
NodeName=gpunode3 Name=gpu  File=/dev/nvidia4
NodeName=gpunode3 Name=gpu  File=/dev/nvidia5
NodeName=gpunode3 Name=gpu  File=/dev/nvidia6
NodeName=gpunode3 Name=gpu  File=/dev/nvidia7
NodeName=gpunode4 Name=gpu  File=/dev/nvidia0
NodeName=gpunode4 Name=gpu  File=/dev/nvidia1
NodeName=gpunode5 Name=gpu  File=/dev/nvidia0
NodeName=gpunode5 Name=gpu  File=/dev/nvidia1
NodeName=gpunode5 Name=gpu  File=/dev/nvidia2
NodeName=gpunode5 Name=gpu  File=/dev/nvidia3
NodeName=gpunode5 Name=gpu  File=/dev/nvidia4
NodeName=gpunode5 Name=gpu  File=/dev/nvidia5
NodeName=gpunode6 Name=gpu  File=/dev/nvidia0
NodeName=gpunode6 Name=gpu  File=/dev/nvidia1
NodeName=gpunode6 Name=gpu  File=/dev/nvidia2
NodeName=gpunode6 Name=gpu  File=/dev/nvidia3
NodeName=gpunode7 Name=gpu  File=/dev/nvidia0
NodeName=gpunode7 Name=gpu  File=/dev/nvidia1
NodeName=gpunode7 Name=gpu  File=/dev/nvidia2
NodeName=gpunode7 Name=gpu  File=/dev/nvidia3
NodeName=gpunode8 Name=gpu  File=/dev/nvidia0
NodeName=gpunode8 Name=gpu  File=/dev/nvidia1

### cgroup.conf ###
CgroupAutomount=yes
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no
ConstrainDevices=yes


### partitions configuration ###
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP



### Slurm configuration ###
Configuration data as of 2020-08-31T16:23:54
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = sms.mycluster
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = No
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq  = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes    = (null)
AuthInfo    = (null)
AuthType    = auth/munge
BatchStartTimeout   = 10 sec

EpilogMsgTime   = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType  = ext_sensors/none
ExtSensorsFreq  = 0 sec
FederationParameters    = (null)
FirstJobId  = 1
GetEnvTimeout   = 2 sec
GresTypes   = gpu
GpuFreqDef  = high,memory=high
GroupUpdateForce    = 1
GroupUpdateTime = 600 sec
HASH_VAL    = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram  = /usr/sbin/nhc
InactiveLimit   = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType   = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc  = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType    = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend   = 0
JobRequeue  = 1
JobSubmitPlugins    = (null)
KeepAliveTime   = SYSTEM_DEFAULT
KillOnBadExit   = 0