Re: [slurm-users] Core reserved/bound to a GPU
On 01/09/2020 06:36, Chris Samuel wrote:
> On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:
>> Everything works great so far, but now I would like to bind a specific
>> core to each GPU on each node. By "bind" I mean to make a particular
>> core not assignable to a CPU-only job, so that the GPU is available
>> whatever the CPU workload on the node.
>
> What I've done in the past (waves to Swinburne folks on the list) was to
> have overlapping partitions on GPU nodes, where the GPU job partition had
> access to all the cores and the CPU-only job partition had access to only
> a subset (limited by the MaxCPUsPerNode parameter on the partition).

Thanks for this suggestion, but it leads to another problem: the number of
cores differs quite a bit across the nodes, ranging from 12 to 20. Since
MaxCPUsPerNode is enforced on every node in the partition, I would have to
set it for the GPU node with the smallest core count (here 12 cores with
2 GPUs, so 2 cores to reserve: MaxCPUsPerNode=10), and so I would lose up
to 10 cores on the 20-core nodes :(

What do you think of the idea of enforcing this only on the "Default"
partition (GPU + CPU nodes), so that a user who needs the full core set
has to specify a partition explicitly, i.e. "cpu" / "gpu"?

Here is my current partition declaration:

PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP

So instead of enforcing the limit directly on the CPU partition and adding
all the GPU nodes to it, I would set it on the "Default" one (here named
"all") like this:

PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP MaxCPUsPerNode=10

It seems quite hackish...
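To make the intent concrete: with that layout, jobs submitted without a
partition would land in "all" and be capped at 10 cores per node, while a
user who really needs a full 20-core node would have to target the "cpu"
partition explicitly, something like:

sbatch -p cpu -n 20 job.sh

(job.sh is just a placeholder script name here.)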
Re: [slurm-users] Core reserved/bound to a GPU
On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:

> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean to make a particular
> core not assignable to a CPU-only job, so that the GPU is available
> whatever the CPU workload on the node.

What I've done in the past (waves to Swinburne folks on the list) was to
have overlapping partitions on GPU nodes, where the GPU job partition had
access to all the cores and the CPU-only job partition had access to only
a subset (limited by the MaxCPUsPerNode parameter on the partition).

The problem you run into there, though, is that there's no way to reserve
cores on a particular socket. That means problems for folks who care about
locality for GPU codes: they can wait in the queue with GPUs free and
cores free, but not the right cores on the right socket to be able to use
the GPUs. :-(

Here's my bug from when I was in Australia for this issue, where I
suggested a MaxCPUsPerSocket parameter for partitions:

https://bugs.schedmd.com/show_bug.cgi?id=4717

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
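A minimal sketch of that overlapping-partition layout, with illustrative
node names and core/GPU counts (not taken from the original cluster):

### overlapping partitions (sketch)
# gpunode1 has 12 cores and 2 GPUs. GPU jobs may use all 12 cores;
# CPU-only jobs are capped at 10, leaving 2 cores free for GPU jobs.
PartitionName=gpu Nodes=gpunode1 MaxTime=168:00:00 State=UP
PartitionName=cpu Nodes=cpunode1,gpunode1 MaxCPUsPerNode=10 MaxTime=168:00:00 State=UP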
Re: [slurm-users] Core reserved/bound to a GPU
Hi,

I'm also very interested in how this could be done properly.

At the moment what we are doing is setting up partitions with
MaxCPUsPerNode set to CPUs - GPUs. Maybe this can help you in the
meanwhile, but it is a suboptimal solution (in fact we have nodes with
different numbers of CPUs, so we had to make one partition per "node
type"). Someone else may have a better idea.

Cheers,

On Mon, 31 Aug 2020 at 16:45, Manuel BERTRAND <manuel.bertr...@lis-lab.fr> wrote:

> Hi list,
>
> I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU
> cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks,
> OpenHPC team, for making those!)
>
> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean to make a particular
> core not assignable to a CPU-only job, so that the GPU is available
> whatever the CPU workload on the node. I'm asking this because in the
> current state a CPU-only user can monopolize a whole node, preventing a
> GPU user from coming in, as there is no CPU available even if the GPU is
> free. I'm not sure what is the best way to enforce this. Hope this is
> clear :)
>
> Any help greatly appreciated!
>
> Here are my gres.conf, cgroup.conf, and partition configuration,
> followed by the output of 'scontrol show config':
>
> ### gres.conf
> NodeName=gpunode1 Name=gpu File=/dev/nvidia0
> NodeName=gpunode1 Name=gpu File=/dev/nvidia1
> NodeName=gpunode1 Name=gpu File=/dev/nvidia2
> NodeName=gpunode1 Name=gpu File=/dev/nvidia3
> NodeName=gpunode2 Name=gpu File=/dev/nvidia0
> NodeName=gpunode2 Name=gpu File=/dev/nvidia1
> NodeName=gpunode2 Name=gpu File=/dev/nvidia2
> NodeName=gpunode3 Name=gpu File=/dev/nvidia0
> NodeName=gpunode3 Name=gpu File=/dev/nvidia1
> NodeName=gpunode3 Name=gpu File=/dev/nvidia2
> NodeName=gpunode3 Name=gpu File=/dev/nvidia3
> NodeName=gpunode3 Name=gpu File=/dev/nvidia4
> NodeName=gpunode3 Name=gpu File=/dev/nvidia5
> NodeName=gpunode3 Name=gpu File=/dev/nvidia6
> NodeName=gpunode3 Name=gpu File=/dev/nvidia7
> NodeName=gpunode4 Name=gpu File=/dev/nvidia0
> NodeName=gpunode4 Name=gpu File=/dev/nvidia1
> NodeName=gpunode5 Name=gpu File=/dev/nvidia0
> NodeName=gpunode5 Name=gpu File=/dev/nvidia1
> NodeName=gpunode5 Name=gpu File=/dev/nvidia2
> NodeName=gpunode5 Name=gpu File=/dev/nvidia3
> NodeName=gpunode5 Name=gpu File=/dev/nvidia4
> NodeName=gpunode5 Name=gpu File=/dev/nvidia5
> NodeName=gpunode6 Name=gpu File=/dev/nvidia0
> NodeName=gpunode6 Name=gpu File=/dev/nvidia1
> NodeName=gpunode6 Name=gpu File=/dev/nvidia2
> NodeName=gpunode6 Name=gpu File=/dev/nvidia3
> NodeName=gpunode7 Name=gpu File=/dev/nvidia0
> NodeName=gpunode7 Name=gpu File=/dev/nvidia1
> NodeName=gpunode7 Name=gpu File=/dev/nvidia2
> NodeName=gpunode7 Name=gpu File=/dev/nvidia3
> NodeName=gpunode8 Name=gpu File=/dev/nvidia0
> NodeName=gpunode8 Name=gpu File=/dev/nvidia1
>
> ### cgroup.conf
> CgroupAutomount=yes
> TaskAffinity=no
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> ConstrainKmemSpace=no
> ConstrainDevices=yes
>
> ### partitions configuration ###
> PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
> PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
> PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP
>
> ### Slurm configuration ###
> Configuration data as of 2020-08-31T16:23:54
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost = sms.mycluster
> AccountingStorageLoc = N/A
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = No
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AllowSpecResourcesUsage = No
> AuthAltTypes = (null)
> AuthInfo = (null)
> AuthType = auth/munge
> BatchStartTimeout = 10 sec
>
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FederationParameters = (null)
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = gpu
> GpuFreqDef = high,memory=high
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
> HASH_VAL = Match
> HealthCheckInterval = 300 sec
> HealthCheckNodeState = ANY
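For reference, the per-"node type" workaround mentioned above looks
roughly like this (partition names, node groupings and core/GPU counts
are illustrative, not our actual config):

### one CPU partition per node type (sketch)
# 12-core nodes with 2 GPUs each: CPU-only jobs get at most 12 - 2 = 10 cores
PartitionName=cpu12 Nodes=gpunode4,gpunode8 MaxCPUsPerNode=10 DefaultTime=60 MaxTime=168:00:00 State=UP
# 20-core nodes with 4 GPUs each: CPU-only jobs get at most 20 - 4 = 16 cores
PartitionName=cpu20 Nodes=gpunode6,gpunode7 MaxCPUsPerNode=16 DefaultTime=60 MaxTime=168:00:00 State=UP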
[slurm-users] Core reserved/bound to a GPU
Hi list,

I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU
cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks,
OpenHPC team, for making those!)

Everything works great so far, but now I would like to bind a specific
core to each GPU on each node. By "bind" I mean to make a particular core
not assignable to a CPU-only job, so that the GPU is available whatever
the CPU workload on the node. I'm asking this because in the current
state a CPU-only user can monopolize a whole node, preventing a GPU user
from coming in, as there is no CPU available even if the GPU is free. I'm
not sure what is the best way to enforce this. Hope this is clear :)

Any help greatly appreciated!

Here are my gres.conf, cgroup.conf, and partition configuration, followed
by the output of 'scontrol show config':

### gres.conf
NodeName=gpunode1 Name=gpu File=/dev/nvidia0
NodeName=gpunode1 Name=gpu File=/dev/nvidia1
NodeName=gpunode1 Name=gpu File=/dev/nvidia2
NodeName=gpunode1 Name=gpu File=/dev/nvidia3
NodeName=gpunode2 Name=gpu File=/dev/nvidia0
NodeName=gpunode2 Name=gpu File=/dev/nvidia1
NodeName=gpunode2 Name=gpu File=/dev/nvidia2
NodeName=gpunode3 Name=gpu File=/dev/nvidia0
NodeName=gpunode3 Name=gpu File=/dev/nvidia1
NodeName=gpunode3 Name=gpu File=/dev/nvidia2
NodeName=gpunode3 Name=gpu File=/dev/nvidia3
NodeName=gpunode3 Name=gpu File=/dev/nvidia4
NodeName=gpunode3 Name=gpu File=/dev/nvidia5
NodeName=gpunode3 Name=gpu File=/dev/nvidia6
NodeName=gpunode3 Name=gpu File=/dev/nvidia7
NodeName=gpunode4 Name=gpu File=/dev/nvidia0
NodeName=gpunode4 Name=gpu File=/dev/nvidia1
NodeName=gpunode5 Name=gpu File=/dev/nvidia0
NodeName=gpunode5 Name=gpu File=/dev/nvidia1
NodeName=gpunode5 Name=gpu File=/dev/nvidia2
NodeName=gpunode5 Name=gpu File=/dev/nvidia3
NodeName=gpunode5 Name=gpu File=/dev/nvidia4
NodeName=gpunode5 Name=gpu File=/dev/nvidia5
NodeName=gpunode6 Name=gpu File=/dev/nvidia0
NodeName=gpunode6 Name=gpu File=/dev/nvidia1
NodeName=gpunode6 Name=gpu File=/dev/nvidia2
NodeName=gpunode6 Name=gpu File=/dev/nvidia3
NodeName=gpunode7 Name=gpu File=/dev/nvidia0
NodeName=gpunode7 Name=gpu File=/dev/nvidia1
NodeName=gpunode7 Name=gpu File=/dev/nvidia2
NodeName=gpunode7 Name=gpu File=/dev/nvidia3
NodeName=gpunode8 Name=gpu File=/dev/nvidia0
NodeName=gpunode8 Name=gpu File=/dev/nvidia1

### cgroup.conf
CgroupAutomount=yes
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no
ConstrainDevices=yes

### partitions configuration ###
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP

### Slurm configuration ###
Configuration data as of 2020-08-31T16:23:54
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = sms.mycluster
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = No
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState = ANY
HealthCheckProgram = /usr/sbin/nhc
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
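One side note on the gres.conf above: gres.conf also accepts bracketed
device ranges and a Cores= field that records which cores are local to
each GPU. Cores= does not reserve cores for GPU jobs, but it helps the
scheduler prefer GPU-local cores. A compact sketch, where the Cores=
ranges are illustrative and would need to match the real node topology:

### gres.conf (compact sketch; Cores= ranges are illustrative)
NodeName=gpunode1 Name=gpu File=/dev/nvidia[0-1] Cores=0-5
NodeName=gpunode1 Name=gpu File=/dev/nvidia[2-3] Cores=6-11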