Hi Andreas,
It might be that this is one of the bugs in Slurm 18.
I think I will open a bug report and see what they say.
Thank you very much, nonetheless.
Best
Marcus
On 2/14/19 2:36 PM, Andreas Henkel wrote:
Hi Marcus,
for us, slurmd -C as well as numactl -H looked fine, too. But we're
using task/cgroup only, and every job starting on a Skylake node gave us
error("task/cgroup: task[%u] infinite loop broken while trying " "to
provision compute elements using %s (bitmap:%s)",
from src/plugins/task/cgroup/task_cgroup_cpuset.c, and the process
placement was wrong.
Once we deactivated subnuma, everything ran fine.
But for completeness: I tested that on Slurm 17 (and maybe the core
was partly 16 at that time). We're using Slurm 17.11.13 and I'll check
the behavior there in the next few days.
I'm hesitant to switch to 18 because of the bugs that have appeared
with every minor release lately.
Best,
Andreas
On 14.02.19 12:54, Marcus Wagner wrote:
Hi Andreas,
as slurmd -C shows, it detects 4 NUMA nodes and takes these as sockets.
This is also how we configured Slurm.
numactl -H clearly shows the four domains and which belongs to which
socket:
node distances:
node 0 1 2 3
0: 10 11 21 21
1: 11 10 21 21
2: 21 21 10 11
3: 21 21 11 10
hwloc shows much the same:
$> hwloc-distances
Relative latency matrix between 4 NUMANodes (depth 3) by logical
indexes (below Machine L#0):
index 0 1 2 3
0 1.000 1.100 2.100 2.100
1 1.100 1.000 2.100 2.100
2 2.100 2.100 1.000 1.100
3 2.100 2.100 1.100 1.000
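To make the "which domain belongs to which socket" reading explicit, here is a minimal Python sketch that groups the NUMA nodes by distance. The matrix is copied from the numactl -H output above; the cut-off of 20 and the function name are assumptions chosen for illustration, not anything Slurm or hwloc uses.

```python
# Distance matrix copied from the `numactl -H` output in this message.
DISTANCES = [
    [10, 11, 21, 21],
    [11, 10, 21, 21],
    [21, 21, 10, 11],
    [21, 21, 11, 10],
]

# ASSUMPTION: a distance below this cut-off means "same physical socket".
SAME_SOCKET_MAX = 20


def group_by_socket(distances, cutoff=SAME_SOCKET_MAX):
    """Group NUMA node indexes whose distance to a group's first node is below the cut-off."""
    groups = []
    for node, row in enumerate(distances):
        for group in groups:
            if row[group[0]] < cutoff:
                group.append(node)
                break
        else:
            # No existing group is close enough, so this node starts a new one.
            groups.append([node])
    return groups


print(group_by_socket(DISTANCES))  # [[0, 1], [2, 3]]
```

With this matrix, nodes 0 and 1 end up in one group and nodes 2 and 3 in the other, i.e. two sub-NUMA domains per physical socket.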
We use the task/affinity plugin together with task/cgroup, but in
cgroup.conf we set affinity to off, so that the task/affinity plugin
does the magic.
With Slurm configured that way, we also see it do a round robin over the
NUMA nodes by default (12 tasks on a 48-core machine):
ncm0071.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2
+pemap 0,48
ncm0071.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2
+pemap 3,51
ncm0071.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2
+pemap 24,72
ncm0071.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2
+pemap 27,75
ncm0071.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2
+pemap 1,49
ncm0071.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2
+pemap 4,52
ncm0071.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2
+pemap 25,73
ncm0071.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2
+pemap 28,76
ncm0071.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2
+pemap 2,50
ncm0071.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2
+pemap 5,53
ncm0071.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2
+pemap 26,74
ncm0071.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2
+pemap 29,77
Using #SBATCH -m block:block results in all tasks on one NUMA node:
ncm0071.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2
+pemap 0,48
ncm0071.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2
+pemap 1,49
ncm0071.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2
+pemap 2,50
ncm0071.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2
+pemap 6,54
ncm0071.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2
+pemap 7,55
ncm0071.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2
+pemap 8,56
ncm0071.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2
+pemap 12,60
ncm0071.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2
+pemap 13,61
ncm0071.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2
+pemap 14,62
ncm0071.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2
+pemap 18,66
ncm0071.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2
+pemap 19,67
ncm0071.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2
+pemap 20,68
Isn't that what is needed, or am I missing something? What
would be "better" with hwloc2?
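The two placements can be compared by counting tasks per NUMA domain. The sketch below is Python, and the CPU-to-domain mapping in it is an assumption: it is inferred from the two listings above (cores interleaved in chunks of three between the two domains of a socket), not read from numactl -H. So it shows that the listings are consistent with that layout rather than verifying it independently. On a real node the mapping should be taken from numactl -H or lstopo.

```python
from collections import Counter

# First CPU id of each "+pemap" pair, in task order, copied from the listings above.
ROUND_ROBIN = [0, 3, 24, 27, 1, 4, 25, 28, 2, 5, 26, 29]
BLOCK_BLOCK = [0, 1, 2, 6, 7, 8, 12, 13, 14, 18, 19, 20]


def numa_domain(cpu, cores_per_socket=24, chunk=3):
    """Map a core id to a NUMA domain.

    ASSUMPTION: hypothetical layout inferred from the listings, with
    24 cores per physical socket and cores alternating between the two
    sub-NUMA domains of a socket in chunks of three.
    """
    socket, offset = divmod(cpu, cores_per_socket)
    return socket * 2 + (offset // chunk) % 2


def tasks_per_domain(cpus):
    """Count how many tasks land on each NUMA domain."""
    return dict(sorted(Counter(numa_domain(c) for c in cpus).items()))


print(tasks_per_domain(ROUND_ROBIN))  # {0: 3, 1: 3, 2: 3, 3: 3}
print(tasks_per_domain(BLOCK_BLOCK))  # {0: 12}
```

Under that assumed layout the default distribution spreads three tasks onto each of the four domains, while block:block packs all twelve onto domain 0.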
Apart from my original problem, we are fairly happy with Slurm so far,
but that one is giving me grey hair :/
Best
Marcus
On 2/14/19 11:27 AM, Henkel, Andreas wrote:
Hi Marcus,
We have Skylake too and it didn’t work for us. We used cgroups only,
and process binding went completely haywire with subnuma enabled.
While searching for solutions I found that hwloc supports
subnuma only in version 2 and later (when looking for Skylake in hwloc you
will get hits in version 2 branches only). At least hwloc 2.x made
NUMA blocks child objects, whereas hwloc 1.x has NUMA blocks as
parents only. I think that was the reason why there was a special
branch in hwloc for handling subnuma layouts of Xeon Phi.
But I’ll be happy if you prove me wrong.
Best,
Andreas
On 14.02.2019 at 09:32, Marcus Wagner
<wag...@itc.rwth-aachen.de> wrote:
Hi Andreas,
On 2/14/19 8:56 AM, Henkel, Andreas wrote:
Hi Marcus,
More ideas:
A CPU doesn’t always count as a core but may mean one
thread, which makes a difference.
Maybe the behavior of CR_ONE_TASK is still neither solid nor properly
documented, and ntasks and ntasks-per-node are honored differently
internally. If so, using ntasks alone can mean using all threads
for Slurm, even if the binding itself may be correct.
Obviously, in your results Slurm handles the options differently.
Have you tried configuring the node with cpus=96? What output do
you get from slurmd -C?
Not yet, as this is not the desired behaviour. We want to schedule
by cores. But I will try that. slurmd -C output is the following:
NodeName=ncm0708 slurmd: Considering each NUMA node as a socket
CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12
ThreadsPerCore=2 RealMemory=191905
UpTime=6-21:30:02
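As a quick cross-check of the numbers in that output, the following Python sketch parses the numeric key=value fields and recomputes the core and thread counts. The field values are copied from the slurmd -C line above; the helper name parse_fields is only illustrative.

```python
import re

# Numeric fields copied from the `slurmd -C` output above.
SLURMD_C = (
    "CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 "
    "ThreadsPerCore=2 RealMemory=191905"
)


def parse_fields(line):
    """Return the numeric key=value pairs of a slurmd -C line as a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


fields = parse_fields(SLURMD_C)
cores = fields["Boards"] * fields["SocketsPerBoard"] * fields["CoresPerSocket"]
threads = cores * fields["ThreadsPerCore"]

print(cores, threads)  # 48 96
```

So the node is reported with 48 cores and 96 hardware threads, and the CPUs=96 field counts threads rather than cores.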
Is this a new architecture like Skylake? With
subnuma layouts, Slurm cannot handle it without hwloc2.
Yes, we have Skylake and, as you can see in the above output, we
have subnuma clustering enabled. Still, we only use the hwloc that comes
with CentOS 7: hwloc-1.11.8-4.el7.x86_64
Where did you get the information that hwloc2 is needed?
Have you tried to use srun -v(vv) instead of sbatch? Maybe you can
get a glimpse of what Slurm actually does with your options.
The only strange thing I can observe is the following:
srun: threads : 60
What threads is srun talking about there?
Nonetheless, here is the full output:
$> srun --ntasks=48 --ntasks-per-node=48 -vvv hostname
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user : `mw445520'
srun: uid : 40574
srun: gid : 40574
srun: cwd :
/rwthfs/rz/cluster/home/mw445520/tests/slurm/cgroup
srun: ntasks : 48 (set)
srun: nodes : 1 (default)
srun: jobid : 4294967294 (default)
srun: partition : default
srun: profile : `NotSet'
srun: job name : `hostname'
srun: reservation : `(null)'
srun: burst_buffer : `(null)'
srun: wckey : `(null)'
srun: cpu_freq_min : 4294967294
srun: cpu_freq_max : 4294967294
srun: cpu_freq_gov : 4294967294
srun: switches : -1
srun: wait-for-switches : -1
srun: distribution : unknown
srun: cpu-bind : default (0)
srun: mem-bind : default (0)
srun: verbose : 3
srun: slurmd_debug : 0
srun: immediate : false
srun: label output : false
srun: unbuffered IO : false
srun: overcommit : false
srun: threads : 60
srun: checkpoint_dir : /w0/slurm/checkpoint
srun: wait : 0
srun: nice : -2
srun: account : (null)
srun: comment : (null)
srun: dependency : (null)
srun: exclusive : false
srun: bcast : false
srun: qos : (null)
srun: constraints :
srun: reboot : yes
srun: preserve_env : false
srun: network : (null)
srun: propagate : NONE
srun: prolog : (null)
srun: epilog : (null)
srun: mail_type : NONE
srun: mail_user : (null)
srun: task_prolog : (null)
srun: task_epilog : (null)
srun: multi_prog : no
srun: sockets-per-node : -2
srun: cores-per-socket : -2
srun: threads-per-core : -2
srun: ntasks-per-node : 48
srun: ntasks-per-socket : -2
srun: ntasks-per-core : -2
srun: plane_size : 4294967294
srun: core-spec : NA
srun: power :
srun: cpus-per-gpu : 0
srun: gpus : (null)
srun: gpu-bind : (null)
srun: gpu-freq : (null)
srun: gpus-per-node : (null)
srun: gpus-per-socket : (null)
srun: gpus-per-task : (null)
srun: mem-per-gpu : 0
srun: remote command : `hostname'
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0007
srun: debug2: srun PMI messages to port=34521
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 35465
srun: debug: Entering _msg_thr_internal
srun: debug: Munge authentication plugin loaded
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node
configuration is not available
Best
Marcus
Best,
Andreas
On 14.02.2019 at 08:34, Marcus Wagner
<wag...@itc.rwth-aachen.de> wrote:
Hi Chris,
these are 96-thread nodes with 48 cores. You are right that if we
set it to 24, the job will get scheduled. But then only half of
the node is used. On the other hand, if I only use --ntasks=48,
Slurm schedules all tasks onto the same node. The hyperthread of
each core is included in the cgroup, and the task/affinity plugin
also correctly binds the hyperthread together with the core
(output of a small, ugly test script of ours; the last two numbers are
the core and its hyperthread):
ncm0728.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 0,48
ncm0728.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 26,74
ncm0728.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 29,77
ncm0728.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 6,54
ncm0728.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 9,57
ncm0728.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 30,78
ncm0728.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 33,81
ncm0728.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 7,55
ncm0728.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 10,58
ncm0728.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 31,79
ncm0728.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 34,82
ncm0728.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 3,51
ncm0728.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 8,56
ncm0728.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 11,59
ncm0728.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 32,80
ncm0728.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 35,83
ncm0728.hpc.itc.rwth-aachen.de <24> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 12,60
ncm0728.hpc.itc.rwth-aachen.de <25> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 15,63
ncm0728.hpc.itc.rwth-aachen.de <26> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 36,84
ncm0728.hpc.itc.rwth-aachen.de <27> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 39,87
ncm0728.hpc.itc.rwth-aachen.de <28> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 13,61
ncm0728.hpc.itc.rwth-aachen.de <29> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 16,64
ncm0728.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 24,72
ncm0728.hpc.itc.rwth-aachen.de <30> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 37,85
ncm0728.hpc.itc.rwth-aachen.de <31> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 40,88
ncm0728.hpc.itc.rwth-aachen.de <32> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 14,62
ncm0728.hpc.itc.rwth-aachen.de <33> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 17,65
ncm0728.hpc.itc.rwth-aachen.de <34> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 38,86
ncm0728.hpc.itc.rwth-aachen.de <35> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 41,89
ncm0728.hpc.itc.rwth-aachen.de <36> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 18,66
ncm0728.hpc.itc.rwth-aachen.de <37> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 21,69
ncm0728.hpc.itc.rwth-aachen.de <38> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 42,90
ncm0728.hpc.itc.rwth-aachen.de <39> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 45,93
ncm0728.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 27,75
ncm0728.hpc.itc.rwth-aachen.de <40> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 19,67
ncm0728.hpc.itc.rwth-aachen.de <41> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 22,70
ncm0728.hpc.itc.rwth-aachen.de <42> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 43,91
ncm0728.hpc.itc.rwth-aachen.de <43> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 46,94
ncm0728.hpc.itc.rwth-aachen.de <44> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 20,68
ncm0728.hpc.itc.rwth-aachen.de <45> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 23,71
ncm0728.hpc.itc.rwth-aachen.de <46> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 44,92
ncm0728.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 47,95
ncm0728.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 1,49
ncm0728.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 4,52
ncm0728.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 25,73
ncm0728.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 28,76
ncm0728.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 2,50
ncm0728.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 5,53
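The claim that each task is bound to a core together with its hyperthread can be checked mechanically. This Python sketch parses a few lines in the format of the listing above (three entries copied from it, including the line wrapping) and verifies that the second CPU of every pair is the first plus 48. The regular expression and function name are assumptions based only on the output format shown here.

```python
import re

# Three entries copied from the listing above, wrapped as in the message.
SAMPLE = """\
ncm0728.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 0,48
ncm0728.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 26,74
ncm0728.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 47,95
"""

# Task rank in angle brackets, then the CPU pair after "+pemap".
ENTRY_RE = re.compile(r"<(\d+)> .*?\+pemap (\d+),(\d+)")


def parse_bindings(text):
    """Return {task_rank: (core_cpu, sibling_cpu)} from the test script output."""
    flat = re.sub(r"\s+", " ", text)  # undo the mail client's line wrapping
    return {
        int(rank): (int(core), int(sibling))
        for rank, core, sibling in ENTRY_RE.findall(flat)
    }


bindings = parse_bindings(SAMPLE)
# On a 48-core node, the hyperthread sibling is expected at core id + 48.
print(all(sibling - core == 48 for core, sibling in bindings.values()))  # True
```

Fed the full listing, the same check would also show whether all 48 ranks got distinct cores.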
--ntasks=48:
NodeList=ncm0728
BatchHost=ncm0728
NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=182400M,node=1,billing=48
--ntasks=48
--ntasks-per-node=24:
NodeList=ncm[0438-0439]
BatchHost=ncm0438
NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=182400M,node=2,billing=48
--ntasks=48
--ntasks-per-node=48:
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node
configuration is not available
Isn't the first essentially the same as the last, the only
difference being that I want to force Slurm to put all tasks onto one
node?
Best
Marcus
On 2/14/19 7:15 AM, Chris Samuel wrote:
On Wednesday, 13 February 2019 4:48:05 AM PST Marcus Wagner wrote:
#SBATCH --ntasks-per-node=48
I wouldn't mind betting that if you set that to 24 it will work, and each
thread will be assigned a single core with the 2 thread units on it.
All the best,
Chris
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de