Hi Marcus,

We have Skylake too, and it didn't work for us. We used cgroups only, and process binding went completely haywire with sub-NUMA clustering enabled. While searching for solutions I found that hwloc only supports sub-NUMA from version 2 on (when looking for Skylake in hwloc you will get hits in the 2.x branches only). At least hwloc 2.x makes NUMA nodes children objects, whereas hwloc 1.x has NUMA nodes as parents only. I think that is why there was a special branch in hwloc for handling the sub-NUMA layouts of Xeon Phi. But I'll be happy if you prove me wrong.
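A quick way to check which hwloc a given Slurm build actually picks up (just a sketch: it assumes slurmd was built against a shared libhwloc, and on CentOS lstopo may live in the hwloc or hwloc-gui package):

lstopo --version
ldd $(which slurmd) | grep hwloc
lstopo --no-io

With hwloc 1.x I would expect lstopo to draw the NUMANodes as parents above the cores; with 2.x they show up as children within the package, matching the object-layout difference described above.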
Best,
Andreas

> Am 14.02.2019 um 09:32 schrieb Marcus Wagner <wag...@itc.rwth-aachen.de>:
>
> Hi Andreas,
>
>> On 2/14/19 8:56 AM, Henkel, Andreas wrote:
>> Hi Marcus,
>>
>> More ideas:
>> CPUs do not always count as cores, but can take on the meaning of one thread each, hence the difference.
>> Maybe the behavior of CR_ONE_TASK_PER_CORE is still not solid nor properly documented, and ntasks and ntasks-per-node are honored differently internally. If so, solely using ntasks can mean Slurm uses all threads, even if the resulting binding looks correct.
>> Obviously, in your results Slurm handles the two options differently.
>>
>> Have you tried configuring the node with CPUs=96? What output do you get from slurmd -C?
> Not yet, as this is not the desired behaviour; we want to schedule by cores. But I will try that. The slurmd -C output is the following:
>
> slurmd: Considering each NUMA node as a socket
> NodeName=ncm0708 CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905
> UpTime=6-21:30:02
>
>> Is this a new architecture like Skylake? In case of sub-NUMA layouts, Slurm cannot handle it without hwloc 2.
> Yes, we have Skylake, and as you can see in the output above, we have sub-NUMA clustering enabled. Still, we only use the hwloc that comes with CentOS 7: hwloc-1.11.8-4.el7.x86_64
> Where did you get the information that hwloc 2 is needed?
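A side note on slurmd -C: the NodeName line it prints is meant to be pasted into slurm.conf more or less verbatim, so if you do try scheduling by thread, the sketch would simply be

NodeName=ncm0708 CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905

Whether that then coexists happily with scheduling by core is exactly the open question here.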
>> Have you tried to use srun -v(vv) instead of sbatch? Maybe you can get a glimpse of what Slurm actually does with your options.
> The only strange thing I can observe is the following:
>
> srun: threads : 60
>
> What threads is srun talking about there?
> Nonetheless, here is the full output:
>
> $> srun --ntasks=48 --ntasks-per-node=48 -vvv hostname
> srun: defined options for program `srun'
> srun: --------------- ---------------------
> srun: user : `mw445520'
> srun: uid : 40574
> srun: gid : 40574
> srun: cwd : /rwthfs/rz/cluster/home/mw445520/tests/slurm/cgroup
> srun: ntasks : 48 (set)
> srun: nodes : 1 (default)
> srun: jobid : 4294967294 (default)
> srun: partition : default
> srun: profile : `NotSet'
> srun: job name : `hostname'
> srun: reservation : `(null)'
> srun: burst_buffer : `(null)'
> srun: wckey : `(null)'
> srun: cpu_freq_min : 4294967294
> srun: cpu_freq_max : 4294967294
> srun: cpu_freq_gov : 4294967294
> srun: switches : -1
> srun: wait-for-switches : -1
> srun: distribution : unknown
> srun: cpu-bind : default (0)
> srun: mem-bind : default (0)
> srun: verbose : 3
> srun: slurmd_debug : 0
> srun: immediate : false
> srun: label output : false
> srun: unbuffered IO : false
> srun: overcommit : false
> srun: threads : 60
> srun: checkpoint_dir : /w0/slurm/checkpoint
> srun: wait : 0
> srun: nice : -2
> srun: account : (null)
> srun: comment : (null)
> srun: dependency : (null)
> srun: exclusive : false
> srun: bcast : false
> srun: qos : (null)
> srun: constraints :
> srun: reboot : yes
> srun: preserve_env : false
> srun: network : (null)
> srun: propagate : NONE
> srun: prolog : (null)
> srun: epilog : (null)
> srun: mail_type : NONE
> srun: mail_user : (null)
> srun: task_prolog : (null)
> srun: task_epilog : (null)
> srun: multi_prog : no
> srun: sockets-per-node : -2
> srun: cores-per-socket : -2
> srun: threads-per-core : -2
> srun: ntasks-per-node : 48
> srun: ntasks-per-socket : -2
> srun: ntasks-per-core : -2
> srun: plane_size : 4294967294
> srun: core-spec : NA
> srun: power :
> srun: cpus-per-gpu : 0
> srun: gpus : (null)
> srun: gpu-bind : (null)
> srun: gpu-freq : (null)
> srun: gpus-per-node : (null)
> srun: gpus-per-socket : (null)
> srun: gpus-per-task : (null)
> srun: mem-per-gpu : 0
> srun: remote command : `hostname'
> srun: debug: propagating SLURM_PRIO_PROCESS=0
> srun: debug: propagating UMASK=0007
> srun: debug2: srun PMI messages to port=34521
> srun: debug: Entering slurm_allocation_msg_thr_create()
> srun: debug: port from net_stream_listen is 35465
> srun: debug: Entering _msg_thr_internal
> srun: debug: Munge authentication plugin loaded
> srun: error: CPU count per node can not be satisfied
> srun: error: Unable to allocate resources: Requested node configuration is not available
>
> Best
> Marcus
>
>> Best,
>> Andreas
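That error usually means the node definition slurmctld works with does not offer the CPU layout the request needs. It can help to compare slurmctld's view with what slurmd -C detects, e.g. (a sketch; field names as printed by scontrol):

scontrol show node ncm0708 | grep -E 'CPUTot|Sockets|CoresPerSocket|ThreadsPerCore'

If those numbers differ from the slurmd -C line above, the allocation can be rejected before any binding logic ever runs.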
>>> Am 14.02.2019 um 08:34 schrieb Marcus Wagner <wag...@itc.rwth-aachen.de>:
>>>
>>> Hi Chris,
>>>
>>> These are 96-thread nodes with 48 cores. You are right that if we set it to 24, the job will get scheduled, but then only half of the node is used. On the other hand, if I only use --ntasks=48, Slurm schedules all tasks onto the same node. The hyperthread of each core is included in the cgroup, and the task/affinity plugin also correctly binds the hyperthread together with the core (a small, ugly test script of ours; the last two numbers are the core and its hyperthread):
>>>
>>> ncm0728.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2 +pemap 0,48
>>> ncm0728.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2 +pemap 26,74
>>> ncm0728.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2 +pemap 29,77
>>> ncm0728.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#> unlimited+p2 +pemap 6,54
>>> ncm0728.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#> unlimited+p2 +pemap 9,57
>>> ncm0728.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#> unlimited+p2 +pemap 30,78
>>> ncm0728.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#> unlimited+p2 +pemap 33,81
>>> ncm0728.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#> unlimited+p2 +pemap 7,55
>>> ncm0728.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#> unlimited+p2 +pemap 10,58
>>> ncm0728.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#> unlimited+p2 +pemap 31,79
>>> ncm0728.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#> unlimited+p2 +pemap 34,82
>>> ncm0728.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2 +pemap 3,51
>>> ncm0728.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#> unlimited+p2 +pemap 8,56
>>> ncm0728.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#> unlimited+p2 +pemap 11,59
>>> ncm0728.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#> unlimited+p2 +pemap 32,80
>>> ncm0728.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#> unlimited+p2 +pemap 35,83
>>> ncm0728.hpc.itc.rwth-aachen.de <24> OMP_STACKSIZE: <#> unlimited+p2 +pemap 12,60
>>> ncm0728.hpc.itc.rwth-aachen.de <25> OMP_STACKSIZE: <#> unlimited+p2 +pemap 15,63
>>> ncm0728.hpc.itc.rwth-aachen.de <26> OMP_STACKSIZE: <#> unlimited+p2 +pemap 36,84
>>> ncm0728.hpc.itc.rwth-aachen.de <27> OMP_STACKSIZE: <#> unlimited+p2 +pemap 39,87
>>> ncm0728.hpc.itc.rwth-aachen.de <28> OMP_STACKSIZE: <#> unlimited+p2 +pemap 13,61
>>> ncm0728.hpc.itc.rwth-aachen.de <29> OMP_STACKSIZE: <#> unlimited+p2 +pemap 16,64
>>> ncm0728.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2 +pemap 24,72
>>> ncm0728.hpc.itc.rwth-aachen.de <30> OMP_STACKSIZE: <#> unlimited+p2 +pemap 37,85
>>> ncm0728.hpc.itc.rwth-aachen.de <31> OMP_STACKSIZE: <#> unlimited+p2 +pemap 40,88
>>> ncm0728.hpc.itc.rwth-aachen.de <32> OMP_STACKSIZE: <#> unlimited+p2 +pemap 14,62
>>> ncm0728.hpc.itc.rwth-aachen.de <33> OMP_STACKSIZE: <#> unlimited+p2 +pemap 17,65
>>> ncm0728.hpc.itc.rwth-aachen.de <34> OMP_STACKSIZE: <#> unlimited+p2 +pemap 38,86
>>> ncm0728.hpc.itc.rwth-aachen.de <35> OMP_STACKSIZE: <#> unlimited+p2 +pemap 41,89
>>> ncm0728.hpc.itc.rwth-aachen.de <36> OMP_STACKSIZE: <#> unlimited+p2 +pemap 18,66
>>> ncm0728.hpc.itc.rwth-aachen.de <37> OMP_STACKSIZE: <#> unlimited+p2 +pemap 21,69
>>> ncm0728.hpc.itc.rwth-aachen.de <38> OMP_STACKSIZE: <#> unlimited+p2 +pemap 42,90
>>> ncm0728.hpc.itc.rwth-aachen.de <39> OMP_STACKSIZE: <#> unlimited+p2 +pemap 45,93
>>> ncm0728.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2 +pemap 27,75
>>> ncm0728.hpc.itc.rwth-aachen.de <40> OMP_STACKSIZE: <#> unlimited+p2 +pemap 19,67
>>> ncm0728.hpc.itc.rwth-aachen.de <41> OMP_STACKSIZE: <#> unlimited+p2 +pemap 22,70
>>> ncm0728.hpc.itc.rwth-aachen.de <42> OMP_STACKSIZE: <#> unlimited+p2 +pemap 43,91
>>> ncm0728.hpc.itc.rwth-aachen.de <43> OMP_STACKSIZE: <#> unlimited+p2 +pemap 46,94
>>> ncm0728.hpc.itc.rwth-aachen.de <44> OMP_STACKSIZE: <#> unlimited+p2 +pemap 20,68
>>> ncm0728.hpc.itc.rwth-aachen.de <45> OMP_STACKSIZE: <#> unlimited+p2 +pemap 23,71
>>> ncm0728.hpc.itc.rwth-aachen.de <46> OMP_STACKSIZE: <#> unlimited+p2 +pemap 44,92
>>> ncm0728.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#> unlimited+p2 +pemap 47,95
>>> ncm0728.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2 +pemap 1,49
>>> ncm0728.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2 +pemap 4,52
>>> ncm0728.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2 +pemap 25,73
>>> ncm0728.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2 +pemap 28,76
>>> ncm0728.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2 +pemap 2,50
>>> ncm0728.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2 +pemap 5,53
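As a cross-check of such binding output that needs nothing beyond util-linux, a one-liner like this (a sketch; taskset -cp prints the affinity list of the launched shell) should show one core plus its hyperthread per task, matching the pemap pairs above:

srun --ntasks=48 bash -c 'echo "$(hostname) task $SLURM_PROCID: $(taskset -cp $$)"' | sort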
>>>
>>> --ntasks=48:
>>>
>>> NodeList=ncm0728
>>> BatchHost=ncm0728
>>> NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>> TRES=cpu=48,mem=182400M,node=1,billing=48
>>>
>>> --ntasks=48
>>> --ntasks-per-node=24:
>>>
>>> NodeList=ncm[0438-0439]
>>> BatchHost=ncm0438
>>> NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>> TRES=cpu=48,mem=182400M,node=2,billing=48
>>>
>>> --ntasks=48
>>> --ntasks-per-node=48:
>>>
>>> sbatch: error: CPU count per node can not be satisfied
>>> sbatch: error: Batch job submission failed: Requested node configuration is not available
>>>
>>> Isn't the first essentially the same as the last, with the difference that I want to force Slurm to put all tasks onto one node?
>>>
>>> Best
>>> Marcus
>>>
>>>> On 2/14/19 7:15 AM, Chris Samuel wrote:
>>>>> On Wednesday, 13 February 2019 4:48:05 AM PST Marcus Wagner wrote:
>>>>>
>>>>> #SBATCH --ntasks-per-node=48
>>>> I wouldn't mind betting that if you set that to 24 it will work, and each thread will be assigned a single core with the 2 thread units on it.
>>>>
>>>> All the best,
>>>> Chris
>>>
>>> --
>>> Marcus Wagner, Dipl.-Inf.
>>>
>>> IT Center
>>> Abteilung: Systeme und Betrieb
>>> RWTH Aachen University
>>> Seffenter Weg 23
>>> 52074 Aachen
>>> Tel: +49 241 80-24383
>>> Fax: +49 241 80-624383
>>> wag...@itc.rwth-aachen.de
>>> www.itc.rwth-aachen.de
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
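PS: Regarding "isn't the first essentially the same as the last": one way to express the single-node constraint directly, instead of via --ntasks-per-node, would be (a sketch; whether it passes the same CPU-count check on your configuration I cannot say without testing):

#SBATCH --nodes=1
#SBATCH --ntasks=48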