Re: [slurm-users] CPUSpecList confusion

Marcus Wagner Wed, 14 Dec 2022 22:28:08 -0800

Hi Paul,

as Slurm uses hwloc, I was looking into these tools a little bit deeper.
Using your script, I saw e.g. the following output on one node:


=== 31495434
CPU_IDs=21-23,25
21-23,25
=== 31495433
CPU_IDs=16-18,20
10-11,15,17
=== 31487399
CPU_IDs=15
9

That does not match your schemes and at first sight seems more random.

It seems Slurm uses hwloc's logical indices, whereas cgroups uses the
OS/physical indices.
According to the example above (excerpt of the full output of hwloc-ls):

      NUMANode L#1 (P#1 47GB)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#3)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#4)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#5)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#9)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#10)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#15)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#16)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)


That does seem to match.

In short, to get the mapping, one can use
$> hwloc-ls --only pu
...
PU L#10 (P#19)
PU L#11 (P#20)
PU L#12 (P#3)
PU L#13 (P#4)
PU L#14 (P#5)
PU L#15 (P#9)
PU L#16 (P#10)
PU L#17 (P#11)
PU L#18 (P#15)
PU L#19 (P#16)
PU L#20 (P#17)
PU L#21 (P#21)
PU L#22 (P#22)
PU L#23 (P#23)
...
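
A rough sketch of how that lookup could be scripted, in case it is useful.
The helper name logical_to_physical is made up for illustration and is not
part of Slurm or hwloc. It only assumes the "PU L#<n> (P#<m>)" line format
shown above, reads those lines on stdin, and translates a CPU_IDs-style
list of logical indices into OS/physical indices:

```shell
# Hypothetical helper: translate a list of hwloc logical PU indices
# (e.g. "16-18,20") into OS/physical indices.
# Expects "PU L#<logical> (P#<physical>)" lines on stdin.
logical_to_physical() {
    awk -v list="$1" '
        # Remember the physical index for every logical index seen.
        /PU L#[0-9]+ \(P#[0-9]+\)/ {
            l = $0; sub(/.*PU L#/, "", l); sub(/ .*/, "", l)
            p = $0; sub(/.*\(P#/, "", p); sub(/\).*/, "", p)
            phys[l + 0] = p + 0
        }
        END {
            out = ""
            n = split(list, parts, ",")
            for (i = 1; i <= n; i++) {
                # Each part is either a single index or a "lo-hi" range.
                m = split(parts[i], r, "-")
                lo = r[1] + 0
                hi = (m > 1) ? r[2] + 0 : lo
                for (c = lo; c <= hi; c++) {
                    if (!(c in phys)) {
                        print "no PU with logical index " c > "/dev/stderr"
                        exit 1
                    }
                    out = out (out == "" ? "" : ",") phys[c]
                }
            }
            print out
        }
    '
}
```

It would be called as "hwloc-ls --only pu | logical_to_physical 16-18,20".
With the mapping listed above that gives 10,11,15,17, which is the same set
the cgroup reports for job 31495433 (10-11,15,17).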


Best
Marcus

On 14.12.2022 at 18:11, Paul Raines wrote:

Ugh.  Guess I cannot count.  The mapping on that last node DOES work with the 
"alternating" scheme where we have

  0  0
  1  2
  2  4
  3  6
  4  8
  5 10
  6 12
  7 14
  8 16
  9 18
10 20
11 22
12  1
13  3
14  5
15  7
16  9
17 11
18 13
19 15
20 17
21 19
22 21
23 23

so CPU_IDs=8-11,20-23 does correspond to cgroup 16-23
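
The table above can be written as a small formula. This is only a sketch
under the assumption that the OS interleaves cores across sockets while
Slurm numbers all cores of the first socket before the second; the function
name slurm_to_os is hypothetical and the formula is inferred from the table,
not taken from the Slurm source:

```shell
# Hypothetical helper for the "alternating" scheme only.
# $1 = Slurm CPU ID, $2 = number of sockets, $3 = cores per socket
# Assumption: Slurm ID = socket * cores_per_socket + core,
#             OS ID    = core * sockets + socket.
slurm_to_os() {
    echo $(( ($1 % $3) * $2 + $1 / $3 ))
}
```

For the 2 x 12 node above, "slurm_to_os 8 2 12" gives 16 and
"slurm_to_os 12 2 12" gives 1. Running it over 8-11 and 20-23 yields
16,18,20,22 and 17,19,21,23, i.e. the set 16-23.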

Using the script

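cd /sys/fs/cgroup/cpuset/slurm
# For each job cgroup, print Slurm's CPU_IDs followed by the cgroup's CPU list
for d in $(find . -name 'job_*') ; do
   j=${d##*job_}        # job id is whatever follows "job_" in the path
   echo "=== $j"
   scontrol -d show job "$j" | grep -o 'CPU_IDs=[^ ]*'
   cat "$d/cpuset.effective_cpus"
done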

=== 1967214
CPU_IDs=8-11,20-23
16-23
=== 1960208
CPU_IDs=12-19
1,3,5,7,9,11,13,15
=== 1966815
CPU_IDs=0
0
=== 1966821
CPU_IDs=6
12
=== 1966818
CPU_IDs=3
6
=== 1966816
CPU_IDs=1
2
=== 1966822
CPU_IDs=7
14
=== 1966820
CPU_IDs=5
10
=== 1966819
CPU_IDs=4
8
=== 1966817
CPU_IDs=2
4

On all my nodes I see just two schemes.  The alternating odd/even one above and
one that does not alternate, as on this box with

CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1

=== 1966495
CPU_IDs=0-2
0-2
=== 1966498
CPU_IDs=10-12
10-12
=== 1966502
CPU_IDs=26-28
26-28
=== 1960064
CPU_IDs=7-9,13-25
7-9,13-25
=== 1954480
CPU_IDs=3-6
3-6


On Wed, 14 Dec 2022 9:42am, Paul Raines wrote:


Yes, I see that on some of my other machines too.  So apicid is definitely not 
what SLURM is using but somehow just lines up that way on this one machine I 
have.

I think the issue is cgroups counts starting at 0 all the cores on the first 
socket, then all the cores on the second socket.  But SLURM (on a two socket 
box) counts 0 as the first core on the first socket, 1 as the first core on the 
second socket, 2 as the second core on the first socket,
3 as the second core on the second socket, and so on. (Looks like I am
wrong: see below)

Why slurm does this instead of just using the assignments cgroups uses
I have no idea.  Hopefully one of the SLURM developers reads this
and can explain

Looking at another SLURM node I have (where cgroups v1 is still in use
and HT turned off) with definition

CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1

I find

[root@r440-17 ~]# egrep '^(apicid|proc)' /proc/cpuinfo  | tail -4
processor       : 22
apicid          : 22
processor       : 23
apicid          : 54

So apicid's are NOT going to work

# scontrol -d show job 1966817 | grep CPU_ID
    Nodes=r17 CPU_IDs=2 Mem=16384 GRES=
# cat /sys/fs/cgroup/cpuset/slurm/uid_3776056/job_1966817/cpuset.cpus
4

If Slurm has '2' this should be the second core on the first socket so should 
be '1' in cgroups, but it is 4 as we see above which is the fifth core on the 
first socket.  So I guess I was wrong above.

But in /proc/cpuinfo the apicid for processor 4 is 2!!!  So is apicid
right after all?  Nope, on the same machine I have

# scontrol -d show job 1960208 | grep CPU_ID
    Nodes=r17 CPU_IDs=12-19 Mem=51200 GRES=
# cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1960208/cpuset.cpus
1,3,5,7,9,11,13,15

and in /proc/cpuinfo the apicid for processor 12 is 16

# scontrol -d show job 1967214 | grep CPU_ID
    Nodes=r17 CPU_IDs=8-11,20-23 Mem=51200 GRES=
# cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1967214/cpuset.cpus
16-23

I am totally lost now. Seems totally random. SLURM devs?  Any insight?


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 14 Dec 2022 1:33am, Marcus Wagner wrote:

 Hi Paul,

 sorry to say, but that has to be some coincidence on your system. I've
 never seen Slurm report core numbers that are higher than the
 total number of cores.

 I have e.g. an Intel Platinum 8160 here. 24 cores per socket, no
 HyperThreading activated.
 Yet here are the last lines of /proc/cpuinfo:

 processor       : 43
 apicid          : 114
 processor       : 44
 apicid          : 116
 processor       : 45
 apicid          : 118
 processor       : 46
 apicid          : 120
 processor       : 47
 apicid          : 122

 Never seen Slurm reporting core numbers for a job > 96.
 Nonetheless, I agree, the cores reported by Slurm mostly have nothing to
 do with the cores reported e.g. by cgroups.
 Since Slurm creates the cgroups, I wonder why it reports some kind of
 abstract core ID, because it should know which cores are used, as it
 creates the cgroups for the jobs.

 Best
 Marcus

 On 13.12.2022 at 16:39, Paul Raines wrote:


  Yes, looks like SLURM is using the apicid that is in /proc/cpuinfo.
  The first 14 cpus in /proc/cpuinfo (procs 0-13) have apicid
  0,2,4,6,8,10,12,14,16,20,22,24,26,28

  So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26
  in slurm.conf it appears to be doing what I want

  $ echo $SLURM_JOB_ID
  9
  $ grep -i ^cpu /proc/self/status
  Cpus_allowed:   000f0000,000f0000
  Cpus_allowed_list:      16-19,48-51
  $ scontrol -d show job 9 | grep CPU_ID
        Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=

  apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo

  Thanks

  -- Paul Raines (http://help.nmr.mgh.harvard.edu)



  On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:

  In the slurm.conf manual they state the CpuSpecList ids are "abstract", and
  in the CPU management docs they reinforce the notion that the abstract Slurm
  IDs are not related to the Linux hardware IDs, so that is probably the
  source of the behavior. I unfortunately don't have more information.

  On Tue, Dec 13, 2022 at 9:45 AM Paul Raines <rai...@nmr.mgh.harvard.edu> wrote:


  Hmm.  Actually looks like confusion between CPU IDs on the system
  and what SLURM thinks the IDs are

  # scontrol -d show job 8
  ...
        Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
  ...

  # cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
  7-10,39-42


  -- Paul Raines (http://help.nmr.mgh.harvard.edu)



  On Tue, 13 Dec 2022 9:40am, Paul Raines wrote:

>   Oh but that does explain the CfgTRES=cpu=14.  With the CpuSpecList
>   below and SlurmdOffSpec I do get CfgTRES=cpu=50 so that makes sense.
>
>   The issue remains that though the number of cpus in CpuSpecList
>   is taken into account, the exact IDs seem to be ignored.
>
>   -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>   On Tue, 13 Dec 2022 9:34am, Paul Raines wrote:
>
>>    I have tried it both ways with the same result.  The assigned CPUs
>>    will be both in and out of the range given to CpuSpecList
>>
>>    I tried setting using commas instead of ranges so used
>>
>>    CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13
>>
>>    But still does not work
>>
>>    $ srun -p basic -N 1 --ntasks-per-node=1 --mem=25G \
>>    --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
>>    $ grep -i ^cpu /proc/self/status
>>    Cpus_allowed:   00000780,00000780
>>    Cpus_allowed_list:      7-10,39-42
>>
>>    -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>    On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
>>
>>>     Hi Paul,
>>>
>>>>     Nodename=foobar \
>>>>        CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>>>>        RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>>>>        TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1
>>>>
>>>>     The slurm.conf also has:
>>>>
>>>>     ProctrackType=proctrack/cgroup
>>>>     TaskPlugin=task/affinity,task/cgroup
>>>>     TaskPluginParam=Cores,SlurmdOffSpec,Verbose
>>>>
>>>     Doesn't setting SlurmdOffSpec tell Slurmd that it should NOT use the
>>>     CPUs in the spec list?
>>>     (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec)
>>>     In this case, I believe it uses what is left, which is the 0-13.  We are
>>>     just starting to work on this ourselves, and were looking at this
>>>     setting.
>>>
>>>     Best,
>>>
>>>     -Sean
>>>
  The information in this e-mail is intended only for the person to whom it
  is addressed.  If you believe this e-mail was sent to you in error and the
  e-mail contains patient information, please contact the Mass General
  Brigham Compliance HelpLine at
  https://www.massgeneralbrigham.org/complianceline
  <https://www.massgeneralbrigham.org/complianceline> .
  Please note that this e-mail is not secure (encrypted).  If you do not
  wish to continue communication over unencrypted e-mail, please notify the
  sender of this message immediately.  Continuing to send or respond to
  e-mail after receiving this message means you understand and accept this
  risk and wish to continue to communicate over unencrypted e-mail.



 --
 Dipl.-Inf. Marcus Wagner

 IT Center
 Group: Server, Storage, HPC
 Department: Systems and Operations
 RWTH Aachen University
 Seffenter Weg 23
 52074 Aachen
 Tel: +49 241 80-24383
 Fax: +49 241 80-624383
 wag...@itc.rwth-aachen.de
 www.itc.rwth-aachen.de

 Social media channels of the IT Center:
 https://blog.rwth-aachen.de/itc/
 https://www.facebook.com/itcenterrwth
 https://www.linkedin.com/company/itcenterrwth
 https://twitter.com/ITCenterRWTH
 https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
