Hi Thomas,

Add the -d flag to scontrol show job, e.g.
# scontrol show job 23891862 -d
JobId=23891862 JobName=SPI_DOWN
   UserId=user1(11283) GroupId=group1(10414) MCS_label=N/A
   Priority=586 Nice=0 Account=group1 QOS=qos1
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=2-00:13:58 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-02-03T19:19:28 EligibleTime=2021-02-03T19:19:28
   AccrueTime=2021-02-03T19:19:31
   StartTime=2021-02-03T19:19:31 EndTime=2021-02-10T19:19:31 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-03T19:19:31
   Partition=gpgpu AllocNode:Sid=spartan-login3:222306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spartan-gpgpu007
   BatchHost=spartan-gpgpu007
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=24000M,node=1,billing=101,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   JOB_GRES=gpu:1
     Nodes=spartan-gpgpu007 CPU_IDs=6-11 Mem=24000 GRES=gpu:1(IDX:1)
   MinCPUsNode=6 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

Note the CPU_IDs and GPU IDX in the output.
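If you want this for every job currently running on a node, straight from the ControlMachine, a small loop over squeue and scontrol will do it. This is just a sketch: gpu001 is a placeholder node name, and the grep simply picks out the per-node detail lines shown above.

#!/bin/bash
# Print the allocated core IDs and GPU indices for all running jobs on one node.
NODE=gpu001   # placeholder; use one of your node names

for jobid in $(squeue -h -t RUNNING -w "$NODE" -o '%i'); do
    echo "=== Job $jobid ==="
    # -d adds the per-node detail line (CPU_IDs, Mem, GRES with IDX)
    scontrol show job "$jobid" -d | grep 'CPU_IDs='
done

The GRES=gpu:1(IDX:1) part of that line tells you which GPU index the job holds on the node.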
Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser <thomas.zei...@rrze.uni-erlangen.de> wrote:

> Dear All,
>
> we are running Slurm-20.02.6 and using
> "SelectType=select/cons_tres" with
> "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
> and "ProctrackType=proctrack/cgroup". Nodes can be shared between
> multiple jobs with the partition defaults "ExclusiveUser=no
> OverSubscribe=No".
>
> For monitoring purposes, we'd like to know on the ControlMachine
> which cores of a batch node are assigned to a specific job. Is
> there any way (except looking on each batch node itself into
> /sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or
> GPU IDs?
>
> E.g. with Torque we are used to qstat reporting the assigned cores.
> However, with Slurm, even "scontrol show job JOBID" does not seem
> to have any information in that direction.
>
> Which GPU is allocated (in the case of gres/gpu) would of course
> also be interesting to know on the ControlMachine.
>
> Here's the output we get from scontrol show job; it has the node
> name and the number of cores assigned, but not the "core IDs" (e.g.
> 32-63):
>
> JobId=886 JobName=br-14
> UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
> Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
> SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
> AccrueTime=2021-02-04T07:26:51
> StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
> PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
> Partition=a100 AllocNode:Sid=gpu001:1743663
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu001
> BatchHost=gpu001
> NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/var/tmp/slurmd_spool/job00877/slurm_script
> WorkDir=/home/hpc114/run2
> StdErr=/home/hpc114//run2/br-14.o886
> StdIn=/dev/null
> StdOut=/home/hpc114/run2/br-14.o886
> Power=
> TresPerNode=gpu:a100:1
> MailUser=(null) MailType=NONE
>
> Also "scontrol show node" is not helpful:
>
> NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
> CPUAlloc=128 CPUTot=128 CPULoad=4.09
> AvailableFeatures=hwperf
> ActiveFeatures=hwperf
> Gres=gpu:a100:4(S:0-1)
> NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
> OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
> RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1
> State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A
> MCS_label=N/A
> Partitions=a100
> BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
> CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4
> AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> It includes no information on the four currently running jobs, nor
> which share of the allocated node is assigned to the individual
> jobs.
>
> I'd like to see somehow that job 886 got cores 32-63,160-191
> assigned, as seen on the node from /sys/fs/cgroup:
>
> %cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
> 32-63,160-191
>
> Thanks for any ideas!
>
> Thomas Zeiser