Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-26 Thread Paul Raines



While doing more investigation I found an interesting situation.

I have a 32 core (2 x 16 core Xeon) node with the 10 RTX cards, where
all 10 cards have affinity to just one socket (cores 0-15, as shown
by 'nvidia-smi topo -m').  The currently running jobs on it are
using 5 GPUs and 15 cores:

# scontrol show node=rtx-04 | grep gres
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=15,mem=220G,gres/gpu=5

Checking /sys/fs/cgroup I see these jobs are using cores 0-14

# grep . /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38409/cpuset.cpus:12-14
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38670/cpuset.cpus:0-2
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38673/cpuset.cpus:3-5
/sys/fs/cgroup/cpuset/slurm/uid_5829/job_49088/cpuset.cpus:9-11
/sys/fs/cgroup/cpuset/slurm/uid_8285/job_49048/cpuset.cpus:6-8

If I submit a job to rtx-04 asking for 1 core and 1 GPU, the job runs
with no problem and uses core 15.  And if I then submit more jobs asking
for a GPU, they run fine on cores 16 and up.

Now if I cancel my jobs, so I am back to the jobs using 5 GPUs and 15 cores,
and then submit a job asking for 2 cores and 1 GPU, the job
stays in the Pending state and refuses to run on rtx-04.
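
For reference, the test submissions were roughly like this (the memory and
time values here are just placeholders, not the exact ones I used):

   # 1 core + 1 GPU on rtx-04: starts right away and gets core 15
   sbatch -p rtx8000 -w rtx-04 -G 1 -c 1 --mem=8G -t 1:00:00 --wrap='sleep 3600'

   # 2 cores + 1 GPU on rtx-04: just stays pending instead of running
   sbatch -p rtx8000 -w rtx-04 -G 1 -c 2 --mem=8G -t 1:00:00 --wrap='sleep 3600'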

So before submitting any bug report I decided to upgrade to the latest SLURM 
version.  I upgraded from 20.02.03 to 20.11.3 (with those jobs still running 
on rtx-04) and now the problem has gone away.  I can submit a 2 core and 1 GPU 
job and it runs immediately.


So my problem seems fixed, but during the update I noticed a weird thing.
Now SLURM insists that the Cores in gres.conf must be set to Cores=0-31
even though 'nvidia-smi topo -m' still says 0-15.  I decided to just remove
the Cores= setting from /etc/slurm/gres.conf.
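
That leaves gres.conf entries of the form below, i.e. the same lines as
before (quoted further down this thread) minus the Cores= field:

   AutoDetect=nvml
   Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
   Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
   ...
   Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9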

So before the update slurmd.log has:

[2021-01-26T03:07:45.673] Gres Name=gpu Type=quadro_rtx_8000 Count=1 Index=0 
ID=7696487 File=/dev/nvidia0 Cores=0-15 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0


and after the update

[2021-01-26T14:31:47.282] Gres Name=gpu Type=quadro_rtx_8000 Count=1 Index=0 
ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0


This is fine with me, as I want SLURM to ignore GPU affinity on these nodes,
but it is curious.



-- Paul Raines (http://help.nmr.mgh.harvard.edu)



Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-25 Thread Paul Raines



I tried submitting jobs with --gres-flags=disable-binding but
this has not made any difference.  Jobs asking for GPUs are still only
being run if a core defined in gres.conf for the GPU is free.

Basically it seems the option is ignored.
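
The submissions were along these lines (the resource numbers here are only
placeholders):

   sbatch -p rtx8000 -G 1 -c 3 --mem=24G -t 1:00:00 \
  --gres-flags=disable-binding --wrap='sleep 3600'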


-- Paul Raines (http://help.nmr.mgh.harvard.edu)




Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-24 Thread Paul Raines

Thanks Chris.

I think you have identified the issue here or are very close.  My gres.conf on
the rtx-04 node for example is:

AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8 Cores=0-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9 Cores=0-15

 There are 32 cores (HT is off).  But the daughter card that holds all
 10 of the RTX8000s connects to only one socket, as can be seen from
 'nvidia-smi topo -m'.

 It's odd, though, that my tests on my identically configured
 rtx6000 partition did not show that behavior, but maybe that is
 due to just the "random" cores that got assigned to jobs there
 all having at least one core on the "right" socket.

 Anyway, how do I turn off this "affinity enforcement", as it is
 more important that a job run with a GPU on its non-affinity socket
 than just wait and not run at all?

Thanks

-- Paul Raines (http://help.nmr.mgh.harvard.edu)




Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-23 Thread Chris Samuel
On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrote:

> Now rtx-08 which has only 4 GPUs seems to always get all 4 used.
> But the others seem to always only get half used (except rtx-07
> which somehow gets 6 used, so another weird thing).
> 
> Again, if I submit non-GPU jobs, they end up allocating all the
> cores/cpus on the nodes just fine.

What does your gres.conf look like for these nodes?

One thing I've seen in the past is where the core specifications for the GPUs 
are out of step with the hardware and so Slurm thinks they're on the wrong 
socket.  Then when all the cores in that socket are used up Slurm won't put 
more GPU jobs on the node without the jobs explicitly asking to not do 
locality.
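
A quick sanity check (just a sketch of the idea) is to compare what the
hardware reports with what Slurm has been told, on the node itself:

   nvidia-smi topo -m                  # CPU affinity reported for each GPU
   slurmd -C                           # socket/core layout slurmd detects
   grep Cores= /etc/slurm/gres.conf    # what the GRES config claims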

One thing I've noticed is that prior to Slurm 20.02 the documentation for 
gres.conf used to say:

# If your cores contain multiple threads only the first thread
# (processing unit) of each core needs to be listed.

but that language is gone from 20.02 and later, and the change isn't mentioned 
in the release notes for 20.02, so I'm not sure what happened there; the only 
clue is this commit:

https://github.com/SchedMD/slurm/commit/7461b6ba95bb8ae70b36425f2c7e4961ac35799e#diff-cac030b65a8fc86123176971a94062fafb262cb2b11b3e90d6cc69e353e3bb89

which says "xcpuinfo_abs_to_mac() expects a core list, not a CPU list."

Best of luck!
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-23 Thread Paul Raines

Yes, I meant job 38692.  Sorry.

I am still having the problem.  I suspect it has something to do with
the GPU configuration as this does not happen on my non-GPU node partitions.
Also, if I submit non-GPU jobs to the rtx8000 partition here, they
use up all the cores on the nodes just fine.

The upshot is that on my 10-GPU nodes I never see more than 6 GPUs in use,
and jobs asking for just 1 or 2 GPUs are made to wait in the queue.

Here is an example.  The state of the nodes in the rtx8000 queue before
I queue jobs:

rtx-04
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=15,mem=120G,gres/gpu=5
rtx-05
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=15,mem=328G,gres/gpu=5
rtx-06
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=15,mem=224G,gres/gpu=5
rtx-07
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=16,mem=232G,gres/gpu=6
rtx-08
   CfgTRES=cpu=32,mem=1546000M,billing=81,gres/gpu=4

I then submit 10 jobs.  Then the queue for rtx8000 is:

NODELIST     JOBID  PARTITION  ST  TIME_LIMIT  TRES_ALLOC            TRES_PER
rtx-04       40365  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-04       38676  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-04       38673  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-04       38670  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-04       38409  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-05       40214  rtx8000    R   6-10:00:00  cpu=3,mem=128G,node=  gpu:1
rtx-05       38677  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-05       38674  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-05       37450  rtx8000    R   6-10:00:00  cpu=3,mem=128G,node=  gpu:1
rtx-05       37278  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-06       40366  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-06       40364  rtx8000    R   6-10:00:00  cpu=3,mem=128G,node=  gpu:1
rtx-06       38648  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-06       38646  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-06       37267  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-07       40760  rtx8000    R   50:00       cpu=4,mem=32G,node=1  gpu:2
rtx-07       38675  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-07       38672  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-07       38671  rtx8000    R   7-00:00:00  cpu=3,mem=24G,node=1  gpu:1
rtx-07       37451  rtx8000    R   6-10:00:00  cpu=3,mem=128G,node=  gpu:1
rtx-08       40785  rtx8000    R   50:00       cpu=4,mem=32G,node=1  gpu:2
rtx-08       40786  rtx8000    R   50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40794  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40793  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40792  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40791  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40790  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40789  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Priority)   40788  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2
(Resources)  40787  rtx8000    PD  50:00       cpu=4,mem=32G,node=1  gpu:2

[root@mlsc-head ~]# scontrol show job=40787
JobId=40787 JobName=sjob_5
   UserId=raines(5829) GroupId=raines(5829) MCS_label=N/A
   Priority=19836243 Nice=0 Account=sysadm QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:50:00 TimeMin=N/A
   SubmitTime=2021-01-23T12:37:51 EligibleTime=2021-01-23T12:37:51
   AccrueTime=2021-01-23T12:37:51
   StartTime=2021-01-23T13:08:52 EndTime=2021-01-23T13:58:52 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-23T12:38:36
   Partition=rtx8000 AllocNode:Sid=mlsc-head:1268664
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=rtx-07
   NumNodes=1-2 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=32G,node=1,billing=11,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/autofs/cluster/batch/raines/sjob_5
   WorkDir=/autofs/cluster/batch/raines
   StdErr=/autofs/cluster/batch/raines/sjob_5.err40787
   StdIn=/dev/null
   StdOut=/autofs/cluster/batch/raines/sjob_5.out40787
   Power=
   TresPerJob=gpu:2
   MailUser=(null) MailType=NONE


[root@mlsc-head ~]# scontrol show node=rtx-04
NodeName=rtx-04 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=15 CPUTot=32 CPULoad=18.21
   AvailableFeatures=intel,cascade,rtx8000
   ActiveFeatures=intel,cascade,rtx8000
   Gres=gpu:quadro_rtx_8000:10(S:0)
   NodeAddr=rtx-04 NodeHostName=rtx-04 Version=20.02.3
   OS=Linux 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Thu Oct 22 00:20:22 UTC 2020
   RealMemory=1546000 

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-21 Thread Williams, Gareth (IM, Black Mountain)
I think job 38687 *is* being run on the rtx-06 node.
I think you mean to ask why job 38692 is not being run on the rtx-06 node (the 
top-priority pending job).

I can't see the problem... This (and other info) does seem to indicate that 
there is enough resource for the extra job:
CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
AllocTRES=cpu=16,mem=143G,gres/gpu=5

If I were debugging this, I'd submit some test jobs that just request resources 
and sleep, and look at whether a node ever allocates more than 16 cores/cpus or 5 
gpus.
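
For example, something along these lines (partition name and sizes are only 
illustrative):

   for i in $(seq 1 10); do
 sbatch -p rtx8000 -G 1 -c 2 --mem=8G -t 30:00 --wrap='sleep 1200'
   done
   # then watch what each node actually hands out
   squeue -p rtx8000 --states=R -O "NodeList:10 ,JobID:.8 ,tres-alloc"
   scontrol show node rtx-06 | grep AllocTRES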

Maybe the answer is in the comprehensive info you posted and someone will see 
the gem. Not me, sorry.

Gareth


[slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-21 Thread Paul Raines



I am at the beginning of setting up my first SLURM cluster and
I am trying to understand why jobs are pending when resources are available.

These are the pending jobs:

# squeue -P --sort=-p,i --states=PD -O "JobID:.12 ,Partition:9 ,StateCompact:2 
,Priority:.12 ,ReasonList"

   JOBID PARTITION ST PRIORITY NODELIST(REASON)
   38692 rtx8000   PD 0.0046530945 (Resources)
   38693 rtx8000   PD 0.0046530945 (Priority)
   38694 rtx8000   PD 0.0046530906 (Priority)
   38695 rtx8000   PD 0.0046530866 (Priority)
   38696 rtx8000   PD 0.0046530866 (Priority)
   38697 rtx8000   PD 0.208867 (Priority)

The job at the top is as follows:

Submission command line:

  sbatch -p rtx8000 -G 1 -c 4 -t 12:00:00 --mem=47G \
   -o /cluster/batch/iman/%j.out --wrap='cmd .'

# scontrol show job=38692
JobId=38692 JobName=wrap
   UserId=iman(8084) GroupId=iman(8084) MCS_label=N/A
   Priority=19989863 Nice=0 Account=imanlab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2021-01-21T13:05:02 EligibleTime=2021-01-21T13:05:02
   AccrueTime=2021-01-21T13:05:02
   StartTime=2021-01-22T01:05:02 EndTime=2021-01-22T13:05:02 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-21T14:04:32
   Partition=rtx8000 AllocNode:Sid=mlsc-head:974529
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=rtx-06
   NumNodes=1-1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=47G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=47G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/autofs/homes/008/iman
   StdErr=/cluster/batch/iman/38692.out
   StdIn=/dev/null
   StdOut=/cluster/batch/iman/38692.out
   Power=
   TresPerJob=gpu:1
   MailUser=(null) MailType=NONE

This node shows it has enough free resources (cpu, mem, gpus) for
the job in the partition:

# scontrol show node=rtx-06
NodeName=rtx-06 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=16 CPUTot=32 CPULoad=5.77
   AvailableFeatures=intel,cascade,rtx8000
   ActiveFeatures=intel,cascade,rtx8000
   Gres=gpu:quadro_rtx_8000:10(S:0)
   NodeAddr=rtx-06 NodeHostName=rtx-06 Version=20.02.3
   OS=Linux 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Thu Oct 22 00:20:22 UTC 2020
   RealMemory=1546000 AllocMem=146432 FreeMem=1420366 Sockets=2 Boards=1
   MemSpecLimit=2048
   State=MIXED ThreadsPerCore=1 TmpDisk=600 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=rtx8000
   BootTime=2020-12-30T10:35:34 SlurmdStartTime=2020-12-30T10:37:21
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=16,mem=143G,gres/gpu=5
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

# squeue --partition=rtx8000 --states=R -O "NodeList:10 ,JobID:.8 
,Partition:10,tres-alloc,tres-per-job" -w rtx-06

NODELIST   JOBID  PARTITION  TRES_ALLOC            TRES_PER_JOB
rtx-06     38687  rtx8000    cpu=4,mem=47G,node=1  gpu:1
rtx-06     37267  rtx8000    cpu=3,mem=24G,node=1  gpu:1
rtx-06     37495  rtx8000    cpu=3,mem=24G,node=1  gpu:1
rtx-06     38648  rtx8000    cpu=3,mem=24G,node=1  gpu:1
rtx-06     38646  rtx8000    cpu=3,mem=24G,node=1  gpu:1

In case this is needed:

# scontrol show part=rtx8000
PartitionName=rtx8000
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=04:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=rtx-[04-08]
   PriorityJobFactor=1 PriorityTier=4 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=160 TotalNodes=5 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.24,Mem=0.02G,Gres/gpu=3.0


Scheduling parameters from slurm.conf are:

EnforcePartLimits=ALL
LaunchParameters=mem_sort,slurmstepd_memlock_all,test_exec
MaxJobCount=30
MaxArraySize=1
DefMemPerCPU=10240
DefCpuPerGPU=1
DefMemPerGPU=10240
GpuFreqDef=medium
CompleteWait=0
EpilogMsgTime=300
InactiveLimit=60
KillWait=30
UnkillableStepTimeout=180
ResvOverRun=UNLIMITED
MinJobAge=600
Waittime=5
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

SchedulerParameters=\
default_queue_depth=1500,\
partition_job_depth=10,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_part=0,\
bf_max_job_user=10,\
bf_max_job_test=10,\
bf_max_job_start=1000,\
bf_ignore_newly_avail_nodes,\
enable_user_top,\
pack_serial_at_end,\
nohold_on_prolog_fail,\