While doing some more investigation, I found an interesting situation.

I have a 32-core (2 x 16-core Xeon) node with the 10 RTX cards, where
all 10 cards have affinity to just one socket (cores 0-15, as shown
by 'nvidia-smi topo -m').  The currently running jobs on it are
using 5 GPUs and 15 cores:

# scontrol show node=rtx-04 | grep gres
   CfgTRES=cpu=32,mem=1546000M,billing=99,gres/gpu=10
   AllocTRES=cpu=15,mem=220G,gres/gpu=5

Checking /sys/fs/cgroup, I see these jobs are using cores 0-14:

# grep . /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38409/cpuset.cpus:12-14
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38670/cpuset.cpus:0-2
/sys/fs/cgroup/cpuset/slurm/uid_4181545/job_38673/cpuset.cpus:3-5
/sys/fs/cgroup/cpuset/slurm/uid_5829/job_49088/cpuset.cpus:9-11
/sys/fs/cgroup/cpuset/slurm/uid_8285/job_49048/cpuset.cpus:6-8

If I submit a job to rtx-04 asking for 1 core and 1 GPU, the job runs
with no problem and uses core 15.  And if I then submit more jobs asking
for a GPU, they run fine on cores 16 and up.

Now if I cancel my jobs, so I am back to the jobs using 5 GPUs and 15 cores,
and then submit a job asking for 2 cores and 1 GPU, the job
stays in the Pending state and refuses to run on rtx-04.
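
The test submissions were roughly like this (the exact srun flags here
are illustrative, not my actual submit scripts):

srun -w rtx-04 --gres=gpu:1 -c 1 --pty bash   # runs right away, on core 15
srun -w rtx-04 --gres=gpu:1 -c 2 --pty bash   # sat in Pending before the upgrade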

So before submitting any bug report, I decided to upgrade to the latest SLURM version. I upgraded from 20.02.03 to 20.11.3 (with those jobs still running on rtx-04) and now the problem has gone away. I can submit a 2-core, 1-GPU job and it runs immediately.

So my problem seems fixed, but in the update I noticed a weird thing happen.
Now SLURM insists that the Cores= entries in gres.conf must be set to Cores=0-31,
even though 'nvidia-smi topo -m' still says 0-15.  I decided to just remove
the Cores= setting from /etc/slurm/gres.conf.
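
That leaves the file looking something like this (assuming the
AutoDetect=nvml line quoted further down is kept, and likewise for
/dev/nvidia1 through /dev/nvidia9):

AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0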

So before the update, slurmd.log has:

[2021-01-26T03:07:45.673] Gres Name=gpu Type=quadro_rtx_8000 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-15 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0

and after the update

[2021-01-26T14:31:47.282] Gres Name=gpu Type=quadro_rtx_8000 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0

This is fine with me, as I want SLURM to ignore GPU affinity on these
nodes, but it is curious.



-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Mon, 25 Jan 2021 10:07am, Paul Raines wrote:


I tried submitting jobs with --gres-flags=disable-binding, but
this has not made any difference.  Jobs asking for GPUs are still only
being run if a core defined in gres.conf for the GPU is free.
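
For example, something along these lines (the flags are illustrative,
not my exact scripts) still sits in Pending while only non-affinity
cores are free:

srun -w rtx-04 --gres=gpu:1 -c 2 --gres-flags=disable-binding --pty bash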

Basically, it seems the option is ignored.


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Sun, 24 Jan 2021 11:39am, Paul Raines wrote:

 Thanks Chris.

 I think you have identified the issue here, or are very close.  My
 gres.conf on the rtx-04 node, for example, is:

 AutoDetect=nvml
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8 Cores=0-15
 Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9 Cores=0-15

 There are 32 cores (HT is off), but the daughter card that holds all
 10 of the RTX8000s connects to only one socket, as can be seen from
 'nvidia-smi topo -m'.

 It's odd, though, that my tests on my identically configured
 rtx6000 partition did not show that behavior, but maybe that is
 due to just the "random" cores that got assigned to jobs there
 all having at least one core on the "right" socket.

 Anyway, how do I turn off this "affinity enforcement"?  It is
 more important that a job run with a GPU on its non-affinity socket
 than that it just wait and not run at all.

 Thanks

 -- Paul Raines (http://help.nmr.mgh.harvard.edu)



 On Sat, 23 Jan 2021 3:19pm, Chris Samuel wrote:

  On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrote:

  Now rtx-08, which has only 4 GPUs, seems to always get all 4 used.
  But the others seem to always only get half used (except rtx-07,
  which somehow gets 6 used, so another weird thing).

  Again, if I submit non-GPU jobs, they end up allocating all the
  cores/CPUs on the nodes just fine.

  What does your gres.conf look like for these nodes?

  One thing I've seen in the past is where the core specifications for
  the GPUs are out of step with the hardware, and so Slurm thinks they're
  on the wrong socket.  Then, when all the cores in that socket are used
  up, Slurm won't put more GPU jobs on the node without the jobs
  explicitly asking not to do locality.
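
  A quick way to cross-check is something like this (the grep is just
  for convenience; compare the socket/core layout each one reports):

  scontrol show node=rtx-04 | grep CoresPerSocket   # layout Slurm is using
  slurmd -C                                         # layout slurmd detects on the node
  nvidia-smi topo -m                                # GPU-to-CPU affinity from the driver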

  One thing I've noticed is that prior to Slurm 20.02 the documentation
  for gres.conf used to say:

#   If your cores contain multiple threads only the first thread
#   (processing unit) of each core needs to be listed.
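
  (As an illustration of what that meant, with hypothetical thread
  numbering for an HT-on box where socket 0's cores show up as threads
  0-15 and 32-47, not taken from your nodes:)

  # e.g. list only the first thread of each core, not Cores=0-15,32-47:
  Name=gpu File=/dev/nvidia0 Cores=0-15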

  but that language is gone from 20.02 and later, and the change isn't
  mentioned in the release notes for 20.02, so I'm not sure what happened
  there.  The only clue is this commit:

  https://github.com/SchedMD/slurm/commit/7461b6ba95bb8ae70b36425f2c7e4961ac35799e#diff-cac030b65a8fc86123176971a94062fafb262cb2b11b3e90d6cc69e353e3bb89

  which says "xcpuinfo_abs_to_mac() expects a core list, not a CPU list."

  Best of luck!
  Chris
  --
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA