Have you checked to make sure your GPUs are in persistence mode?

http://docs.nvidia.com/deploy/driver-persistence/

# nvidia-smi --persistence-mode=1
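
You can check the current setting with plain nvidia-smi, e.g. (the exact
wording of the output varies a bit between driver versions):

# nvidia-smi -q | grep -i persistence

Note that persistence mode set this way does not survive a reboot; recent
drivers also ship the nvidia-persistenced daemon for a persistent setup.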

-------------------
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Tue, 6 Dec 2016, David van Leeuwen wrote:


Hello Jared,

On Tue, Dec 6, 2016 at 4:01 PM, Jared David Baker <jared.ba...@uwyo.edu> wrote:
David,

I'd be curious to know if the device files exist (`ls /dev/nvidia*`). On our
systems with GPUs, we've seen this error when the /dev/nvidia? files are
missing, either because a GPU has failed or because the CUDA drivers were not
installed properly.

Yes, the device files are there (see the slurmd log above), for example:

crw-rw-rw- 1 root root 195,   0 Dec  1 15:15 /dev/nvidia0

I am a little further now: the magical command "sinfo -lNe" revealed
that the hosts were in a "drain" state (whatever that may be; is
there a state diagram somewhere in the docs?), and
http://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state
suggested doing a

scontrol update NodeName=deep-novo-[1-2] State=RESUME
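
(Something like

scontrol show node deep-novo-1 | grep -iE 'state|reason'

also seems to show the current state together with the stored reason, if I
read the output correctly.)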

So, for some (probably good, but to me incomprehensible) reason, the
drained state was kept across multiple restarts / reconfigures /
everything else I tried over the past two days.  The earlier reported

  Reason=gres/gpu count too low (0 < 1) [root@2016-12-02T15:16:44]

apparently was the reason from before the nodes entered the drained state
(as the current reason I would rather expect something like "drained
because...").

I will now try to add the GPU resources back one config change at a time,
and see if slurm can stay in a state where it is actually scheduling jobs.
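
For reference, the lines I am putting back are roughly of this shape (the
gpu count and the rest of the NodeName line are of course specific to our
machines):

in slurm.conf:

GresTypes=gpu
NodeName=deep-novo-[1-2] Gres=gpu:1 ...

and in gres.conf on each node:

Name=gpu File=/dev/nvidia0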

Cheers,

---david


- Jared

-----Original Message-----
From: David van Leeuwen [mailto:david.vanleeu...@gmail.com]
Sent: Tuesday, December 6, 2016 7:22 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Reason=gres/gpu count too low


Hi,

I have restarted the slurmctld and slurmd several times---in fact, I don't know 
what the preferred order of restart is (I suppose first slurmd and then 
slurmctld).  But all to no avail.

Is there a way I can somehow completely re-init the status of slurm?
I've tried "service slurmd startclean" (after fixing a typo in the Debian init
script), but I keep getting this status

Reason=gres/gpu count too low (0 < 2) [root@2016-12-02T15:16:44]

with an old date.
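
(What I was half hoping for is some kind of cold start, e.g. starting
slurmctld with its -c option so that it ignores the saved node state, but I
don't know whether that is the intended way to do this.)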

Thanks,

---david

On Tue, Dec 6, 2016 at 2:07 PM, Robbert Eggermont <r.eggerm...@tudelft.nl> 
wrote:

On 06-12-16 10:49, David van Leeuwen wrote:

"gres/gpu count too low (0 < 1)"


Last time I saw this I had to restart the slurmd on that node (a
simple scontrol reconfigure was not enough).

I guess this message indicates a discrepancy between the number of GPU
resources detected by slurmd at startup, and the number specified in
the slurm.conf (and used by slurmctld).
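
(One way to compare the two sides is, I think, to look at the Gres= field in
"scontrol show node <nodename>" on the controller, versus what slurmd reports
about gres at startup when run in the foreground with extra verbosity
("slurmd -D -vvv") on the node.)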

In the end, I even had to restart both slurmd and slurmctld to get the
GPUs registered properly (including the CPU specification in the
gres.conf). (And I then repeated this, just in case the order was
important. ;-) )
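
For what it's worth, the gres.conf entries I mean are of this general shape
(device files and core ranges are of course site-specific):

Name=gpu File=/dev/nvidia0 CPUs=0-7
Name=gpu File=/dev/nvidia1 CPUs=8-15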

Best,

Robbert



--
David van Leeuwen



--
David van Leeuwen
