Have you checked to make sure your GPUs are in persistence mode?
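Something like this should show the current mode per GPU (a rough sketch from memory; the GPU index 0 is just an example, and you can double-check the field name against nvidia-smi --help-query-gpu):

  # nvidia-smi -i 0 --query-gpu=persistence_mode --format=csv,noheader

More background, and the command to enable it: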
http://docs.nvidia.com/deploy/driver-persistence/

  # nvidia-smi --persistence-mode=1

-------------------
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Tue, 6 Dec 2016, David van Leeuwen wrote:
Hello Jared,

On Tue, Dec 6, 2016 at 4:01 PM, Jared David Baker <jared.ba...@uwyo.edu> wrote:

David, I'd be curious to know if the device files exist (`ls /dev/nvidia*`). On our systems with GPUs, we've seen this error when the /dev/nvidia? files are missing, because either a GPU has failed or the CUDA drivers were not installed properly.

Yes--the device files are there (see the slurmd log above), but also:

  crw-rw-rw- 1 root root 195, 0 Dec 1 15:15 /dev/nvidia0

I am a little further now: the magical command "sinfo -lNe" revealed that the hosts were in a "drain" state---whatever that may be (is there a state diagram somewhere in the docs?), and
http://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state
suggested doing

  scontrol update NodeName=deep-novo-[1-2] State=RESUME

(a few related commands for inspecting a node's state and its recorded reason are sketched at the end of this mail).

So for some (probably good, but to me incomprehensible) reason the drained state was kept over multiple restarts / reconfigures / everything I tried in the past two days. The earlier reported

  Reason=gres/gpu count too low (0 < 1) [root@2016-12-02T15:16:44]

apparently was the reason from before the nodes went into the drained state (I would rather report the current reason as something like "drained because...").

I will now try to add the gpu resources back one config change at a time (roughly along the lines of the config sketch at the end of this mail), and see if slurm can stay in a state where it is actually scheduling jobs.

Cheers,

---david

- Jared

-----Original Message-----
From: David van Leeuwen [mailto:david.vanleeu...@gmail.com]
Sent: Tuesday, December 6, 2016 7:22 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Reason=gres/gpu count too low

Hi,

I have restarted the slurmctld and slurmd several times---in fact, I don't know what the preferred restart order is (I suppose first slurmd and then slurmctld). But all to no avail.

Is there a way I can somehow completely re-init the status of slurm? I've tried "service slurmd startclean" (after fixing a debian typo)---but I keep getting this status

  Reason=gres/gpu count too low (0 < 2) [root@2016-12-02T15:16:44]

with an old date.

Thanks,

---david

On Tue, Dec 6, 2016 at 2:07 PM, Robbert Eggermont <r.eggerm...@tudelft.nl> wrote:

On 06-12-16 10:49, David van Leeuwen wrote:

"gres/gpu count too low (0 < 1)"

Last time I saw this I had to restart the slurmd on that node (a simple scontrol reconfigure was not enough). I guess this message indicates a discrepancy between the number of GPU resources detected by slurmd at startup and the number specified in slurm.conf (and used by slurmctld).

In the end, I even had to restart both slurmd and slurmctld to get the GPUs registered properly (including the CPU specification in gres.conf). (And I then repeated this, just in case the order was important. ;-))

Best,

Robbert

--
David van Leeuwen

--
David van Leeuwen
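For reference, a rough sketch of the commands referred to above for inspecting a drained node and resuming it. sinfo -R lists drained/down nodes together with the recorded reason, scontrol show node prints the full state (including the Reason= field) for a single node, and the scontrol update line clears the drain state. The node names are the ones from the thread; output fields may differ a bit between Slurm versions:

  # sinfo -R
  # scontrol show node deep-novo-1
  # scontrol update NodeName=deep-novo-[1-2] State=RESUME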
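And a minimal sketch of what matching GPU definitions in slurm.conf and gres.conf can look like, along the lines of what Robbert describes. The CPU count, GPU count, device files, and CPU ranges below are made-up examples, not the actual config of these nodes, and both slurmctld and slurmd need a restart after changing them:

  # slurm.conf (identical on the controller and the nodes)
  GresTypes=gpu
  NodeName=deep-novo-[1-2] CPUs=16 Gres=gpu:2 State=UNKNOWN

  # gres.conf (on each GPU node)
  Name=gpu File=/dev/nvidia0 CPUs=0-7
  Name=gpu File=/dev/nvidia1 CPUs=8-15

As far as I understand, slurmctld also has a -c option to start without its previously saved state, but I have not tried whether that also clears an old drain reason.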