Hello all,
My error was indeed just the trailing comma in my gres.conf. I was confused
because I had the same file on my running nodes, but that's only because
slurmd had started before the erroneous comma was added to the config.
So the error message was in fact correct: slurmd could not find the file it
was pointed at, because the stray comma had become part of the device path.
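For the archives, the fix was one character. A before/after sketch, borrowing
the node names and core list from Ian's example below (our real names differ):

# broken: slurmd tries to stat the literal path "/dev/nvidia0," (comma included)
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0, Cores=0,2,4,6,8,10,12,14,16,18,20,22
# fixed:
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22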
> On Jul 23, 2018, at 10:31 PM, Ian Mortimer wrote:
>
> On Tue, 2018-07-24 at 02:19, Ryan Novosielski wrote:
>
>> Best off running nvidia-persistenced. Handles all of this stuff as a
>> side effect, and also enables persistence mode, provided you don’t
>> configure it otherwise.
>
> Yes. But you have to ensure it starts before slurmd.
On Tue, 2018-07-24 at 02:19, Ryan Novosielski wrote:
> Best off running nvidia-persistenced. Handles all of this stuff as a
> side effect, and also enables persistence mode, provided you don’t
> configure it otherwise.
Yes. But you have to ensure it starts before slurmd.
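On systemd machines that ordering is one drop-in away. A sketch, assuming the
driver package really installed a unit named nvidia-persistenced.service
(check what yours is called):

# /etc/systemd/system/slurmd.service.d/order-gpu.conf
[Unit]
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service

Then 'systemctl daemon-reload' and restart slurmd.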
--
Ian
Hi Alex,
What's the actual content of your gres.conf file? Seems to me that you have
a trailing comma after the location of the nvidia device.
Our gres.conf has:
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0
Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu
On Mon, 2018-07-23 at 15:59 -0700, Alex Chekholko wrote:
> However, in this case both 'nvidia-smi' and 'nvidia-smi -L' run just
> fine and produce expected output.
They will, because running nvidia-smi itself triggers loading of the kernel
module and creation of the device files. But are the device files there
before nvidia-smi has been run?
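A quick way to check on a freshly booted node, before anything has touched
the GPUs:

# device nodes should exist only if something has already initialised the driver
ls -l /dev/nvidia*
# "No such file or directory" here typically means nothing has created them
# yet, so slurmd's stat of the File= entries in gres.conf will fail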
Subject: Re: [slurm-users] "fatal: can't stat gres.conf"
Thanks for the suggestion; if my memory serves me right, I had to do that
previously to cause the drivers to load correctly after boot.
However, in this case both 'nvidia-smi' and 'nvidia-smi -L' run just fine and
produce expected output.
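For the record, the workaround I half-remember was a oneshot unit that runs
nvidia-smi before slurmd, so the device nodes exist by the time slurmd starts.
A sketch, with a made-up unit name and an assumed nvidia-smi path, in case it
helps anyone searching later:

# /etc/systemd/system/nvidia-dev-nodes.service  (hypothetical name)
[Unit]
Description=Initialise NVIDIA driver and create /dev/nvidia* nodes
Before=slurmd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi

[Install]
WantedBy=multi-user.target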
Subject: [slurm-users] "fatal: can't stat gres.conf"
Hi all,
I have a few working GPU compute nodes. I bought a couple more identical
nodes. They are all diskless, so they all boot from the same disk image.
For some reason slurmd refuses to start on the new nodes, and I'm not able
to find any differences in hardware or software. Google has not been much
help so far.
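A note for anyone reproducing this: running slurmd in the foreground with
extra verbosity is the easiest way to see the full fatal (standard slurmd
flags):

# run slurmd in the foreground; repeat -v for more detail
slurmd -D -vvv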