Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-11 Thread Chris Samuel
On Tuesday, 11 February 2020 7:27:56 AM PST Dean Schulze wrote:

> No other errors in the logs.  Identical slurm.conf on all nodes and
> controller.  Only the node with gpus has the gres.conf (with the single
> line Autodetect=nvml).

It might be useful to post the output of "slurmd -C" and your slurm.conf for 
us to see (sorry if you've done that already and I've not seen it).

You can also increase the debug level for slurmctld and slurmd in slurm.conf 
(we typically run with SlurmctldDebug=debug, you may want to try 
SlurmdDebug=debug whilst experimenting).
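
As a rough sketch, the relevant slurm.conf lines would look something like the following. The level shown is just the one mentioned above, not a recommendation for production:

```
# slurm.conf: raise logging verbosity while troubleshooting
SlurmctldDebug=debug   # controller daemon (slurmctld) log level
SlurmdDebug=debug      # compute node daemon (slurmd) log level
```

The daemons need to re-read slurm.conf (for example by restarting them) before the new levels take effect, and it is probably worth lowering them again once the problem is found.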

Best of luck,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-11 Thread Dean Schulze
No other errors in the logs.  Identical slurm.conf on all nodes and
controller.  Only the node with gpus has the gres.conf (with the single
line Autodetect=nvml).

I got this error to stop by removing the Gres=gpu:gp100:2 from the NodeName
line in the controller and the node and removing the gres.conf from the
node.


On Mon, Feb 10, 2020 at 11:41 PM Chris Samuel wrote:

> On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote:
>
> > With this configuration I get this message every second in my slurmctld.log
> > file:
> >
> > error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument
>
> What other errors are in the logs?
>
> Could you check that you've got identical slurm.conf and gres.conf files
> everywhere?
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-10 Thread Chris Samuel
On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote:

> With this configuration I get this message every second in my slurmctld.log
> file:
> 
> error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument

What other errors are in the logs?

Could you check that you've got identical slurm.conf and gres.conf files 
everywhere?

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-10 Thread Dean Schulze
In the gres.conf on one of my nodes I have just the line

Autodetect=nvml

as in the last example in https://slurm.schedmd.com/gres.conf.html.

In the slurm.conf on all nodes I have this line for the node with
Autodetect=nvml

NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8
ThreadsPerCore=2 RealMemory=47671 Gres=gpu:gp100:4

since that node can have up to 4 gpus dynamically assigned.  Without the
Gres=gpu:gp100:4 I can't run any job that requires a gpu even if I
dynamically assign gpus on that node.  Apparently Autodetect=nvml isn't
enough to let the controller know that there are gpus available on that
node.

With this configuration I get this message every second in my slurmctld.log
file:

error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument

I've restarted both slurmd and slurmctld and still get the error.  That
node also stays in the drain state no matter what I do with it.  Apparently
slurm doesn't like this configuration.

What is the right way to configure a node with Autodetect=nvml?
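
For comparison, here is a hedged sketch of how the pieces described above are usually expected to fit together. It is not a confirmed fix for the error in this thread. The GresTypes line is an assumption (the post does not show whether it is present), and the sketch assumes that slurmd was built with NVML support and that the node really has the number of GPUs listed.

```
# slurm.conf (identical on the controller and all nodes)
GresTypes=gpu   # assumption: not shown in the original post
# NodeName entry from the post, written on a single line
NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=47671 Gres=gpu:gp100:4

# gres.conf (only on slurmnode1)
Autodetect=nvml
```

As far as I know, the Gres count in slurm.conf has to agree with what the node reports when it registers. It may be worth comparing the configured count with the number of GPUs listed by "nvidia-smi -L" on slurmnode1 at the time slurmd starts.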