Re: [slurm-users] How should I configure a node with Autodetect=nvml?
On Tuesday, 11 February 2020 7:27:56 AM PST Dean Schulze wrote:

> No other errors in the logs. Identical slurm.conf on all nodes and
> controller. Only the node with gpus has the gres.conf (with the single
> line Autodetect=nvml).

It might be useful to post the output of "slurmd -C" and your slurm.conf for us to see (sorry if you've done that already and I've not seen it).

You can also increase the debug level for slurmctld and slurmd in slurm.conf (we typically run with SlurmctldDebug=debug, you may want to try SlurmdDebug=debug whilst experimenting).

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
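
The two debug settings mentioned above would look roughly like this in slurm.conf. This is a sketch of only those two lines, not a complete configuration, and it assumes the daemons are restarted afterwards so the new levels take effect.

```
# slurm.conf (same file on the controller and the nodes)
# Controller daemon (slurmctld) log verbosity
SlurmctldDebug=debug
# Node daemon (slurmd) log verbosity, useful whilst experimenting
SlurmdDebug=debug
```

Running "slurmd -C" on the GPU node prints the hardware configuration that slurmd itself detects, which can then be compared against the NodeName line in slurm.conf.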
Re: [slurm-users] How should I configure a node with Autodetect=nvml?
No other errors in the logs. Identical slurm.conf on all nodes and controller. Only the node with gpus has the gres.conf (with the single line Autodetect=nvml).

I got this error to stop by removing the Gres=gpu:gp100:2 from the NodeName line in the controller and the node and removing the gres.conf from the node.

On Mon, Feb 10, 2020 at 11:41 PM Chris Samuel wrote:

> On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote:
>
> > With this configuration I get this message every second in my
> > slurmctld.log file:
> >
> > error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument
>
> What other errors are in the logs?
>
> Could you check that you've got identical slurm.conf and gres.conf files
> everywhere?
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] How should I configure a node with Autodetect=nvml?
On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote:

> With this configuration I get this message every second in my
> slurmctld.log file:
>
> error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument

What other errors are in the logs?

Could you check that you've got identical slurm.conf and gres.conf files everywhere?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] How should I configure a node with Autodetect=nvml?
In the gres.conf on one of my nodes I have just the line

    Autodetect=nvml

as in the last example in https://slurm.schedmd.com/gres.conf.html.

In the slurm.conf on all nodes I have this line for the node with Autodetect=nvml

    NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=47671 Gres=gpu:gp100:4

since that node can have up to 4 gpus dynamically assigned. Without the Gres=gpu:gp100:4 I can't run any job that requires a gpu even if I dynamically assign gpus on that node. Apparently Autodetect=nvml isn't enough to let the controller know that there are gpus available on that node.

With this configuration I get this message every second in my slurmctld.log file:

    error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument

I've restarted both slurmd and slurmctld and still get the error. That node also stays in the drain state no matter what I do with it. Apparently slurm doesn't like this configuration.

What is the right way to configure a node with Autodetect=nvml?
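
For readers landing on this thread, here is a minimal sketch of how the two files described above are usually laid out together. It is not a confirmed fix for this report. It assumes Slurm was built with NVML support, that slurm.conf declares GresTypes=gpu (a line the post does not show), and that the Gres count on the NodeName line equals the number of GPUs actually present when slurmd starts. The count of 2 is an assumption taken from the follow-up message in this thread, which mentions Gres=gpu:gp100:2. Whether the type string gp100 matches the name NVML reports for these cards is not established anywhere in the thread.

```
# slurm.conf (identical on the controller and every node)
# Assumption: GresTypes is not shown in the original post
GresTypes=gpu
# Assumption: 2 GPUs are physically installed; the post configured 4
NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=47671 Gres=gpu:gp100:2

# gres.conf (only on slurmnode1, as in the post)
Autodetect=nvml
```

If the configured count is higher than what slurmd detects, that would be one plausible reason for the node to stay drained, so comparing the NodeName line against what slurmd logs at startup with a raised debug level seems a reasonable first check.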