Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-11 Thread Chris Samuel
On Tuesday, 11 February 2020 7:27:56 AM PST Dean Schulze wrote:
> No other errors in the logs. Identical slurm.conf on all nodes and
> controller. Only the node with gpus has the gres.conf (with the single
> line Autodetect=nvml).

It might be useful to post the output of "slurmd -C" and your sl

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
This is still happening. Nodes are being drained after a "Kill task failed" error. Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?

[2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill task failed
[2020-02-11T12:21:26.006] update_node: node node001 state set to DR

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
> Usually means you updated the slurm.conf but have not done "scontrol
> reconfigure" yet.

Well it turns out it was something else related to a Bright Computing setting. In case anyone finds this thread in the future:

[ourcluster->category[gpucategory]->roles]% use slurmclient
[ourcluster->cate

Re: [slurm-users] Anyone have success with Nvidia Jetson nano

2020-02-11 Thread Yuengling, Philip J.
Christopher,

I've been using Slurm on a small Jetson Nano cluster for testing. However, one important thing to keep in mind is that the Jetson Nano is a Tegra platform, and there is no NVML. Therefore GPU management through gres.conf may be a challenge.

Phil Yuengling
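
For readers landing here: without NVML, Autodetect=nvml is not an option, so the GPU would have to be declared by hand. The fragment below is only a minimal sketch of a count-only GPU gres, not a configuration taken from this thread. The node name nano01 is hypothetical, other NodeName parameters (CPUs, memory) are omitted, and whether jobs are actually confined to the GPU on Tegra is not addressed.

```
# slurm.conf (controller and nodes); nano01 is a hypothetical node name
GresTypes=gpu
NodeName=nano01 Gres=gpu:1

# gres.conf on the Jetson Nano node: no AutoDetect line, GPU declared manually
Name=gpu Count=1
```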

Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-11 Thread Dean Schulze
No other errors in the logs. Identical slurm.conf on all nodes and controller. Only the node with gpus has the gres.conf (with the single line Autodetect=nvml). I got this error to stop by removing the Gres=gpu:gp100:2 from the NodeName line in the controller and the node and removing the gres.c
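
For reference, the setup described in this message (before the Gres entry was removed) would look roughly like the fragment below. This is a reconstruction from the description only: the node name gpunode01 is hypothetical, the GresTypes line is assumed because it is normally required for gres to be recognised, and all other NodeName parameters are left out.

```
# slurm.conf (identical on the controller and all nodes)
# GresTypes is an assumption; gpunode01 is a hypothetical node name
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:gp100:2

# gres.conf (only on the GPU node), the single line mentioned above
Autodetect=nvml
```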