On Tuesday, 11 February 2020 7:27:56 AM PST Dean Schulze wrote:
> No other errors in the logs. Identical slurm.conf on all nodes and
> controller. Only the node with gpus has the gres.conf (with the single
> line Autodetect=nvml).
It might be useful to post the output of "slurmd -C" and your sl
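For reference, a minimal sketch of the configuration under discussion; the node name is taken from the log below and the GPU type from later in the thread, but the other parameters are placeholders, not from this thread:

```
# gres.conf on the GPU node (the single line mentioned above):
# slurmd queries NVML to discover the devices
AutoDetect=nvml

# slurm.conf (identical on controller and nodes) -- hypothetical node line
GresTypes=gpu
NodeName=node001 Gres=gpu:gp100:2 State=UNKNOWN
```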
This is still happening. Nodes are being drained after a "Kill task failed" error.
Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?
[2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill task failed
[2020-02-11T12:21:26.006] update_node: node node001 state set to DR
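Once the underlying cause is addressed, a node drained this way can usually be returned to service by hand; a typical command (node name taken from the log above) is:

```
# Clear the DRAIN state set by the "Kill task failed" event
scontrol update NodeName=node001 State=RESUME
```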
>
> Usually means you updated the slurm.conf but have not done "scontrol
> reconfigure" yet.
>
Well it turns out it was something else related to a Bright Computing
setting. In case anyone finds this thread in the future:
[ourcluster->category[gpucategory]->roles]% use slurmclient
[ourcluster->cate
Christopher,
I've been using Slurm on a small Jetson Nano cluster for testing. However, one
important thing to keep in mind is that the Jetson Nano is a Tegra platform,
and there is no NVML. Therefore GPU management through gres.conf may be a
challenge.
Phil Yuengling
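On platforms where NVML is unavailable, AutoDetect cannot be used, but GPUs can still be declared by hand in gres.conf; a minimal sketch (the node name is an assumption, not from this thread) is:

```
# gres.conf on a Jetson node -- no AutoDetect; declare the gres explicitly
NodeName=nano01 Name=gpu Count=1
```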
No other errors in the logs. Identical slurm.conf on all nodes and
controller. Only the node with gpus has the gres.conf (with the single
line Autodetect=nvml).
I got this error to stop by removing the Gres=gpu:gp100:2 from the NodeName
line in the controller and the node and removing the gres.c
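The workaround described amounts to dropping the Gres token from the node definition on both sides; schematically (parameters other than the Gres token are placeholders):

```
# Before: node advertises two gp100 GPUs
NodeName=node001 Gres=gpu:gp100:2 State=UNKNOWN
# After: Gres removed from the NodeName line, gres.conf deleted
NodeName=node001 State=UNKNOWN
```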