In case your ARP cache is the problem, there is some advice on this Wiki
page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
I think there are other possible causes for ReqNodeNotAvail, for example,
the node being allocated to other jobs. The "scontrol show node" and
"scontrol show job" commands should reveal more details.
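
A minimal sketch of that check, assuming you know the ID of a pending job and the name of a node it is waiting for. The helper name show_job_and_node and the example arguments are only illustrative; the two scontrol subcommands are the standard ones.

```shell
# Print the scheduler's view of one pending job and one node it requested.
# Usage: show_job_and_node <jobid> <nodename>   (hypothetical helper name)
show_job_and_node() {
    jobid=$1
    node=$2
    # Job view: check JobState, Reason, ReqNodeList and ExcNodeList.
    scontrol show job "$jobid"
    # Node view: check State, Reason, and allocated resources such as CPUAlloc.
    scontrol show node "$node"
}
```

For example, "show_job_and_node 1234 gpu01" (made-up job ID and node name). If the node's State is not plain IDLE there, or a Reason is set on it, that is probably the place to dig further.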
/Ole
On 11-07-2020 06:00, mercan wrote:
Hi Janna;
It sounds like an ARP cache table problem to me. If your Slurm head node
can reach ~1000 or more network devices (all connected network
cards, switches etc., even if they are reachable through different ports
of the server), you need to increase some network settings on the head
node and on any servers that can reach the same number of network devices:
http://docs.adaptivecomputing.com/torque/5-0-3/Content/topics/torque/12-appendices/otherConsiderations.htm
There is also some advice for big clusters in the Slurm documentation:
https://slurm.schedmd.com/big_sys.html
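
To see whether the ARP table is actually near its limit, something like the sketch below may help. It assumes a Linux host where /proc/net/arp and the net.ipv4.neigh.default.gc_thresh3 setting are available; the function name arp_usage and its optional path arguments are only illustrative.

```shell
# Compare the number of IPv4 neighbour (ARP) entries with the kernel's hard limit.
# Both paths can be overridden by arguments, which is mainly useful for testing.
arp_usage() {
    table=${1:-/proc/net/arp}
    limit_file=${2:-/proc/sys/net/ipv4/neigh/default/gc_thresh3}
    # /proc/net/arp has one header line followed by one line per entry.
    entries=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$table")
    limit=$(cat "$limit_file")
    echo "entries=$entries gc_thresh3=$limit"
}
```

If the entry count is close to gc_thresh3 (commonly 1024 by default), the settings to raise are net.ipv4.neigh.default.gc_thresh1, gc_thresh2 and gc_thresh3, for example in a file under /etc/sysctl.d/. Suitable values depend on the size of your network; the pages linked above give recommendations.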
Regards,
Ahmet M.
On 11.07.2020 01:34, Janna Ore Nugent wrote:
Hi All,
I’ve got an intermittent situation with GPU nodes that sinfo says are
available and idle, but squeue reports as “ReqNodeNotAvail”. We’ve
cycled the nodes to restart services but it hasn’t helped. Any
suggestions for resolving this or digging into it more deeply?