Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> On 04/05/2017 03:59 PM, Loris Bennett wrote:
>
>> We are running 16.05.10-2 with power-saving. However, we have noticed a
>> problem recently when nodes are woken up in order to start a job. The
>> node will go from 'idle~' to, say, 'mixed#', but then the job will fail
>> and the node will be put in 'down*'. We have turned up the log level to
>> 'debug' with the DebugFlag 'Power', but this hasn't produced anything
>> relevant. The problem is, however, resolved if the node is rebooted.
>>
>> Thus, there seems to be some disturbance of the communication between
>> the slurmd on the woken node and the slurmctld on the administration
>> node. Does anyone have any idea what might be going on?
>
> We have seen something similar with Slurm 16.05.10.
>
> How many nodes are in your network? If there are more than about 400 devices in
> the network, you must tune the kernel ARP cache of the slurmctld server, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

Thanks for the link, but we have fewer than 120 nodes, so we are a long
way from the 512-device limit.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de