Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> On 04/05/2017 03:59 PM, Loris Bennett wrote:
>
>> We are running 16.05.10-2 with power-saving.  However, we have noticed a
>> problem recently when nodes are woken up in order to start a job.  The
>> node will go from 'idle~' to, say, 'mixed#', but then the job will fail
>> and the node will be put in 'down*'.  We have turned up the log level to
>> 'debug' with the DebugFlag 'Power', but this hasn't produced anything
>> relevant.  The problem is, however, resolved if the node is rebooted.
>>
>> Thus, there seems to be some disturbance of the communication between
>> the slurmd on the woken node and the slurmctld on the administration
>> node.  Does anyone have any idea what might be going on?
>
> We have seen something similar with Slurm 16.05.10.
>
> How many nodes are in your network?  If there are more than about 400 devices
> in the network, you must tune the kernel ARP cache of the slurmctld server, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

Thanks for the link, but we have fewer than 120 nodes, so we are a long
way from the 512-device limit.
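
For anyone who wants to check this on their own slurmctld server, here is a
minimal sketch, assuming a Linux host that exposes the neighbour table in
/proc/net/arp and the thresholds under /proc/sys/net/ipv4/neigh/default.  The
function names are made up for illustration, and treating the 512 figure as
the default gc_thresh2 value is my assumption, not something stated in the
thread.

```python
from pathlib import Path

# Assumed location of the IPv4 neighbour-table thresholds on Linux.
NEIGH_DIR = "/proc/sys/net/ipv4/neigh/default"


def count_arp_entries(arp_table_text: str) -> int:
    """Count entries in text laid out like /proc/net/arp.

    The first non-empty line is a column header, so it is not counted.
    """
    lines = [ln for ln in arp_table_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def read_gc_thresholds(base: str = NEIGH_DIR) -> dict:
    """Read gc_thresh1..3 from the kernel (only works on a Linux host)."""
    names = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    return {name: int(Path(base, name).read_text()) for name in names}


def headroom(entries: int, threshold: int) -> int:
    """Number of further entries that fit before the threshold is reached."""
    return threshold - entries


def report() -> str:
    """Hypothetical helper: summarise current usage against gc_thresh2."""
    entries = count_arp_entries(Path("/proc/net/arp").read_text())
    limit = read_gc_thresholds()["gc_thresh2"]
    return f"{entries} ARP entries, {headroom(entries, limit)} below gc_thresh2={limit}"
```

Calling report() on the administration node gives a one-line summary; with
fewer than 120 nodes the count should sit well below the threshold, which is
why ARP cache tuning is unlikely to be the cause here.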

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
