[slurm-dev] Re: Nodes in state 'down*' despite slurmd running

Alexey Safonov Wed, 05 Apr 2017 20:17:29 -0700

I have the same issue with 10 nodes,
running Slurm 16.05.5.


On 5 April 2017 at 22:02, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> On 04/05/2017 03:59 PM, Loris Bennett wrote:
>>
>> We are running 16.05.10-2 with power-saving.  However, we have noticed a
>> problem recently when nodes are woken up in order to start a job.  The
>> node will go from 'idle~' to, say, 'mixed#', but then the job will fail
>> and the node will be put in 'down*'.  We have turned up the log level to
>> 'debug' with the DebugFlag 'Power', but this hasn't produced anything
>> relevant.  The problem is, however, resolved if the node is rebooted.
>>
>> Thus, there seems to be some disturbance of the communication between
>> the slurmd on the woken node and the slurmctld on the administration
>> node.  Does anyone have any idea what might be going on?
>
>
> We have seen something similar with Slurm 16.05.10.
>
> How many nodes are in your network?  If there are more than about 400
> devices in the network, you must tune the kernel ARP cache of the slurmctld
> server, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
>
> /Ole
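
For anyone wanting to check whether the ARP cache limit Ole mentions applies to their slurmctld server, here is a rough sketch in Python. It assumes a Linux host where the IPv4 neighbour table is readable from /proc/net/arp and the limits from /proc/sys/net/ipv4/neigh/default/gc_thresh1..3. The function names and the status labels are made up for illustration; they are not part of Slurm or the kernel. The count is approximate, since /proc/net/arp also lists incomplete entries.

```python
from pathlib import Path

# Assumed location of the kernel neighbour-table limits on Linux.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def count_arp_entries(arp_table_text: str) -> int:
    """Count entries in text laid out like /proc/net/arp (first line is a header)."""
    lines = [ln for ln in arp_table_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def arp_cache_pressure(entries: int, gc_thresh1: int, gc_thresh2: int, gc_thresh3: int) -> str:
    """Classify the entry count against the three thresholds (labels are illustrative)."""
    if entries >= gc_thresh3:
        return "at-hard-limit"      # new neighbour entries may be refused
    if entries >= gc_thresh2:
        return "above-soft-limit"   # garbage collector trims entries aggressively
    if entries >= gc_thresh1:
        return "gc-eligible"        # garbage collector is allowed to run
    return "ok"


def read_threshold(name: str) -> int:
    """Read one gc_thresh value from the assumed /proc location."""
    return int((NEIGH_DIR / name).read_text().strip())


if __name__ == "__main__":
    if ARP_TABLE.exists() and NEIGH_DIR.exists():
        entries = count_arp_entries(ARP_TABLE.read_text())
        limits = [read_threshold(f"gc_thresh{i}") for i in (1, 2, 3)]
        print(f"ARP entries: {entries}, thresholds: {limits}")
        print("status:", arp_cache_pressure(entries, *limits))
    else:
        print("Neighbour table not found under /proc; is this a Linux host?")
```

If the status is anything other than "ok" on the slurmctld server, raising the gc_thresh values as described on the wiki page linked above is probably worth trying. With only 10 nodes this is unlikely to be the cause unless the server shares a network with many other devices.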
