Running Slurm 15.08.12 on a Debian 8 system, we have a node that keeps being drained, and I can't tell why. From slurmctld.log on our primary controller:
[2017-01-28T06:45:29.961] error: Setting node hpcc-1 state to DRAIN
[2017-01-28T06:45:29.961] drain_nodes: node hpcc-1 state set to DRAIN
[2017-01-28T06:45:29.961] error: _slurm_rpc_node_registration node=hpcc-1: Invalid argument

The slurmd.log on the node itself shows normal job-completion messages just before the drain, and then nothing immediately afterwards:

[2017-01-28T06:42:46.563] _run_prolog: run job script took usec=7
[2017-01-28T06:42:46.563] _run_prolog: prolog with lock for job 1930101 ran for 0 seconds
[2017-01-28T06:42:48.427] [1930101.0] done with job
[2017-01-28T14:37:26.365] [1928122] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2017-01-28T14:37:26.367] [1928122] done with job

Any thoughts on figuring out or fixing this? The node that keeps draining also happens to be our backup controller, in case that is related:

$ grep hpcc-1 slurm.conf
BackupController=hpcc-1
NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
PartitionName=headNode Nodes=hpcc-1 Default=NO MaxTime=INFINITE State=UP

This is on our testing/development grid, so we can easily make changes to debug or fix the problem.
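For reference, here is a sketch of the diagnostic commands I've been using so far (assuming a standard Slurm install; nothing here is specific to our site):

```shell
# Show the node's full state, including the Reason= field that
# slurmctld records when it sets a node to DRAIN.
scontrol show node hpcc-1

# Print the hardware configuration slurmd actually detects on the node.
# A mismatch between this output and the NodeName line in slurm.conf
# (Sockets, CoresPerSocket, ThreadsPerCore, RealMemory, TmpDisk, Gres)
# is a common cause of the "Invalid argument" registration error.
slurmd -C

# Once the config and the detected hardware agree, return the node to service.
scontrol update NodeName=hpcc-1 State=RESUME
```

So far the Reason field hasn't pointed me at anything obvious, which is why I'm asking here.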