Running Slurm 15.08.12 on a Debian 8 system, we have a node that keeps
being drained and I can't tell why. From slurmctld.log on our primary
controller:

[2017-01-28T06:45:29.961] error: Setting node hpcc-1 state to DRAIN
[2017-01-28T06:45:29.961] drain_nodes: node hpcc-1 state set to DRAIN
[2017-01-28T06:45:29.961] error: _slurm_rpc_node_registration
node=hpcc-1: Invalid argument
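
In case it matters, the drain reason recorded by the controller can be
pulled with scontrol/sinfo (standard usage, nothing version-specific as
far as I know):

$ scontrol show node hpcc-1 | grep -i reason
$ sinfo -R -n hpcc-1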

The slurmd.log on the node itself shows a normal job completion message
just before the drain, and then nothing immediately after it:

[2017-01-28T06:42:46.563] _run_prolog: run job script took usec=7
[2017-01-28T06:42:46.563] _run_prolog: prolog with lock for job
1930101 ran for 0 seconds
[2017-01-28T06:42:48.427] [1930101.0] done with job
[2017-01-28T14:37:26.365] [1928122] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2017-01-28T14:37:26.367] [1928122] done with job
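
One thing we plan to try is raising the daemon log levels to catch more
detail around the next failure, roughly like this (log level names per
the slurm.conf(5) man page for our version):

# in slurm.conf, then sync to all nodes
SlurmctldDebug=debug2
SlurmdDebug=debug2
$ scontrol reconfigure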

Any thoughts on diagnosing or fixing this? The node being drained also
happens to be our backup controller, in case that's related:
$ grep hpcc-1 slurm.conf
BackupController=hpcc-1
NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6
ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
PartitionName=headNode Nodes=hpcc-1 Default=NO MaxTime=INFINITE State=UP
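
Our working guess is that the registration "Invalid argument" means the
node is reporting hardware or gres that doesn't match its slurm.conf
definition, so one check is to compare what slurmd detects against the
NodeName line above (and that gres.conf on hpcc-1 still defines the xld
and xcd gres):

$ slurmd -C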

This is on our testing/development grid, so we can easily make changes
to debug or fix the problem.
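
In the meantime we just clear the drain by hand, e.g.:

$ scontrol update NodeName=hpcc-1 State=RESUME

though it eventually gets drained again, so that's only a workaround.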
