No epilog scripts defined, and access to the state save location is fine, as an scontrol takeover works, but it does have the side effect of the backup draining itself. I set SlurmctldDebug to debug3 and didn't get much more info:

[2017-01-31T09:45:22.329] debug2: node_did_resp hpcc-1
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-07
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-03
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-05
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-02
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-04
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-01
[2017-01-31T09:45:22.341] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2017-01-31T09:45:22.341] error: Setting node hpcc-1 state to DRAIN
[2017-01-31T09:45:22.341] drain_nodes: node hpcc-1 state set to DRAIN
[2017-01-31T09:45:22.341] error: _slurm_rpc_node_registration node=hpcc-1: Invalid argument

I'll try turning it up to debug5 and also enable SlurmdDebug to see if that shows anything.
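For reference, the drain reason and the runtime debug level can also be poked at with plain scontrol; these are standard commands, shown here with our node name:

$ scontrol show node hpcc-1 | grep -i reason    # shows the Reason= string recorded when the node was drained
$ scontrol setdebug debug5                      # raise slurmctld logging on the fly, no restart needed
$ scontrol update NodeName=hpcc-1 State=RESUME  # clear the drain once the cause is understood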
On Mon, Jan 30, 2017 at 12:42 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
>
> Hi E V,
>
> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf
> to get it to be more verbose.
>
> Do you have any epilog scripts defined?
>
> If it's related to the node being the backup controller, as a wild guess,
> perhaps your backup controller doesn't have access to the StateSaveLocation
> directory?
>
> Paddy
>
> On Mon, Jan 30, 2017 at 07:38:39AM -0800, E V wrote:
>
>> Running slurm 15.08.12 on a debian 8 system, we have a node that keeps
>> being drained and I can't tell why. From slurmctld.log on our ctld
>> primary:
>>
>> [2017-01-28T06:45:29.961] error: Setting node hpcc-1 state to DRAIN
>> [2017-01-28T06:45:29.961] drain_nodes: node hpcc-1 state set to DRAIN
>> [2017-01-28T06:45:29.961] error: _slurm_rpc_node_registration
>> node=hpcc-1: Invalid argument
>>
>> The slurmd.log on the node itself shows normal job completion messages
>> just before, and then nothing immediately after the drain:
>>
>> [2017-01-28T06:42:46.563] _run_prolog: run job script took usec=7
>> [2017-01-28T06:42:46.563] _run_prolog: prolog with lock for job
>> 1930101 ran for 0 seconds
>> [2017-01-28T06:42:48.427] [1930101.0] done with job
>> [2017-01-28T14:37:26.365] [1928122] sending
>> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
>> [2017-01-28T14:37:26.367] [1928122] done with job
>>
>> Any thoughts on figuring out/fixing this? The node going to drain also
>> happens to be our backup controller, if that may be related:
>>
>> $ grep hpcc-1 slurm.conf
>> BackupController=hpcc-1
>> NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6
>> ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
>> PartitionName=headNode Nodes=hpcc-1 Default=NO MaxTime=INFINITE State=UP
>>
>> This is on our testing/development grid, so we can easily make
>> changes to debug/fix the problem.
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/
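For anyone else hitting this: a quick way to rule out the StateSaveLocation guess above is to check the directory as the slurmctld user from the backup controller. The path and user below are examples only; take the real values from your slurm.conf:

$ scontrol show config | grep -i StateSaveLocation
$ sudo -u slurm ls -ld /var/spool/slurmctld    # example path/user; must be readable and writable by slurmctld on the backup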