No epilog scripts defined, and access to the state save directory is fine,
as an scontrol takeover works (sketch after the log below), but it does have
the side effect of the backup draining itself. I set SlurmctldDebug to
debug3 and didn't get much more info:
[2017-01-31T09:45:22.329] debug2: node_did_resp hpcc-1
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-07
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-03
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-05
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-02
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-04
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-01
[2017-01-31T09:45:22.341] debug2: Processing RPC:
MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2017-01-31T09:45:22.341] error: Setting node hpcc-1 state to DRAIN
[2017-01-31T09:45:22.341] drain_nodes: node hpcc-1 state set to DRAIN
[2017-01-31T09:45:22.341] error: _slurm_rpc_node_registration
node=hpcc-1: Invalid argument

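For reference, the takeover test was just the standard scontrol commands run
on the backup; a minimal sketch (our node names, as in slurm.conf below):

$ scontrol takeover           # run on hpcc-1, the BackupController
$ scontrol ping               # confirm which controller is now responding
$ scontrol show node hpcc-1   # check the State= and Reason= fields
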
I'll try turning it up to debug5 and also enabling SlurmdDebug to see if
that shows anything.
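Something like the following in slurm.conf, followed by an scontrol
reconfigure, should do it (a sketch; debug5 is very verbose, so I'll only
leave it on briefly):

SlurmctldDebug=debug5
SlurmdDebug=debug5

$ scontrol reconfigure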

On Mon, Jan 30, 2017 at 12:42 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
>
> Hi E V,
>
> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf
> to get it to be more verbose.
>
> Do you have any epilog scripts defined?
>
> If it's related to the node being the backup controller, as a wild guess,
> perhaps your backup controller doesn't have access to the StateSaveLocation
> directory?
>
> Paddy
>
> On Mon, Jan 30, 2017 at 07:38:39AM -0800, E V wrote:
>
>>
>> Running slurm 15.08.12 on a debian 8 system we have a node that keeps
>> being drained and I can't tell why. From slurmctld.log on our ctld
>> primary:
>>
>> [2017-01-28T06:45:29.961] error: Setting node hpcc-1 state to DRAIN
>> [2017-01-28T06:45:29.961] drain_nodes: node hpcc-1 state set to DRAIN
>> [2017-01-28T06:45:29.961] error: _slurm_rpc_node_registration
>> node=hpcc-1: Invalid argument
>>
>> The slurmd.log on the node itself shows normal job completion messages
>> just before, and then nothing immediately after the drain:
>>
>> [2017-01-28T06:42:46.563] _run_prolog: run job script took usec=7
>> [2017-01-28T06:42:46.563] _run_prolog: prolog with lock for job
>> 1930101 ran for 0 seconds
>> [2017-01-28T06:42:48.427] [1930101.0] done with job
>> [2017-01-28T14:37:26.365] [1928122] sending
>> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
>> [2017-01-28T14:37:26.367] [1928122] done with job
>>
>> Any thoughts on figuring out or fixing this? The node being drained also
>> happens to be our backup controller, in case that's related:
>> $ grep hpcc-1 slurm.conf
>> BackupController=hpcc-1
>> NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6
>> ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
>> PartitionName=headNode Nodes=hpcc-1 Default=NO MaxTime=INFINITE State=UP
>>
>> This is on our testing/development grid systems, so we can easily make
>> changes to debug/fix the problem.
>>
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/
