Ah, success. It was GRES-related. I had verified that the slurm.conf files
were the same, but I never verified gres.conf. It turns out our production
gres.conf had been copied to the backup controller, and it had the same
GRES names but different hosts associated with them. After fixing that and
restarting slurmd & slurmctld, scontrol takeover now works without the
backup draining itself. Thanks for the help, folks.
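
For the record, the shape of the fix: each NodeName entry in gres.conf has to
name a host that actually carries that GRES, and ours still named the
production hosts, so slurmd on hpcc-1 presumably reported no GRES and failed
registration. A rough sketch of what a correct entry looks like (GRES names
and counts are from our config; this is illustrative, not our exact file):

# gres.conf shared by both controllers -- sketch only, not the exact file
# count-only GRES like these need no File= line; device-backed GRES would add one
NodeName=hpcc-1 Name=xvd Count=1
NodeName=hpcc-1 Name=xcd Count=1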

On Wed, Feb 1, 2017 at 8:28 AM, E V <eliven...@gmail.com> wrote:
> Yes, the head node & backup head sync to the same NTP server. Verifying by
> hand, they seem to be within 1 second of each other. Here's the node info
> it finds as it starts up, from slurmd.log:
> [2017-01-31T15:31:59.711] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2
> Memory=48388 TmpDisk=508671 Uptime=1147426 CPUSpecList=(null)
> And the definition from slurm.conf for comparison:
> NodeName=hpcc-1 Gres=xvd:1,xcd:1 Sockets=2 CoresPerSocket=6
> ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=495000
> So that seems fine as far as I can tell. Yes, slurm.conf is the same on
> all the systems; I verify with md5sum after every change before doing
> scontrol reconfigure. I haven't tried disabling the GRES; I guess I'll
> try that next. Their semaphore capability does appear to be working
> correctly. slurmd.log does show errors from looking for non-existent
> .so libraries for the GRES names, so I suppose it's plausible that the
> fallback logic isn't working correctly. From slurmd.log:
>
> [2017-01-31T15:34:06.342] Trying to load plugin 
> /usr/local/lib/slurm/gres_xvd.so
> [2017-01-31T15:34:06.342] /usr/local/lib/slurm/gres_xvd.so: Does not
> exist or not a regular file.
> [2017-01-31T15:34:06.342] gres: Couldn't find the specified plugin
> name for gres/xvd looking at all files
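>
> For what it's worth, the md5sum check I run after every change is roughly
> something like this (hosts are from this thread; the /etc/slurm path is an
> assumption, use wherever your slurm.conf actually lives):
>
> # compare slurm.conf checksums across all hosts
> for h in bkr hpcc-1 r1-0{1..7}; do
>     ssh "$h" md5sum /etc/slurm/slurm.conf
> done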
>
>
> On Wed, Feb 1, 2017 at 7:43 AM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
>>
>> Similar to Lachlan's suggestions: check that slurm.conf is the same on
>> all nodes, and in particular that the numbers of CPUs and cores are correct.
>>
>> Have you tried removing the Gres parameters? Perhaps it's looking for devices
>> it can't find.
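>>
>> As a quick test, something like this in slurm.conf, taking the NodeName line
>> from your original mail and just dropping the Gres= part (and the matching
>> GresTypes entry, if you have one), then restart/reconfigure:
>>
>> # was: NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6 ...
>> NodeName=hpcc-1 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000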
>>
>> Paddy
>>
>> On Tue, Jan 31, 2017 at 02:08:51PM -0800, Lachlan Musicman wrote:
>>
>>> Trivial questions: does the node have the correct time with respect to the
>>> head node? And is the node correctly configured in slurm.conf (# of CPUs,
>>> amount of memory, etc.)?
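>>>
>>> e.g., something along these lines on the node itself (slurmd -C prints the
>>> hardware slurmd detects, in slurm.conf format, for comparison; 'head-node'
>>> below is just a placeholder hostname):
>>>
>>> slurmd -C                     # compare against the node's slurm.conf entry
>>> date; ssh head-node date      # rough clock check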
>>>
>>> cheers
>>> L.
>>>
>>> ------
>>> The most dangerous phrase in the language is, "We've always done it this
>>> way."
>>>
>>> - Grace Hopper
>>>
>>> On 1 February 2017 at 08:03, E V <eliven...@gmail.com> wrote:
>>>
>>> >
>>> > Enabling debug5 doesn't show anything more useful. I don't see
>>> > anything relevant in slurmd.log, just job starts and stops.
>>> > slurmctld.log has the takeover output, with the backup head node
>>> > immediately draining itself as before, but with more of the
>>> > context before the DRAIN:
>>> >
>>> > [2017-01-31T15:37:38.387] debug:  Spawning registration agent for
>>> > bkr,hpcc-1,r1-[01-07] 9 hosts
>>> > [2017-01-31T15:37:38.387] debug2: Spawning RPC agent for msg_type
>>> > REQUEST_NODE_REGISTRATION_STATUS
>>> > [2017-01-31T15:37:38.387] debug2: got 1 threads to send out
>>> > [2017-01-31T15:37:38.388] debug3: Tree sending to bkr
>>> > [2017-01-31T15:37:38.388] debug2: slurm_connect failed: Connection refused
>>> > [2017-01-31T15:37:38.388] debug2: Error connecting slurm stream socket
>>> > at 172.18.1.102:6820: Connection refused
>>> > [2017-01-31T15:37:38.388] debug3: connect refused, retrying
>>> > [2017-01-31T15:37:38.388] debug2: Tree head got back 0 looking for 9
>>> > [2017-01-31T15:37:38.388] debug3: Tree sending to hpcc-1
>>> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-01
>>> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-02
>>> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-03
>>> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-04
>>> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-05
>>> > [2017-01-31T15:37:38.390] debug3: Tree sending to r1-07
>>> > [2017-01-31T15:37:38.390] debug3: Tree sending to r1-06
>>> > [2017-01-31T15:37:38.390] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.390] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
>>> > steps and a timeout of 10000
>>> > [2017-01-31T15:37:38.392] debug2: Tree head got back 1
>>> > [2017-01-31T15:37:38.392] debug2: Tree head got back 2
>>> > [2017-01-31T15:37:38.392] debug2: Tree head got back 3
>>> > [2017-01-31T15:37:38.392] debug2: Tree head got back 4
>>> > [2017-01-31T15:37:38.393] debug2: Processing RPC:
>>> > MESSAGE_NODE_REGISTRATION_STATUS from uid=0
>>> > [2017-01-31T15:37:38.393] error: Setting node hpcc-1 state to DRAIN
>>> > [2017-01-31T15:37:38.393] drain_nodes: node hpcc-1 state set to DRAIN
>>> > [2017-01-31T15:37:38.393] error: _slurm_rpc_node_registration
>>> > node=hpcc-1: Invalid argument
>>> > [2017-01-31T15:37:38.403] debug2: Processing RPC:
>>> > MESSAGE_NODE_REGISTRATION_STATUS from uid=0
>>> > [2017-01-31T15:37:38.403] debug3: Registered job 1932073.0 on node r1-05
>>> > [2017-01-31T15:37:38.403] debug3: resetting job_count on node r1-05 from 1
>>> > to 2
>>> > [2017-01-31T15:37:38.403] debug2: _slurm_rpc_node_registration
>>> > complete for r1-05 usec=76
>>> > [2017-01-31T15:37:38.404] debug2: Tree head got back 5
>>> > [2017-01-31T15:37:38.405] debug2: Tree head got back 6
>>> >
>>> > On Tue, Jan 31, 2017 at 9:54 AM, E V <eliven...@gmail.com> wrote:
>>> > >
>>> > > No epilog scripts are defined, and access to the saved state is fine,
>>> > > since scontrol takeover works, but it does have the side effect of the
>>> > > backup draining itself. I set SlurmctldDebug to debug3 and didn't get
>>> > > much more info:
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp hpcc-1
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-07
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-03
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-05
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-02
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-04
>>> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-01
>>> > > [2017-01-31T09:45:22.341] debug2: Processing RPC:
>>> > > MESSAGE_NODE_REGISTRATION_STATUS from uid=0
>>> > > [2017-01-31T09:45:22.341] error: Setting node hpcc-1 state to DRAIN
>>> > > [2017-01-31T09:45:22.341] drain_nodes: node hpcc-1 state set to DRAIN
>>> > > [2017-01-31T09:45:22.341] error: _slurm_rpc_node_registration
>>> > > node=hpcc-1: Invalid argument
>>> > >
>>> > > I'll try turning it up to debug5 and also enabling SlurmdDebug to see if
>>> > > that shows anything.
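>>> > >
>>> > > i.e., roughly this in slurm.conf, followed by an scontrol reconfigure
>>> > > (a sketch; the string debug level names are the ones I've been using):
>>> > >
>>> > > # temporary, for debugging only
>>> > > SlurmctldDebug=debug5
>>> > > SlurmdDebug=debug3
>>> > > # then push it out with: scontrol reconfigure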
>>> > >
>>> > > On Mon, Jan 30, 2017 at 12:42 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
>>> > >>
>>> > >> Hi E V,
>>> > >>
>>> > >> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf
>>> > >> to get it to be more verbose.
>>> > >>
>>> > >> Do you have any epilog scripts defined?
>>> > >>
>>> > >> If it's related to the node being the backup controller then, as a wild
>>> > >> guess, perhaps your backup controller doesn't have access to the
>>> > >> StateSaveLocation directory?
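>>> > >>
>>> > >> A quick way to check that would be something like the following on the
>>> > >> backup (the directory and the 'slurm' user below are placeholders, not
>>> > >> necessarily what your site uses):
>>> > >>
>>> > >> scontrol show config | grep -i StateSaveLocation
>>> > >> # then try writing there as the SlurmUser, e.g.:
>>> > >> sudo -u slurm touch /var/spool/slurmctld/write_test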
>>> > >>
>>> > >> Paddy
>>> > >>
>>> > >> On Mon, Jan 30, 2017 at 07:38:39AM -0800, E V wrote:
>>> > >>
>>> > >>>
>>> > >>> Running Slurm 15.08.12 on a Debian 8 system, we have a node that keeps
>>> > >>> being drained and I can't tell why. From slurmctld.log on our primary
>>> > >>> controller:
>>> > >>>
>>> > >>> [2017-01-28T06:45:29.961] error: Setting node hpcc-1 state to DRAIN
>>> > >>> [2017-01-28T06:45:29.961] drain_nodes: node hpcc-1 state set to DRAIN
>>> > >>> [2017-01-28T06:45:29.961] error: _slurm_rpc_node_registration
>>> > >>> node=hpcc-1: Invalid argument
>>> > >>>
>>> > >>> The slurmd.log on the node itself shows a normal job completion message
>>> > >>> just before, and then nothing immediately after the drain:
>>> > >>>
>>> > >>> [2017-01-28T06:42:46.563] _run_prolog: run job script took usec=7
>>> > >>> [2017-01-28T06:42:46.563] _run_prolog: prolog with lock for job
>>> > >>> 1930101 ran for 0 seconds
>>> > >>> [2017-01-28T06:42:48.427] [1930101.0] done with job
>>> > >>> [2017-01-28T14:37:26.365] [1928122] sending
>>> > >>> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
>>> > >>> [2017-01-28T14:37:26.367] [1928122] done with job
>>> > >>>
>>> > >>> Any thoughts on figuring out or fixing this? The node being drained also
>>> > >>> happens to be our backup controller, in case that's related:
>>> > >>> $ grep hpcc-1 slurm.conf
>>> > >>> BackupController=hpcc-1
>>> > >>> NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6
>>> > >>> ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
>>> > >>> PartitionName=headNode Nodes=hpcc-1 Default=NO MaxTime=INFINITE State=UP
>>> > >>>
>>> > >>> This is on our testing/development grid systems so we can easily make
>>> > >>> changes to debug/fix the problem.
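>>> > >>>
>>> > >>> For reference, the drain Reason that slurmctld records can be pulled with
>>> > >>> the usual commands:
>>> > >>>
>>> > >>> sinfo -R                                     # drained/down nodes with Reason
>>> > >>> scontrol show node hpcc-1 | grep -i Reason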
>>> > >>>
>>> > >>
>>> > >> --
>>> > >> Paddy Doyle
>>> > >> Trinity Centre for High Performance Computing,
>>> > >> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>>> > >> Phone: +353-1-896-3725
>>> > >> http://www.tchpc.tcd.ie/
>>> >
>>
>> --
>> Paddy Doyle
>> Trinity Centre for High Performance Computing,
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> http://www.tchpc.tcd.ie/
