On 16/05/19 17:10 +0200, Lentes, Bernd wrote:
> My two-node HA cluster fenced one of its nodes on the 14th of May.
> ha-idg-1 was the DC; ha-idg-2 was fenced.
> It happened around 11:30 am.
> The log from the fenced node isn't really informative:
> 
> [...]
> 
> Node restarts at 11:44 am.
> The DC is more informative:
> 
> =================================
> 2019-05-14T11:24:05.105739+02:00 ha-idg-1 PackageKit: daemon quit
> 2019-05-14T11:24:05.106284+02:00 ha-idg-1 packagekitd[11617]: (packagekitd:11617): GLib-CRITICAL **: Source ID 15 was not found when attempting to remove it
> 2019-05-14T11:27:23.276813+02:00 ha-idg-1 liblogging-stdlog: -- MARK --
> 2019-05-14T11:30:01.248803+02:00 ha-idg-1 cron[24140]: pam_unix(crond:session): session opened for user root by (uid=0)
> 2019-05-14T11:30:01.253150+02:00 ha-idg-1 systemd[1]: Started Session 17988 of user root.
> 2019-05-14T11:30:01.301674+02:00 ha-idg-1 CRON[24140]: pam_unix(crond:session): session closed for user root
> 2019-05-14T11:30:03.710784+02:00 ha-idg-1 kernel: [1015426.947016] tg3 0000:02:00.3 eth3: Link is down
> 2019-05-14T11:30:03.792500+02:00 ha-idg-1 kernel: [1015427.024779] bond1: link status definitely down for interface eth3, disabling it
> 2019-05-14T11:30:04.849892+02:00 ha-idg-1 hp-ams[2559]: CRITICAL: Network Adapter Link Down (Slot 0, Port 4)
> 2019-05-14T11:30:05.261968+02:00 ha-idg-1 kernel: [1015428.498127] tg3 0000:02:00.3 eth3: Link is up at 100 Mbps, full duplex
> 2019-05-14T11:30:05.261985+02:00 ha-idg-1 kernel: [1015428.498138] tg3 0000:02:00.3 eth3: Flow control is on for TX and on for RX
> 2019-05-14T11:30:05.261986+02:00 ha-idg-1 kernel: [1015428.498143] tg3 0000:02:00.3 eth3: EEE is disabled
> 2019-05-14T11:30:05.352500+02:00 ha-idg-1 kernel: [1015428.584725] bond1: link status definitely up for interface eth3, 100 Mbps full duplex
> 2019-05-14T11:30:05.983387+02:00 ha-idg-1 hp-ams[2559]: NOTICE: Network Adapter Link Down (Slot 0, Port 4) has been repaired
> 2019-05-14T11:30:10.520149+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A processor failed, forming new configuration.
> 2019-05-14T11:30:16.524341+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A new membership (192.168.100.10:1120) was formed. Members left: 1084777492
> 2019-05-14T11:30:16.524799+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] Failed to receive the leave message. failed: 1084777492
> 2019-05-14T11:30:16.525199+02:00 ha-idg-1 lvm[12430]: confchg callback. 0 joined, 1 left, 1 members
> 2019-05-14T11:30:16.525706+02:00 ha-idg-1 attrd[6967]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526143+02:00 ha-idg-1 cib[6964]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526480+02:00 ha-idg-1 attrd[6967]:   notice: Removing all ha-idg-2 attributes for peer loss
> 2019-05-14T11:30:16.526742+02:00 ha-idg-1 cib[6964]:   notice: Purged 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527283+02:00 ha-idg-1 stonith-ng[6965]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.527618+02:00 ha-idg-1 attrd[6967]:   notice: Purged 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527884+02:00 ha-idg-1 stonith-ng[6965]:   notice: Purged 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.528156+02:00 ha-idg-1 corosync[6957]:   [QUORUM] Members[1]: 1084777482
> 2019-05-14T11:30:16.528435+02:00 ha-idg-1 corosync[6957]:   [MAIN  ] Completed service synchronization, ready to provide service.
> 2019-05-14T11:30:16.548481+02:00 ha-idg-1 kernel: [1015439.782587] dlm: closing connection to node 1084777492
> 2019-05-14T11:30:16.555995+02:00 ha-idg-1 dlm_controld[12279]: 1015492 fence request 1084777492 pid 24568 nodedown time 1557826216 fence_all dlm_stonith
> 2019-05-14T11:30:16.626285+02:00 ha-idg-1 crmd[6969]:  warning: Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.626534+02:00 ha-idg-1 dlm_stonith: stonith_api_time: Found 1 entries for 1084777492/(null): 0 in progress, 1 completed
> 2019-05-14T11:30:16.626731+02:00 ha-idg-1 dlm_stonith: stonith_api_time: Node 1084777492/(null) last kicked at: 1556884018
> 2019-05-14T11:30:16.626875+02:00 ha-idg-1 stonith-ng[6965]:   notice: Client stonith-api.24568.6a9fa406 wants to fence (reboot) '1084777492' with device '(any)'
> 2019-05-14T11:30:16.627026+02:00 ha-idg-1 crmd[6969]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> 2019-05-14T11:30:16.627165+02:00 ha-idg-1 crmd[6969]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.627302+02:00 ha-idg-1 crmd[6969]:  warning: Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.627439+02:00 ha-idg-1 stonith-ng[6965]:   notice: Requesting peer fencing (reboot) of ha-idg-2
> 2019-05-14T11:30:16.627578+02:00 ha-idg-1 pacemakerd[6963]:   notice: Node ha-idg-2 state is now lost
> ==================================
> 
> One network interface went down for a short period. But it's part of a
> bonding device (round-robin), so the connection shouldn't have been lost.

Well, not all bonding modes are equal. Without any further
knowledge, my guess is that round-robin provides no redundancy
message-wise, so the token got irrecoverably lost anyway
(remember, corosync messaging has little in common with the TCP
you may have in mind when reasoning about this), the timeout
kicked in, and the configured safety measure (fencing) ensued.
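
If it helps, a rough sketch of the two knobs people usually look at
in this situation; the values below are purely illustrative, not
recommendations for your particular setup:

    # bonding: balance-rr (mode 0) stripes frames across the slaves but
    # gives no real fault tolerance for an individual message during a
    # link flap; active-backup (mode 1) keeps one slave as a pure standby.
    # Typical bonding module options (illustrative only):
    #   mode=active-backup miimon=100

    # corosync.conf, totem section: a larger token timeout (in ms) lets
    # corosync ride out a short hiccup instead of declaring the peer dead
    # and triggering fencing (10000 is just an example value):
    totem {
        token: 10000
    }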

Looks good to me, especially since dlm was involved...

> Both nodes are connected directly; there is no switch in between.
> Afterwards I manually stopped the interface (ifconfig eth3 down)
> several times ... nothing happened.

Forget about if-downing any interfaces, please, at least with
corosync deployments without kronosnet.  I'm not sure whether such
a warning should be exposed more prominently; that might be
a good idea for the corosync folks to consider.  But this list
is full of these warnings already.
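
For completeness, the usual way to look at (and exercise) this without
if-downing anything; the peer address below is just a placeholder:

    # ask corosync what it thinks of its ring(s):
    corosync-cfgtool -s

    # simulate a network failure by dropping the traffic instead of
    # downing the interface (192.168.100.20 is a hypothetical peer IP):
    iptables -A INPUT  -s 192.168.100.20 -j DROP
    iptables -A OUTPUT -d 192.168.100.20 -j DROP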

> The same with the second interface (eth2).
> ???

-- 
Jan (Poki)

