On 07/28/2016 01:48 PM, Nate Clark wrote:
> On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark <n...@neworld.us> wrote:
>> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>> On 07/23/2016 10:14 PM, Nate Clark wrote:
>>>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>>>> On 23.07.2016 01:37, Nate Clark wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>>>>> encountered a start-up timing issue on a two-node cluster. I didn't
>>>>>> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
>>>>>> similar to this, or any open bugs.
>>>>>>
>>>>>> The rough outline of what happened:
>>>>>>
>>>>>> Module 1 and 2 running
>>>>>> Module 1 is DC
>>>>>> Module 2 shuts down
>>>>>> Module 1 updates node attributes used by resources
>>>>>> Module 1 shuts down
>>>>>> Module 2 starts up
>>>>>> Module 2 votes itself as DC
>>>>>> Module 1 starts up
>>>>>> Module 2 sees module 1 in corosync and notices it has quorum
>>>>>> Module 2 enters policy engine state
>>>>>> Module 2 policy engine decides to fence 1
>>>>>> Module 2 then continues and starts resources on itself based upon the old state
>>>>>>
>>>>>> For some reason the integration never occurred and module 2 started to
>>>>>> perform actions based on the stale state.
>>>>>>
>>>>>> Here are the full logs:
>>>>>>
>>>>>> Jul 20 16:29:06.376805 module-2 crmd[21969]: notice: Connecting to cluster infrastructure: corosync
>>>>>> Jul 20 16:29:06.386853 module-2 crmd[21969]: notice: Could not obtain a node name for corosync nodeid 2
>>>>>> Jul 20 16:29:06.392795 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
>>>>>> Jul 20 16:29:06.403611 module-2 crmd[21969]: notice: Quorum lost
>>>>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]: notice: Watching for stonith topology changes
>>>>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]: notice: Added 'watchdog' to the device list (1 active devices)
>>>>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]: notice: Relying on watchdog integration for fencing
>>>>>> Jul 20 16:29:06.416905 module-2 cib[21964]: notice: Defaulting to uname -n for the local corosync node name
>>>>>> Jul 20 16:29:06.417044 module-2 crmd[21969]: notice: pcmk_quorum_notification: Node module-2[2] - state is now member (was (null))
>>>>>> Jul 20 16:29:06.421821 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
>>>>>> Jul 20 16:29:06.422121 module-2 crmd[21969]: notice: Notifications disabled
>>>>>> Jul 20 16:29:06.422149 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
>>>>>> Jul 20 16:29:06.422286 module-2 crmd[21969]: notice: The local CRM is operational
>>>>>> Jul 20 16:29:06.422312 module-2 crmd[21969]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>>>>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]: notice: Added 'fence_sbd' to the device list (2 active devices)
>>>>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]: notice: Added 'ipmi-1' to the device list (3 active devices)
>>>>>> Jul 20 16:29:27.423578 module-2 crmd[21969]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>>>> Jul 20 16:29:27.424298 module-2 crmd[21969]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>>>>> Jul 20 16:29:27.460834 module-2 crmd[21969]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>>>>> Jul 20 16:29:27.463794 module-2 crmd[21969]: notice: Notifications disabled
>>>>>> Jul 20 16:29:27.463824 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
>>>>>> Jul 20 16:29:27.473285 module-2 attrd[21967]: notice: Defaulting to uname -n for the local corosync node name
>>>>>> Jul 20 16:29:27.498464 module-2 pengine[21968]: notice: Relying on watchdog integration for fencing
>>>>>> Jul 20 16:29:27.498536 module-2 pengine[21968]: notice: We do not have quorum - fencing and resource management disabled
>>>>>> Jul 20 16:29:27.502272 module-2 pengine[21968]: warning: Node module-1 is unclean!
>>>>>> Jul 20 16:29:27.502287 module-2 pengine[21968]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
>>>
>>> The above two messages indicate that module-2 cannot see module-1 at
>>> startup, therefore it must assume it is potentially misbehaving and must
>>> be shot. However, since it does not have quorum with only one out of two
>>> nodes, it must wait until module-1 joins before it can shoot it!
>>>
>>> This is a special problem with quorum in a two-node cluster. There are a
>>> variety of ways to deal with it, but the simplest is to set "two_node: 1"
>>> in corosync.conf (with corosync 2 or later). This will make each node
>>> wait for the other at startup, meaning both nodes must be started before
>>> the cluster can start working, but from that point on, it will assume it
>>> has quorum, and use fencing to ensure any lost node is really down.
>>
>> two_node is set to 1 for this system. I understand what you are saying,
>> but what usually happens is that S_INTEGRATION occurs after quorum is
>> achieved, and the current DC acknowledges the other node which just
>> started and then accepts it into the cluster. However, it looks like
>> S_POLICY_ENGINE occurred first.
>
> I saw a similar situation occur on another two-node system. Based on
> Ken's previous comment it sounds like this is unexpected behavior when
> two_node is enabled, or did I misinterpret his comment?
>
> Thanks
> -nate
I didn't think it through properly ... two_node will only affect quorum, so
the above sequence makes sense once the cluster decides fencing is
necessary. I'm not sure why it sometimes goes into S_INTEGRATION and
sometimes S_POLICY_ENGINE. In the above logs, it goes into S_INTEGRATION
because the DC election timed out.

How are the logs in the successful case different? Maybe the other node
happens to join before the DC election times out?
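For reference, the two_node setting discussed above lives in the quorum
section of corosync.conf and requires the votequorum provider (corosync 2
or later). A minimal sketch of just that section (the totem and nodelist
sections are omitted here) would look like:

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

With votequorum, two_node: 1 also implies wait_for_all: 1 unless that is
explicitly overridden, which is what makes each node wait for its peer on
the first startup. Whether both flags are in effect on a running node can
be checked with "corosync-quorumtool -s", which reports them on its Flags
line (e.g. 2Node and WaitForAll).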