On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark <n...@neworld.us> wrote:
> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>> On 07/23/2016 10:14 PM, Nate Clark wrote:
>>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>>> On 23.07.2016 01:37, Nate Clark wrote:
>>>>> Hello,
>>>>>
>>>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>>>> encountered a start-up timing issue on a two-node cluster. I didn't
>>>>> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
>>>>> similar to this, nor any open bugs.
>>>>>
>>>>> The rough outline of what happened:
>>>>>
>>>>> Modules 1 and 2 running
>>>>> Module 1 is DC
>>>>> Module 2 shuts down
>>>>> Module 1 updates node attributes used by resources
>>>>> Module 1 shuts down
>>>>> Module 2 starts up
>>>>> Module 2 votes itself as DC
>>>>> Module 2 sees module 1 in corosync and notices it has quorum
>>>>> Module 2 enters the policy engine state
>>>>> Module 2's policy engine decides to fence module 1
>>>>> Module 2 then continues and starts resources on itself based upon the old state
>>>>>
>>>>> For some reason the integration never occurred, and module 2 started
>>>>> performing actions based on stale state.
>>>>>
>>>>> Here are the full logs:
>>>>> Jul 20 16:29:06.376805 module-2 crmd[21969]: notice: Connecting to cluster infrastructure: corosync
>>>>> Jul 20 16:29:06.386853 module-2 crmd[21969]: notice: Could not obtain a node name for corosync nodeid 2
>>>>> Jul 20 16:29:06.392795 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
>>>>> Jul 20 16:29:06.403611 module-2 crmd[21969]: notice: Quorum lost
>>>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]: notice: Watching for stonith topology changes
>>>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]: notice: Added 'watchdog' to the device list (1 active devices)
>>>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]: notice: Relying on watchdog integration for fencing
>>>>> Jul 20 16:29:06.416905 module-2 cib[21964]: notice: Defaulting to uname -n for the local corosync node name
>>>>> Jul 20 16:29:06.417044 module-2 crmd[21969]: notice: pcmk_quorum_notification: Node module-2[2] - state is now member (was (null))
>>>>> Jul 20 16:29:06.421821 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
>>>>> Jul 20 16:29:06.422121 module-2 crmd[21969]: notice: Notifications disabled
>>>>> Jul 20 16:29:06.422149 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
>>>>> Jul 20 16:29:06.422286 module-2 crmd[21969]: notice: The local CRM is operational
>>>>> Jul 20 16:29:06.422312 module-2 crmd[21969]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>>>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]: notice: Added 'fence_sbd' to the device list (2 active devices)
>>>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]: notice: Added 'ipmi-1' to the device list (3 active devices)
>>>>> Jul 20 16:29:27.423578 module-2 crmd[21969]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>>> Jul 20 16:29:27.424298 module-2 crmd[21969]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>>>> Jul 20 16:29:27.460834 module-2 crmd[21969]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>>>> Jul 20 16:29:27.463794 module-2 crmd[21969]: notice: Notifications disabled
>>>>> Jul 20 16:29:27.463824 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
>>>>> Jul 20 16:29:27.473285 module-2 attrd[21967]: notice: Defaulting to uname -n for the local corosync node name
>>>>> Jul 20 16:29:27.498464 module-2 pengine[21968]: notice: Relying on watchdog integration for fencing
>>>>> Jul 20 16:29:27.498536 module-2 pengine[21968]: notice: We do not have quorum - fencing and resource management disabled
>>>>> Jul 20 16:29:27.502272 module-2 pengine[21968]: warning: Node module-1 is unclean!
>>>>> Jul 20 16:29:27.502287 module-2 pengine[21968]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
>>
>> The above two messages indicate that module-2 cannot see module-1 at
>> startup, therefore it must assume it is potentially misbehaving and
>> must be shot. However, since it does not have quorum with only one out
>> of two nodes, it must wait until module-1 joins before it can shoot it!
>>
>> This is a special problem with quorum in a two-node cluster. There are
>> a variety of ways to deal with it, but the simplest is to set
>> "two_node: 1" in corosync.conf (with corosync 2 or later). This will
>> make each node wait for the other at startup, meaning both nodes must
>> be started before the cluster can start working, but from that point
>> on, it will assume it has quorum, and use fencing to ensure any lost
>> node is really down.
>
> two_node is set to 1 for this system. I understand what you are saying,
> but what usually happens is that S_INTEGRATION occurs after quorum is
> achieved, and the current DC acknowledges the other node which just
> started and then accepts it into the cluster. However, it looks like
> S_POLICY_ENGINE occurred first.
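For context, the "two_node: 1" setting discussed above lives in the
quorum section of corosync.conf. A minimal sketch for a corosync 2.x
two-node cluster (a generic example, not this system's actual file):

    quorum {
        provider: corosync_votequorum
        # Grant quorum with a single vote once both nodes have been seen.
        # two_node: 1 also implicitly enables wait_for_all, so each node
        # waits to see its peer once at startup before assuming quorum.
        two_node: 1
    }

The implicit wait_for_all is the "wait for the other at startup"
behavior Ken describes.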
I saw a similar situation occur on another two-node system. Based on
Ken's previous comment, it sounds like this is unexpected behavior when
two_node is enabled, or did I misinterpret his comment?

Thanks,
-nate
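One quick way to confirm that two_node is actually in effect on a
running node is corosync-quorumtool. A sketch, assuming corosync 2.x,
with output abbreviated:

    $ corosync-quorumtool -s
    ...
    Expected votes:   2
    Total votes:      2
    Quorum:           1
    Flags:            2Node Quorate WaitForAll

If the Flags line does not include 2Node and WaitForAll, the setting
never took effect, and the startup fencing behavior in the logs above
would be expected rather than surprising.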