On Fri, Jul 29, 2016 at 4:09 PM, Ken Gaillot <kgail...@redhat.com> wrote:
> On 07/28/2016 01:48 PM, Nate Clark wrote:
>> On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark <n...@neworld.us> wrote:
>>> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>> On 07/23/2016 10:14 PM, Nate Clark wrote:
>>>>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>>>>> On 23.07.2016 01:37, Nate Clark wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>>>>>> encountered a start-up timing issue on a two-node cluster. I didn't
>>>>>>> notice anything similar to this in the changelogs for 1.1.14 or
>>>>>>> 1.1.15, or in the open bugs.
>>>>>>>
>>>>>>> The rough outline of what happened:
>>>>>>>
>>>>>>> Module 1 and 2 running
>>>>>>> Module 1 is DC
>>>>>>> Module 2 shuts down
>>>>>>> Module 1 updates node attributes used by resources
>>>>>>> Module 1 shuts down
>>>>>>> Module 2 starts up
>>>>>>> Module 2 votes itself as DC
>>>>>>> Module 1 starts up
>>>>>>> Module 2 sees module 1 in corosync and notices it has quorum
>>>>>>> Module 2 enters policy engine state
>>>>>>> Module 2 policy engine decides to fence module 1
>>>>>>> Module 2 then continues and starts resources on itself based upon the
>>>>>>> old state
>>>>>>>
>>>>>>> For some reason integration never occurred, and module 2 started to
>>>>>>> perform actions based on stale state.
>>>>>>>
>>>>>>> Here are the full logs:
>>>>>>> Jul 20 16:29:06.376805 module-2 crmd[21969]: notice: Connecting to cluster infrastructure: corosync
>>>>>>> Jul 20 16:29:06.386853 module-2 crmd[21969]: notice: Could not obtain a node name for corosync nodeid 2
>>>>>>> Jul 20 16:29:06.392795 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
>>>>>>> Jul 20 16:29:06.403611 module-2 crmd[21969]: notice: Quorum lost
>>>>>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]: notice: Watching for stonith topology changes
>>>>>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]: notice: Added 'watchdog' to the device list (1 active devices)
>>>>>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]: notice: Relying on watchdog integration for fencing
>>>>>>> Jul 20 16:29:06.416905 module-2 cib[21964]: notice: Defaulting to uname -n for the local corosync node name
>>>>>>> Jul 20 16:29:06.417044 module-2 crmd[21969]: notice: pcmk_quorum_notification: Node module-2[2] - state is now member (was (null))
>>>>>>> Jul 20 16:29:06.421821 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
>>>>>>> Jul 20 16:29:06.422121 module-2 crmd[21969]: notice: Notifications disabled
>>>>>>> Jul 20 16:29:06.422149 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
>>>>>>> Jul 20 16:29:06.422286 module-2 crmd[21969]: notice: The local CRM is operational
>>>>>>> Jul 20 16:29:06.422312 module-2 crmd[21969]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>>>>>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]: notice: Added 'fence_sbd' to the device list (2 active devices)
>>>>>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]: notice: Added 'ipmi-1' to the device list (3 active devices)
>>>>>>> Jul 20 16:29:27.423578 module-2 crmd[21969]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>>>>> Jul 20 16:29:27.424298 module-2 crmd[21969]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>>>>>> Jul 20 16:29:27.460834 module-2 crmd[21969]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>>>>>> Jul 20 16:29:27.463794 module-2 crmd[21969]: notice: Notifications disabled
>>>>>>> Jul 20 16:29:27.463824 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
>>>>>>> Jul 20 16:29:27.473285 module-2 attrd[21967]: notice: Defaulting to uname -n for the local corosync node name
>>>>>>> Jul 20 16:29:27.498464 module-2 pengine[21968]: notice: Relying on watchdog integration for fencing
>>>>>>> Jul 20 16:29:27.498536 module-2 pengine[21968]: notice: We do not have quorum - fencing and resource management disabled
>>>>>>> Jul 20 16:29:27.502272 module-2 pengine[21968]: warning: Node module-1 is unclean!
>>>>>>> Jul 20 16:29:27.502287 module-2 pengine[21968]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
>>>>
>>>> The above two messages indicate that module-2 cannot see module-1 at
>>>> startup, therefore it must assume it is potentially misbehaving and must
>>>> be shot. However, since it does not have quorum with only one out of two
>>>> nodes, it must wait until module-1 joins before it can shoot it!
>>>>
>>>> This is a special problem with quorum in a two-node cluster. There are a
>>>> variety of ways to deal with it, but the simplest is to set "two_node: 1"
>>>> in corosync.conf (with corosync 2 or later). This will make each node
>>>> wait for the other at startup, meaning both nodes must be started before
>>>> the cluster can start working, but from that point on, it will assume it
>>>> has quorum, and use fencing to ensure any lost node is really down.
>>>
>>> two_node is set to 1 for this system. I understand what you are saying,
>>> but what usually happens is that S_INTEGRATION occurs after quorum is
>>> achieved, and the current DC acknowledges the node which just started
>>> and accepts it into the cluster. However, it looks like S_POLICY_ENGINE
>>> occurred first.
>>
>> I saw a similar situation occur on another two-node system. Based on
>> Ken's previous comment, it sounds like this is unexpected behavior
>> when two_node is enabled, or did I misinterpret his comment?
>>
>> Thanks
>> -nate
>
> I didn't think it through properly ... two_node will only affect quorum,
> so the above sequence makes sense once the cluster decides fencing is
> necessary.
>
> I'm not sure why it sometimes goes into S_INTEGRATION and sometimes
> S_POLICY_ENGINE. In the above logs, it goes into S_INTEGRATION because
> the DC election timed out. How are the logs in the successful case
> different? Maybe the other node happens to join before the DC election
> times out?
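For reference, the two_node option discussed above lives in the quorum section of corosync.conf. A minimal sketch for a corosync 2.x cluster (node list, interfaces, and the rest of the file omitted; adapt to your setup):

```
quorum {
    provider: corosync_votequorum

    # Treat this as a two-node cluster: quorum is retained with a single
    # surviving node, and wait_for_all is implied, so a freshly booted
    # node will not claim quorum until it has seen its peer at least once.
    two_node: 1
}
```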
In both cases the DC election occurred before the second node came online.
Here are more logs from the more successful startup: the DC election
occurs, the first node elects and integrates itself, and then the second
node becomes available and is integrated into the cluster.

Jul 22 18:42:10 d25-22-left crmd[17714]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jul 22 18:42:11 d25-22-left stonith-ng[17710]: notice: Added 'fence_sbd' to the device list (2 active devices)
Jul 22 18:42:12 d25-22-left stonith-ng[17710]: notice: Added 'ipmi-73' to the device list (3 active devices)
Jul 22 18:42:31 d25-22-left crmd[17714]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jul 22 18:42:31 d25-22-left crmd[17714]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Jul 22 18:42:31 d25-22-left crmd[17714]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Jul 22 18:42:31 d25-22-left crmd[17714]: notice: Notifications disabled
Jul 22 18:42:31 d25-22-left crmd[17714]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
Jul 22 18:42:31 d25-22-left attrd[17712]: notice: Defaulting to uname -n for the local corosync node name
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Relying on watchdog integration for fencing
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: We do not have quorum - fencing and resource management disabled
Jul 22 18:42:32 d25-22-left pengine[17713]: warning: Node d25-22-right is unclean!
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start fence_sbd (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start ipmi-73 (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start SlaveIP (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start postgres:0 (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start ethmonitor:0 (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start tomcat-instance:0 (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: notice: Start ClusterMonitor:0 (d25-22-left - blocked)
Jul 22 18:42:32 d25-22-left pengine[17713]: warning: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 4: monitor fence_sbd_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 5: monitor ipmi-72_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 6: monitor ipmi-73_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 7: monitor MasterIP_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 8: monitor SlaveIP_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 9: monitor postgres:0_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 10: monitor ethmonitor:0_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 11: monitor tomcat-instance:0_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 12: monitor ClusterMonitor:0_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 13: monitor N1F1_monitor_0 on d25-22-left (local)
Jul 22 18:42:32 d25-22-left crmd[17714]: notice: Initiating action 14: monitor N2F1_monitor_0 on d25-22-left (local)
Jul 22 18:42:33 d25-22-left crmd[17714]: notice: Operation fence_sbd_monitor_0: not running (node=d25-22-left, call=5, rc=7, cib-update=29, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation ipmi-72_monitor_0: not running (node=d25-22-left, call=9, rc=7, cib-update=30, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation ipmi-73_monitor_0: not running (node=d25-22-left, call=13, rc=7, cib-update=31, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Transition aborted by status-1-master-postgres, master-postgres=-INFINITY: Transient attribute change (create cib=1.94.6, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation ClusterMonitor_monitor_0: not running (node=d25-22-left, call=41, rc=7, cib-update=32, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation ethmonitor_monitor_0: not running (node=d25-22-left, call=31, rc=7, cib-update=33, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: error: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation tomcat-instance_monitor_0: not running (node=d25-22-left, call=36, rc=7, cib-update=34, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation MasterIP_monitor_0: not running (node=d25-22-left, call=17, rc=7, cib-update=35, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation SlaveIP_monitor_0: not running (node=d25-22-left, call=21, rc=7, cib-update=36, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation N2F1_monitor_0: not running (node=d25-22-left, call=49, rc=7, cib-update=37, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation N1F1_monitor_0: not running (node=d25-22-left, call=45, rc=7, cib-update=38, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Operation postgres_monitor_0: not running (node=d25-22-left, call=26, rc=7, cib-update=39, confirmed=true)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Initiating action 3: probe_complete probe_complete-d25-22-left on d25-22-left (local) - no waiting
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Transition 0 (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Complete
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Relying on watchdog integration for fencing
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: We do not have quorum - fencing and resource management disabled
Jul 22 18:42:34 d25-22-left pengine[17713]: warning: Node d25-22-right is unclean!
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start fence_sbd (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start ipmi-73 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start SlaveIP (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start postgres:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start ethmonitor:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start tomcat-instance:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start ClusterMonitor:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: warning: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-warn-1.bz2
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Relying on watchdog integration for fencing
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: We do not have quorum - fencing and resource management disabled
Jul 22 18:42:34 d25-22-left pengine[17713]: warning: Node d25-22-right is unclean!
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start fence_sbd (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start ipmi-73 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start SlaveIP (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start postgres:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start ethmonitor:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start tomcat-instance:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left pengine[17713]: notice: Start ClusterMonitor:0 (d25-22-left - blocked)
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Complete
Jul 22 18:42:34 d25-22-left crmd[17714]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 22 18:42:34 d25-22-left pengine[17713]: warning: Calculated Transition 2: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Jul 22 18:42:40 d25-22-left crmd[17714]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 22 18:42:41 d25-22-left crmd[17714]: notice: Notifications disabled
Jul 22 18:42:41 d25-22-left crmd[17714]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Relying on watchdog integration for fencing
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: We do not have quorum - fencing and resource management disabled
Jul 22 18:42:41 d25-22-left pengine[17713]: warning: Node d25-22-right is unclean!
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start fence_sbd (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start ipmi-73 (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start SlaveIP (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start postgres:0 (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start ethmonitor:0 (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start tomcat-instance:0 (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: notice: Start ClusterMonitor:0 (d25-22-left - blocked)
Jul 22 18:42:41 d25-22-left pengine[17713]: warning: Calculated Transition 3: /var/lib/pacemaker/pengine/pe-warn-3.bz2
Jul 22 18:42:41 d25-22-left crmd[17714]: notice: Transition 3 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-3.bz2): Complete
Jul 22 18:42:41 d25-22-left crmd[17714]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 22 18:42:48 d25-22-left corosync[17646]: [TOTEM ] A new membership (172.20.71.72:84) was formed. Members joined: 2
Jul 22 18:42:48 d25-22-left corosync[17646]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jul 22 18:42:48 d25-22-left corosync[17646]: [QUORUM] This node is within the primary component and will provide service.
Jul 22 18:42:48 d25-22-left corosync[17646]: [QUORUM] Members[2]: 1 2
Jul 22 18:42:48 d25-22-left corosync[17646]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 22 18:42:48 d25-22-left crmd[17714]: notice: Membership 84: quorum acquired (2)
Jul 22 18:42:48 d25-22-left pacemakerd[17687]: notice: Membership 84: quorum acquired (2)
Jul 22 18:42:48 d25-22-left crmd[17714]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left pacemakerd[17687]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left crmd[17714]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left crmd[17714]: notice: pcmk_quorum_notification: Node (null)[2] - state is now member (was (null))
Jul 22 18:42:48 d25-22-left pacemakerd[17687]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left pacemakerd[17687]: notice: pcmk_quorum_notification: Node (null)[2] - state is now member (was (null))
Jul 22 18:42:49 d25-22-left pacemakerd[17687]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left attrd[17712]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left attrd[17712]: notice: crm_update_peer_proc: Node (null)[2] - state is now member (was (null))
Jul 22 18:42:49 d25-22-left stonith-ng[17710]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left stonith-ng[17710]: notice: crm_update_peer_proc: Node (null)[2] - state is now member (was (null))
Jul 22 18:42:49 d25-22-left stonith-ng[17710]: notice: Defaulting to uname -n for the local corosync node name
Jul 22 18:42:49 d25-22-left cib[17709]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left cib[17709]: notice: crm_update_peer_proc: Node (null)[2] - state is now member (was (null))
Jul 22 18:42:50 d25-22-left crmd[17714]: notice: Could not obtain a node name for corosync nodeid 2
Jul 22 18:42:51 d25-22-left crmd[17714]: notice: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_HA_MESSAGE origin=route_message ]

It seems like it could be dangerous for a node in a two-node system to
fence the other node as soon as quorum is reached, since the node doing
the fencing could be the one that is out of date.

-nate

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org