On 16/07/2013, at 2:05 AM, "Howley, Tom" <tom.how...@hp.com> wrote:
> Hi Andrew,
>
> Thanks for the reply. I have a couple more questions below. I seem to have two main problems: the isolated node updating the CIB, and corosync's behaviour in response to ifdown.
>
>> Why isn't your normal fencing device working?
> My normal fencing is working and was in place for nearly all of my testing. I just tried the "suicide" option to see if it would prevent the isolated node from carrying out any CIB updates.
>
>
>> epoch is bumped after an election and a configuration change but NOT a status change.
>> so it shouldn't be making it to 102
> My log below shows that the cib-bootstrap-options property is being updated. Is this not a configuration change?

Yes, but who changed it? I wouldn't expect that to happen automatically.

>
>
>>> 1. My initial feeling was that the isolated node, Alice (which has no quorum), should not be updating a CIB that could potentially override the sane part of the cluster. Is that a fair comment?
>
>> Not as currently designed. Although there may be some improvements we can make in that area.
> Would you consider this a bug, or is there a case where this behaviour is desired?

It's probably a bug in the sense that we can do better. The fix will have to wait for 1.1.11 though. It's a simple change, but it needs a lot of testing to make sure any side-effects are accounted for.

>
>
> In the meantime, over the weekend I ran a script that brings down the network on the current DRBD master, randomly using one of two options: ifdown ethX, or adding an iptables rule to block all incoming and outgoing packets. All of the roughly 350 iptables-block scenarios recovered successfully (i.e. no split-brain), whereas 130 out of 350 ifdown scenarios resulted in split-brain (the script automatically repaired split-brain between test iterations). (Note that in order to aggravate the problem, these tests used stonith with an artificial delay before reset, while ensuring that the crm-fence-peer timeout is still greater than this delay -- I also intend to redo the tests under normal conditions.)
>
> Is this a known/expected issue, which effectively means I shouldn't test using "ifdown ethX"?

The general consensus over the years is that ifdown is not considered a valid test - even at the corosync level, without pacemaker involved.

> If so, is there some configuration I can apply to change the behaviour in response to ifdown?
> My major fear is that some network failure could trigger the code path that leads to the isolated node updating the CIB, etc.
>
>
> Thanks again,
>
> Tom
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: 15 July 2013 01:52
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Issue with an isolated node overriding CIB after rejoining main cluster
>
>
> On 12/07/2013, at 10:49 PM, "Howley, Tom" <tom.how...@hp.com> wrote:
>
>> Hi,
>>
>> pacemaker:1.1.6-2ubuntu3,
>
> ouch
>
>> corosync:1.4.2-2, drbd8-utils 2:8.3.11-0ubuntu1
>>
>> I have a three-node setup, with two nodes running DRBD, resource-level fencing enabled ('resource-and-stonith') and obviously stonith configured for each node.
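For reference, a DRBD resource-level fencing setup along those lines usually looks roughly like the sketch below ("tomtest" is the resource name taken from your logs; the handler paths may differ on your install):

    resource tomtest {
      disk {
        fencing resource-and-stonith;
      }
      handlers {
        # adds a constraint blocking promotion elsewhere while the peer is outdated
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint once the resync has completed
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }

crm-fence-peer.sh is what creates the constraint you refer to below, and crm-unfence-peer.sh is what should remove it only after the resync.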
>> In my current test case, I bring down the network interface on the DRBD primary/master node (using ifdown eth0, for example), which sometimes leads to split-brain when the isolated node rejoins the cluster - the serious problem is that upon rejoining, the isolated node is promoted to DRBD primary (despite the original fencing constraint), which opens us up to data loss for updates that occurred while that node was down.
>>
>> The exact problem scenario is as follows:
>> - Alice: DRBD Primary/Master, Bob: Secondary/Slave, Jim: Quorum node, Epoch=100
>> - ifdown eth0 on Alice
>> - Alice detects loss of the network interface, sets itself up as DC, and carries out some CIB updates (see log snippet below) that raise the epoch, say to Epoch=102
>
> epoch is bumped after an election and a configuration change but NOT a status change.
> so it shouldn't be making it to 102
>
>> - Alice is shot via stonith.
>> - Bob adds a fencing rule to the CIB to prevent promotion of DRBD on any other node, Epoch=101
>> - When Alice comes back and rejoins the cluster, the DC decides to sync to Alice's CIB, thereby removing the fencing rule prematurely (i.e. before the DRBD devices have resynched).
>> - In some cases: Alice is promoted to Primary/Master and fences the resource to prevent promotion on any other node.
>> - We now have split-brain and potential loss of data.
>>
>> So some questions on the above:
>> 1. My initial feeling was that the isolated node, Alice (which has no quorum), should not be updating a CIB that could potentially override the sane part of the cluster. Is that a fair comment?
>
> Not as currently designed. Although there may be some improvements we can make in that area.
>
>> 2. Is this issue just particular to my use of 'ifdown ethX' to disable the network? This is hinted at here: https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface
>> Has this issue been addressed, or will it be in the future?
>> 3. If 'ifdown ethX' is not valid, what is the best alternative that mimics what might happen in the real world? I have tried blocking connections using iptables rules, dropping all incoming and outgoing packets; initial testing appears to show different corosync behaviour that would hopefully not lead to my problem scenario, but I'm still in the process of confirming. I have also carried out some cable pulls and not run into issues yet, but this problem can be intermittent, so it really needs an automated way to test many times.
>> 4. The log snippet below from the isolated node shows that it updates the CIB twice some time after detecting loss of the network interface. Why does this happen? I believe that ultimately it is these CIB updates that increment the epoch, which leads to this CIB overriding the cluster's copy later.
>>
>> I have also tried a no-quorum-policy of 'suicide' in an attempt to prevent CIB updates by Alice, but it didn't make a difference.
>
> Why isn't your normal fencing device working?
>
>> Note that to facilitate log collection and analysis, I have added a delay to the stonith reset operation, but I have also set the timeout on the crm-fence-peer script to ensure that it is greater than this 'deadtime'.
>>
>> Any advice on this would be greatly appreciated.
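On question 3: blocking traffic with iptables, as you have started doing, is the closer approximation, because the interface - and therefore the address corosync is bound to - stays up. A minimal sketch, assuming eth0 is the cluster interface:

    # isolate the node without touching the interface state
    iptables -I INPUT  -i eth0 -j DROP
    iptables -I OUTPUT -o eth0 -j DROP

    # restore connectivity afterwards
    iptables -D INPUT  -i eth0 -j DROP
    iptables -D OUTPUT -o eth0 -j DROP

Cable pulls are also a reasonable approximation of a real failure, for much the same reason: the address stays assigned to the interface even when the link drops.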
>>
>> Thanks,
>>
>> Tom
>>
>> Log snippet showing isolated node updating the CIB, which results in epoch being incremented two times:
>>
>> Jul 10 13:42:54 stratus18 corosync[1268]: [TOTEM ] A processor failed, forming new configuration.
>> Jul 10 13:42:54 stratus18 corosync[1268]: [TOTEM ] The network interface is down.
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: TOMTEST-DEBUG: modified version
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: invoked for tomtest
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: TOMTEST-DEBUG: modified version
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: invoked for tomtest
>> Jul 10 13:42:55 stratus18 stonith-ng: [1276]: info: stonith_command: Processed st_execute from lrmd: rc=-1
>> Jul 10 13:42:55 stratus18 external/ipmi[20806]: [20816]: ERROR: error executing ipmitool: Connect failed: Network is unreachable#015 Unable to get Chassis Power Status#015
>> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20758]: Call cib_query failed (-41): Remote node did not respond
>> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20761]: Call cib_query failed (-41): Remote node did not respond
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #7 eth0, 192.168.185.150#123, interface stats: received=0, sent=0, dropped=0, active_time=912 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #4 eth0, fe80::7ae7:d1ff:fe22:5270#123, interface stats: received=0, sent=0, dropped=0, active_time=6080 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #3 eth0, 192.168.185.118#123, interface stats: received=52, sent=53, dropped=0, active_time=6080 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: 192.168.8.97 interface 192.168.185.118 -> (none)
>> Jul 10 13:42:55 stratus18 ntpd[1062]: peers refreshed
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2728: memb=1, new=0, lost=2
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: memb: .unknown. 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: lost: stratus18 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: lost: stratus20 2025433280
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2728: memb=1, new=0, lost=0
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Creating entry for node 16777343 born on 2728
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 16777343/unknown is now: member
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] ERROR: pcmk_peer_update: Something strange happened: 1
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_mark_unseen_peer_dead: Node stratus17 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 1975101632/stratus17 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_mark_unseen_peer_dead: Node stratus18 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 1991878848/stratus18 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_mark_unseen_peer_dead: Node stratus20 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 2025433280/stratus20 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] WARN: pcmk_update_nodeid: Detected local node id change: 1991878848 -> 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: destroy_ais_node: Destroying entry for node 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] notice: ais_remove_peer: Removed dead peer 1991878848 from the membership list
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_remove_peer: Sending removal of 1991878848 to 2 children
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: 0x13d9520 Node 16777343 now known as stratus18 (was: (null))
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node stratus18 now has 1 quorum votes (was 0)
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node stratus18 now has process list: 00000000000000000000000000111312 (1118994)
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: send_member_notification: Sending membership update 2728 to 2 children
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: 0x13d9520 Node 16777343 ((null)) born on: 2708
>> Jul 10 13:42:55 stratus18 corosync[1268]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Membership 2728: quorum retained
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Removing peer 1991878848/1991878848
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: reap_crm_member: Peer 1991878848 is unknown
>> Jul 10 13:42:55 stratus18 cib: [1277]: notice: ais_dispatch_message: Membership 2728: quorum lost
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117) votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120) votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]: [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:3 left:3)
>> Jul 10 13:42:55 stratus18 corosync[1268]: [MAIN ] Completed service synchronization, ready to provide service.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum retained
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Removing peer 1991878848/1991878848
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: reap_crm_member: Peer 1991878848 is unknown
>> Jul 10 13:42:55 stratus18 crmd: [1281]: notice: ais_dispatch_message: Membership 2728: quorum lost
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: stratus17 is now lost (was member)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117) votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: stratus20 is now lost (was member)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120) votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: check_dead_member: Our DC node (stratus20) left the cluster
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Unset DC stratus20
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_te_control: Registering TE UUID: 6e335eff-5e48-4fc1-9003-0537ae948dfd
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: set_graph_functions: Setting custom graph functions
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_takeover: Taking over DC status for this partition
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_readwrite: We are now in R/W mode
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/57, version=0.76.46): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/58, version=0.76.47): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/60, version=0.76.48): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: join_make_offer: Making join offers based on membership 2728
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum still lost
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/62, version=0.76.49): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting expected votes to 2
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Set DC to stratus18 (3.0.5)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Checking for expired actions every 900000ms
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Sending expected-votes=3 to corosync
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum still lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_expected_votes: Expected quorum votes 2 -> 3
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" epoch="76" num_updates="49" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <nvpair value="3" id="cib-bootstrap-options-expected-quorum-votes" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" cib-last-written="Wed Jul 10 13:25:58 2013" crm_feature_set="3.0.5" epoch="77" have-quorum="1" num_updates="1" update-client="crmd" update-origin="stratus17" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/65, version=0.77.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting expected votes to 3
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_finalize: join-1: Syncing the CIB from stratus18 to the rest of the cluster
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" epoch="77" num_updates="1" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <nvpair value="2" id="cib-bootstrap-options-expected-quorum-votes" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" cib-last-written="Wed Jul 10 13:42:55 2013" crm_feature_set="3.0.5" epoch="78" have-quorum="1" num_updates="1" update-client="crmd" update-origin="stratus18" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/68, version=0.78.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/69, version=0.78.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 lrmd: [1278]: info: stonith_api_device_metadata: looking up external/ipmi/heartbeat metadata
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/70, version=0.78.2): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_ack: join-1: Updating node state to member for stratus18
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='stratus18']/lrm (origin=local/crmd/71, version=0.78.3): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: erase_xpath_callback: Deletion of "//node_state[@uname='stratus18']/lrm": ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_quorum: Updating quorum status to false (call=75)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: do_te_invoke:167 - Triggered transition abort (complete=1) : Peer Cancelled
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 76: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
>> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/73, version=0.78.5): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for shutdown action on stratus17
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: Stonith/shutdown of stratus17 not matched
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=stratus17, magic=NA, cib=0.78.6) : Node failure
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for shutdown action on stratus20
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: Stonith/shutdown of stratus20 not matched
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=stratus20, magic=NA, cib=0.78.6) : Node failure
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 77: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 78: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/75, version=0.78.7): ok (rc=0)
>> Jul 10 13:42:56 stratus18 crmd: [1281]: info: do_pe_invoke_callback: Invoking the PE: query=78, ref=pe_calc-dc-1373460176-49, seq=2728, quorate=0
>> Jul 10 13:42:56 stratus18 attrd: [1279]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_tomtest:0 (10000)
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node stratus17 will be fenced because it is un-expectedly down
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: Node stratus17 is unclean
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node stratus20 will be fenced because it is un-expectedly down
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: Node stratus20 is unclean
>> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error - drbd_tomtest:0_last_failure_0 failed with rc=5: Preventing ms_drbd_tomtest from re-starting on stratus20
>> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error - tomtest_mysql_SERVICE_last_failure_0 failed with rc=5: Preventing tomtest_mysql_SERVICE from re-starting on stratus20
>>
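One more debugging aid: the tuple that decides which copy of the CIB wins on rejoin (admin_epoch, epoch, num_updates) is carried on the cib element itself, so while reproducing this you can watch what each node is advertising with something like:

    # the first line of the CIB carries admin_epoch, epoch and num_updates
    cibadmin --query --local | head -n 1

Comparing that output on Alice and Bob just before the rejoin should show exactly where the extra epoch bump is coming from.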
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org