On 16/07/2013, at 2:05 AM, "Howley, Tom" <tom.how...@hp.com> wrote:
> Hi Andrew,
>
> Thanks for the reply. I have a couple more questions below. I seem to have two main problems: the isolated node updating the CIB, and corosync's behaviour in response to ifdown.
>
>> Why isn't your normal fencing device working?
> My normal fencing is working and was in place for nearly all of my testing. I just tried the "suicide" option to see if it would prevent the isolated node from carrying out any CIB updates.
>
>
>> epoch is bumped after an election and a configuration change but NOT a status change.
>> so it shouldn't be making it to 102
> My log below shows that the cib-bootstrap-options property is being updated. Is this not a configuration change?

Yes, but who changed it? I wouldn't expect that to happen automatically.

>
>
>>> 1. My initial feeling was that the isolated node, Alice (which has no quorum), should not be updating a CIB that could potentially override the sane part of the cluster. Is that a fair comment?
>
>> Not as currently designed. Although there may be some improvements we can make in that area.
> Would you consider this a bug, or is there a case where this behaviour is desired?

It's probably a bug in the sense that we can do better. The fix will have to wait for 1.1.11 though. It's a simple change, but it needs a lot of testing to make sure any side-effects are accounted for.

>
>
> In the meantime, over the weekend I ran a script that brings down the network on the current DRBD master, randomly using one of two options: ifdown ethX, or adding an iptables rule to block all incoming and outgoing packets. All of the roughly 350 iptables-block scenarios recovered successfully (i.e. no split-brain), whereas 130 out of 350 ifdown scenarios resulted in split-brain (the script automatically repaired split-brain between test iterations). (Note that in order to aggravate the problem, these tests used stonith with an artificial delay before reset, while ensuring that the crm-fence-peer timeout is still greater than this delay -- I also intend to redo the tests under normal conditions.)
>
> Is this a known/expected issue, which effectively means I shouldn't test using "ifdown ethX"?

The general consensus over the years is that ifdown is not considered a valid test - even at the corosync level, without pacemaker involved.

> If so, is there some configuration I can apply to change the behaviour in response to ifdown?
> My major fear is that some network failure could trigger the code path that leads to the isolated node updating the CIB, etc.
>
>
> Thanks again,
>
> Tom
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: 15 July 2013 01:52
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Issue with an isolated node overriding CIB after rejoining main cluster
>
>
> On 12/07/2013, at 10:49 PM, "Howley, Tom" <tom.how...@hp.com> wrote:
>
>> Hi,
>>
>> pacemaker:1.1.6-2ubuntu3,
>
> ouch
>
>> corosync:1.4.2-2, drbd8-utils 2:8.3.11-0ubuntu1
>>
>> I have a three-node setup, with two nodes running DRBD, resource-level fencing enabled ('resource-and-stonith') and obviously stonith configured for each node.
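For reference, a DRBD resource-level fencing setup along those lines usually looks roughly like the sketch below ("tomtest" is the resource name taken from your logs; the handler paths may differ on your install):

    resource tomtest {
      disk {
        fencing resource-and-stonith;
      }
      handlers {
        # adds a constraint blocking promotion elsewhere while the peer is outdated
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint once the resync has completed
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }

crm-fence-peer.sh is what creates the constraint you refer to below, and crm-unfence-peer.sh is what should remove it only after the resync.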
>> In my current test case, I bring down the network interface on the DRBD primary/master node (using ifdown eth0, for example), which sometimes leads to split-brain when the isolated node rejoins the cluster - the serious problem is that upon rejoining, the isolated node is promoted to DRBD primary (despite the original fencing constraint), which opens us up to data loss for updates that occurred while that node was down.
>>
>> The exact problem scenario is as follows:
>> - Alice: DRBD Primary/Master, Bob: Secondary/Slave, Jim: Quorum node, Epoch=100
>> - ifdown eth0 on Alice
>> - Alice detects loss of the network interface, sets itself up as DC, and carries out some CIB updates (see log snippet below) that raise the epoch, say to Epoch=102
>
> epoch is bumped after an election and a configuration change but NOT a status change.
> so it shouldn't be making it to 102
>
>> - Alice is shot via stonith.
>> - Bob adds a fencing rule to the CIB to prevent promotion of DRBD on any other node, Epoch=101
>> - When Alice comes back and rejoins the cluster, the DC decides to sync to Alice's CIB, thereby removing the fencing rule prematurely (i.e. before the DRBD devices have resynched).
>> - In some cases: Alice is promoted to Primary/Master and fences the resource to prevent promotion on any other node.
>> - We now have split-brain and potential loss of data.
>>
>> So some questions on the above:
>> 1. My initial feeling was that the isolated node, Alice (which has no quorum), should not be updating a CIB that could potentially override the sane part of the cluster. Is that a fair comment?
>
> Not as currently designed. Although there may be some improvements we can make in that area.
>
>> 2. Is this issue just particular to my use of 'ifdown ethX' to disable the network? This is hinted at here: https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface
>> Has this issue been addressed, or will it be in the future?
>> 3. If 'ifdown ethX' is not valid, what is the best alternative that mimics what might happen in the real world? I have tried blocking connections using iptables rules, dropping all incoming and outgoing packets; initial testing appears to show different corosync behaviour that would hopefully not lead to my problem scenario, but I'm still in the process of confirming. I have also carried out some cable pulls and not run into issues yet, but this problem can be intermittent, so it really needs an automated way to test many times.
>> 4. The log snippet below from the isolated node shows that it updates the CIB twice some time after detecting loss of the network interface. Why does this happen? I believe that ultimately it is these CIB updates that increment the epoch, which leads to this CIB overriding the cluster's copy later.
>>
>> I have also tried a no-quorum-policy of 'suicide' in an attempt to prevent CIB updates by Alice, but it didn't make a difference.
>
> Why isn't your normal fencing device working?
>
>> Note that to facilitate log collection and analysis, I have added a delay to the stonith reset operation, but I have also set the timeout on the crm-fence-peer script to ensure that it is greater than this 'deadtime'.
>>
>> Any advice on this would be greatly appreciated.
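On question 3: blocking traffic with iptables, as you have started doing, is the closer approximation, because the interface - and therefore the address corosync is bound to - stays up. A minimal sketch, assuming eth0 is the cluster interface:

    # isolate the node without touching the interface state
    iptables -I INPUT  -i eth0 -j DROP
    iptables -I OUTPUT -o eth0 -j DROP

    # restore connectivity afterwards
    iptables -D INPUT  -i eth0 -j DROP
    iptables -D OUTPUT -o eth0 -j DROP

Cable pulls are also a reasonable approximation of a real failure, for much the same reason: the address stays assigned to the interface even when the link drops.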
>>
>> Thanks,
>>
>> Tom
>>
>> Log snippet showing isolated node updating the CIB, which results in epoch being incremented two times:
>>
>> Jul 10 13:42:54 stratus18 corosync[1268]: [TOTEM ] A processor failed, forming new configuration.
>> Jul 10 13:42:54 stratus18 corosync[1268]: [TOTEM ] The network interface is down.
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: TOMTEST-DEBUG: modified version
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: invoked for tomtest
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: TOMTEST-DEBUG: modified version
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: invoked for tomtest
>> Jul 10 13:42:55 stratus18 stonith-ng: [1276]: info: stonith_command: Processed st_execute from lrmd: rc=-1
>> Jul 10 13:42:55 stratus18 external/ipmi[20806]: [20816]: ERROR: error executing ipmitool: Connect failed: Network is unreachable#015 Unable to get Chassis Power Status#015
>> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20758]: Call cib_query failed (-41): Remote node did not respond
>> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20761]: Call cib_query failed (-41): Remote node did not respond
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #7 eth0, 192.168.185.150#123, interface stats: received=0, sent=0, dropped=0, active_time=912 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #4 eth0, fe80::7ae7:d1ff:fe22:5270#123, interface stats: received=0, sent=0, dropped=0, active_time=6080 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #3 eth0, 192.168.185.118#123, interface stats: received=52, sent=53, dropped=0, active_time=6080 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: 192.168.8.97 interface 192.168.185.118 -> (none)
>> Jul 10 13:42:55 stratus18 ntpd[1062]: peers refreshed
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2728: memb=1, new=0, lost=2
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: memb: .unknown. 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: lost: stratus18 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: lost: stratus20 2025433280
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2728: memb=1, new=0, lost=0
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Creating entry for node 16777343 born on 2728
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 16777343/unknown is now: member
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] ERROR: pcmk_peer_update: Something strange happened: 1
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_mark_unseen_peer_dead: Node stratus17 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 1975101632/stratus17 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_mark_unseen_peer_dead: Node stratus18 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 1991878848/stratus18 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_mark_unseen_peer_dead: Node stratus20 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node 2025433280/stratus20 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] WARN: pcmk_update_nodeid: Detected local node id change: 1991878848 -> 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: destroy_ais_node: Destroying entry for node 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] notice: ais_remove_peer: Removed dead peer 1991878848 from the membership list
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: ais_remove_peer: Sending removal of 1991878848 to 2 children
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: 0x13d9520 Node 16777343 now known as stratus18 (was: (null))
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node stratus18 now has 1 quorum votes (was 0)
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: Node stratus18 now has process list: 00000000000000000000000000111312 (1118994)
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: send_member_notification: Sending membership update 2728 to 2 children
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_member: 0x13d9520 Node 16777343 ((null)) born on: 2708
>> Jul 10 13:42:55 stratus18 corosync[1268]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Membership 2728: quorum retained
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Removing peer 1991878848/1991878848
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: reap_crm_member: Peer 1991878848 is unknown
>> Jul 10 13:42:55 stratus18 cib: [1277]: notice: ais_dispatch_message: Membership 2728: quorum lost
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117) votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120) votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]: [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:3 left:3)
>> Jul 10 13:42:55 stratus18 corosync[1268]: [MAIN ] Completed service synchronization, ready to provide service.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum retained
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Removing peer 1991878848/1991878848
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: reap_crm_member: Peer 1991878848 is unknown
>> Jul 10 13:42:55 stratus18 crmd: [1281]: notice: ais_dispatch_message: Membership 2728: quorum lost
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: stratus17 is now lost (was member)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117) votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: stratus20 is now lost (was member)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120) votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: check_dead_member: Our DC node (stratus20) left the cluster
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Unset DC stratus20
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_te_control: Registering TE UUID: 6e335eff-5e48-4fc1-9003-0537ae948dfd
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: set_graph_functions: Setting custom graph functions
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_takeover: Taking over DC status for this partition
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_readwrite: We are now in R/W mode
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/57, version=0.76.46): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/58, version=0.76.47): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/60, version=0.76.48): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: join_make_offer: Making join offers based on membership 2728
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum still lost
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/62, version=0.76.49): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting expected votes to 2
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Set DC to stratus18 (3.0.5)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Checking for expired actions every 900000ms
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Sending expected-votes=3 to corosync
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum still lost
>> Jul 10 13:42:55 stratus18 corosync[1268]: [pcmk ] info: update_expected_votes: Expected quorum votes 2 -> 3
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" epoch="76" num_updates="49" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <nvpair value="3" id="cib-bootstrap-options-expected-quorum-votes" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" cib-last-written="Wed Jul 10 13:25:58 2013" crm_feature_set="3.0.5" epoch="77" have-quorum="1" num_updates="1" update-client="crmd" update-origin="stratus17" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/65, version=0.77.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting expected votes to 3
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_finalize: join-1: Syncing the CIB from stratus18 to the rest of the cluster
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" epoch="77" num_updates="1" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <nvpair value="2" id="cib-bootstrap-options-expected-quorum-votes" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" cib-last-written="Wed Jul 10 13:42:55 2013" crm_feature_set="3.0.5" epoch="78" have-quorum="1" num_updates="1" update-client="crmd" update-origin="stratus18" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/68, version=0.78.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/69, version=0.78.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 lrmd: [1278]: info: stonith_api_device_metadata: looking up external/ipmi/heartbeat metadata
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/70, version=0.78.2): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_ack: join-1: Updating node state to member for stratus18
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='stratus18']/lrm (origin=local/crmd/71, version=0.78.3): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: erase_xpath_callback: Deletion of "//node_state[@uname='stratus18']/lrm": ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_quorum: Updating quorum status to false (call=75)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: do_te_invoke:167 - Triggered transition abort (complete=1) : Peer Cancelled
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 76: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
>> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/73, version=0.78.5): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for shutdown action on stratus17
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: Stonith/shutdown of stratus17 not matched
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=stratus17, magic=NA, cib=0.78.6) : Node failure
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for shutdown action on stratus20
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: Stonith/shutdown of stratus20 not matched
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=stratus20, magic=NA, cib=0.78.6) : Node failure
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 77: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 78: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/75, version=0.78.7): ok (rc=0)
>> Jul 10 13:42:56 stratus18 crmd: [1281]: info: do_pe_invoke_callback: Invoking the PE: query=78, ref=pe_calc-dc-1373460176-49, seq=2728, quorate=0
>> Jul 10 13:42:56 stratus18 attrd: [1279]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_tomtest:0 (10000)
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node stratus17 will be fenced because it is un-expectedly down
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: Node stratus17 is unclean
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node stratus20 will be fenced because it is un-expectedly down
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: Node stratus20 is unclean
>> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error - drbd_tomtest:0_last_failure_0 failed with rc=5: Preventing ms_drbd_tomtest from re-starting on stratus20
>> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error - tomtest_mysql_SERVICE_last_failure_0 failed with rc=5: Preventing tomtest_mysql_SERVICE from re-starting on stratus20
>>
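One more debugging aid: the tuple that decides which copy of the CIB wins on rejoin (admin_epoch, epoch, num_updates) is carried on the cib element itself, so while reproducing this you can watch what each node is advertising with something like:

    # the first line of the CIB carries admin_epoch, epoch and num_updates
    cibadmin --query --local | head -n 1

Comparing that output on Alice and Bob just before the rejoin should show exactly where the extra epoch bump is coming from.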
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org