Hi Andrew,

Thanks for the reply. I have a couple more questions below. I seem to have 
two main problems: the isolated node updating the CIB, and corosync's 
behaviour in response to ifdown.

> Why isn't your normal fencing device working?
My normal fencing is working and was in place for nearly all of my testing. I 
just tried the "suicide" option to see whether it would prevent the isolated 
node from carrying out any CIB updates.
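
(For reference, I set the option with something like this via the crm shell:)

    # have a quorum-less partition self-fence rather than keep managing resources
    crm configure property no-quorum-policy=suicide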


> epoch is bumped after an election and a configuration change but NOT a status 
> change. 
> so it shouldn't be making it to 102
My log below shows that the cib-bootstrap-options property is being updated. Is 
this not a configuration change?
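
(In case it's useful, I've been watching the CIB version on each node with 
something along these lines:)

    # print the <cib .../> root element, which carries the version tuple
    # admin_epoch/epoch/num_updates
    cibadmin --query | head -1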


>> 1.       My initial feeling was that the isolated node, Alice (which has 
>> no quorum), should not be updating a CIB that could potentially override the 
>> sane part of the cluster. Is that a fair comment?

> Not as currently designed.  Although there may be some improvements we can 
> make in that area.
Would you consider this a bug, or is there a case where this behaviour is 
desired?


In the meantime, I ran a script over the weekend that repeatedly brings down 
the network on the current DRBD master, randomly using one of two options: 
ifdown ethX, or adding an iptables rule to block all incoming and outgoing 
packets. All of the roughly 350 packet-blocking scenarios recovered 
successfully (i.e. no split-brain), whereas 130 out of 350 ifdown scenarios 
resulted in split-brain (the script automatically repaired the split-brain 
between test iterations). (Note that in order to aggravate the problem, these 
tests used stonith with an artificial delay before reset, while ensuring that 
the crm-fence-peer timeout was still greater than this delay -- I also intend 
to redo the tests under normal conditions.)
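
(For reference, the core of the weekend test loop looks roughly like this -- 
simplified, with the split-brain detection/repair step elided, and with 
'ms_drbd_tomtest' and 'eth0' as in the logs below:)

    #!/bin/bash
    # Soak test sketch: isolate the current DRBD master via ifdown or via
    # iptables, wait for recovery, then check/repair any DRBD split-brain.
    for i in $(seq 1 350); do
        # find the node currently hosting the DRBD master
        master=$(crm_resource --resource ms_drbd_tomtest --locate \
                 | awk '/Master/ {print $(NF-1)}')
        if (( RANDOM % 2 )); then
            # scenario 1: bring the cluster interface down
            ssh "$master" "nohup ifdown eth0 >/dev/null 2>&1 &"
        else
            # scenario 2: drop all incoming and outgoing packets
            ssh "$master" "nohup sh -c 'iptables -I INPUT -j DROP; \
                iptables -I OUTPUT -j DROP' >/dev/null 2>&1 &"
        fi
        sleep 600    # allow stonith, failover and rejoin to complete
        # (split-brain detection/repair against DRBD generation IDs goes here)
    done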

Is this a known/expected issue, which effectively means I shouldn't test using 
"ifdown ethX"? If so, is there some configuration I can apply to change the 
behaviour in response to ifdown? My major fear is that some real network 
failure could trigger the code path that leads to the isolated node updating 
the CIB, etc.


Thanks again,

Tom

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: 15 July 2013 01:52
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Issue with an isolated node overriding CIB after 
rejoining main cluster


On 12/07/2013, at 10:49 PM, "Howley, Tom" <tom.how...@hp.com> wrote:

> Hi,
>  
> pacemaker:1.1.6-2ubuntu3,

ouch

> corosync:1.4.2-2, drbd8-utils 2:8.3.11-0ubuntu1
>  
> I have a three node setup, with two nodes running DRBD, resource-level 
> fencing enabled ('resource-and-stonith') and obviously stonith configured for 
> each node. In my current test case, I bring down the network interface on the 
> DRBD primary/master node (using ifdown eth0, for example), which sometimes 
> leads to split-brain when the isolated node rejoins the cluster - the serious 
> problem is that upon rejoining, the isolated node is promoted to DRBD primary 
> (despite the original fencing constraint), which opens us up to data loss for 
> updates that occurred while that node was down.
>  
> The exact problem scenario is as follows:
> -          Alice: DRBD Primary/Master, Bob: Secondary/Slave, Jim: Quorum 
> node, Epoch=100
> -          ifdown eth0 on Alice
> -          Alice detects loss of its network interface, sets itself up as DC, 
> carries out some CIB updates (see log snippet below) that raise the epoch 
> level, say Epoch=102

epoch is bumped after an election and a configuration change but NOT a status 
change.
so it shouldn't be making it to 102

> -          Alice is shot via stonith.
> -          Bob adds fencing rule to CIB to prevent promotion of DRBD on any 
> other node, Epoch=101
> -          When Alice comes back and rejoins the cluster, the DC decides to 
> sync to Alice's CIB, thereby removing the fencing rule prematurely (i.e. 
> before the DRBD devices have resynced).
> -          In some cases: Alice is promoted to Primary/Master and fences 
> resource to prevent promotion on any other node.
> -          We now have split-brain and potential loss of data.
>  
> So some questions on the above:
> 1.       My initial feeling was that the isolated node, Alice (which has no 
> quorum), should not be updating a CIB that could potentially override the sane 
> part of the cluster. Is that a fair comment?

Not as currently designed.  Although there may be some improvements we can make 
in that area.

> 2.       Is this issue just particular to my use of 'ifdown ethX' to disable 
> the network? This is hinted at here: 
> https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface
>  Has this issue been addressed, or will it be in the future?
> 3.       If 'ifdown ethX' is not valid, what is the best alternative that 
> mimics what might happen in the real world? I have tried blocking connections 
> using iptables rules, dropping all incoming and outgoing packets; initial 
> testing appears to show different corosync behaviour that would hopefully not 
> lead to my problem scenario, but I'm still in the process of confirming. I 
> have also carried out some cable pulls and not run into issues yet, but this 
> problem can be intermittent, so it really needs an automated way to test many 
> times.
> 4.       The log snippet below from the isolated node shows that it updates 
> the CIB twice sometime after detecting loss of network interface. Why does 
> this happen? I believe that ultimately it is these CIB updates that increment 
> the epoch, which leads to this CIB overriding the cluster later.
>  
> I have also tried a no-quorum-policy of 'suicide' in an attempt to prevent 
> CIB updates by Alice, but it didn't make a difference.

Why isn't your normal fencing device working?

> Note that to facilitate log collection and analysis, I have added a delay to 
> the stonith reset operation, but I have also set the timeout on the 
> crm-fence-peer script to ensure that it is greater than this 'deadtime'.
>  
> Any advice on this would be greatly appreciated.
>  
> Thanks,
>  
> Tom
>  
> Log snippet showing isolated node updating the CIB, which results in epoch 
> being incremented two times:
>  
> Jul 10 13:42:54 stratus18 corosync[1268]:   [TOTEM ] A processor failed, 
> forming new configuration.
> Jul 10 13:42:54 stratus18 corosync[1268]:   [TOTEM ] The network interface is 
> down.
> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: TOMTEST-DEBUG: modified 
> version
> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: invoked for tomtest
> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: TOMTEST-DEBUG: modified 
> version
> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: invoked for tomtest
> Jul 10 13:42:55 stratus18 stonith-ng: [1276]: info: stonith_command: 
> Processed st_execute from lrmd: rc=-1
> Jul 10 13:42:55 stratus18 external/ipmi[20806]: [20816]: ERROR: error 
> executing ipmitool: Connect failed: Network is unreachable#015 Unable to get 
> Chassis Power Status#015
> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20758]: Call cib_query failed 
> (-41): Remote node did not respond
> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20761]: Call cib_query failed 
> (-41): Remote node did not respond
> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #7 eth0, 
> 192.168.185.150#123, interface stats: received=0, sent=0, dropped=0, 
> active_time=912 secs
> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #4 eth0, 
> fe80::7ae7:d1ff:fe22:5270#123, interface stats: received=0, sent=0, 
> dropped=0, active_time=6080 secs
> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #3 eth0, 
> 192.168.185.118#123, interface stats: received=52, sent=53, dropped=0, 
> active_time=6080 secs
> Jul 10 13:42:55 stratus18 ntpd[1062]: 192.168.8.97 interface 192.168.185.118 
> -> (none)
> Jul 10 13:42:55 stratus18 ntpd[1062]: peers refreshed
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] notice: 
> pcmk_peer_update: Transitional membership event on ring 2728: memb=1, new=0, 
> lost=2
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: 
> memb: .unknown. 16777343
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: 
> lost: stratus18 1991878848
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: 
> lost: stratus20 2025433280
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] notice: 
> pcmk_peer_update: Stable membership event on ring 2728: memb=1, new=0, lost=0
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Creating entry for node 16777343 born on 2728
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Node 16777343/unknown is now: member
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: 
> MEMB: .pending. 16777343
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] ERROR: pcmk_peer_update: 
> Something strange happened: 1
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: 
> ais_mark_unseen_peer_dead: Node stratus17 was not seen in the previous 
> transition
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Node 1975101632/stratus17 is now: lost
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: 
> ais_mark_unseen_peer_dead: Node stratus18 was not seen in the previous 
> transition
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Node 1991878848/stratus18 is now: lost
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: 
> ais_mark_unseen_peer_dead: Node stratus20 was not seen in the previous 
> transition
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Node 2025433280/stratus20 is now: lost
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] WARN: 
> pcmk_update_nodeid: Detected local node id change: 1991878848 -> 16777343
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: destroy_ais_node: 
> Destroying entry for node 1991878848
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] notice: ais_remove_peer: 
> Removed dead peer 1991878848 from the membership list
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: ais_remove_peer: 
> Sending removal of 1991878848 to 2 children
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> 0x13d9520 Node 16777343 now known as stratus18 (was: (null))
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Node stratus18 now has 1 quorum votes (was 0)
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> Node stratus18 now has process list: 00000000000000000000000000111312 
> (1118994)
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: 
> send_member_notification: Sending membership update 2728 to 2 children
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 
> 0x13d9520 Node 16777343 ((null)) born on: 2708
> Jul 10 13:42:55 stratus18 corosync[1268]:   [TOTEM ] A processor joined or 
> left the membership and a new membership was formed.
> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now 
> has id: 16777343
> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Membership 
> 2728: quorum retained
> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Removing 
> peer 1991878848/1991878848
> Jul 10 13:42:55 stratus18 cib: [1277]: info: reap_crm_member: Peer 1991878848 
> is unknown
> Jul 10 13:42:55 stratus18 cib: [1277]: notice: ais_dispatch_message: 
> Membership 2728: quorum lost
> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus17: 
> id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117)  votes=1 
> born=2724 seen=2724 proc=00000000000000000000000000111312
> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus20: 
> id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120)  votes=1 born=4 
> seen=2724 proc=00000000000000000000000000111312
> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now 
> has id: 1991878848
> Jul 10 13:42:55 stratus18 corosync[1268]:   [CPG   ] chosen downlist: sender 
> r(0) ip(127.0.0.1) ; members(old:3 left:3)
> Jul 10 13:42:55 stratus18 corosync[1268]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_get_peer: Node stratus18 
> now has id: 16777343
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: 
> Membership 2728: quorum retained
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Removing 
> peer 1991878848/1991878848
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: reap_crm_member: Peer 
> 1991878848 is unknown
> Jul 10 13:42:55 stratus18 crmd: [1281]: notice: ais_dispatch_message: 
> Membership 2728: quorum lost
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: 
> stratus17 is now lost (was member)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node 
> stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117)  
> votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: 
> stratus20 is now lost (was member)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node 
> stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120)  
> votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: check_dead_member: Our DC node 
> (stratus20) left the cluster
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State 
> transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL 
> origin=check_dead_member ]
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Unset DC stratus20
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State 
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
> cause=C_FSA_INTERNAL origin=do_election_check ]
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_te_control: Registering TE 
> UUID: 6e335eff-5e48-4fc1-9003-0537ae948dfd
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: set_graph_functions: Setting 
> custom graph functions
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: unpack_graph: Unpacked 
> transition -1: 0 actions in 0 synapses
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_takeover: Taking over DC 
> status for this partition
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_readwrite: We are 
> now in R/W mode
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_master for section 'all' (origin=local/crmd/57, 
> version=0.76.46): ok (rc=0)
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section cib (origin=local/crmd/58, 
> version=0.76.47): ok (rc=0)
> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now 
> has id: 16777343
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section crm_config (origin=local/crmd/60, 
> version=0.76.48): ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: join_make_offer: Making join 
> offers based on membership 2728
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_offer_all: join-1: 
> Waiting on 1 outstanding join acks
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: 
> Membership 2728: quorum still lost
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section crm_config (origin=local/crmd/62, 
> version=0.76.49): ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting 
> expected votes to 2
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Set DC to stratus18 
> (3.0.5)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Shutdown 
> escalation occurs after: 1200000ms
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Checking 
> for expired actions every 900000ms
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Sending 
> expected-votes=3 to corosync
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: 
> Membership 2728: quorum still lost
> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: 
> update_expected_votes: Expected quorum votes 2 -> 3
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" 
> epoch="76" num_updates="49" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   <configuration >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     <crm_config >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       
> <cluster_property_set id="cib-bootstrap-options" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -         <nvpair 
> value="3" id="cib-bootstrap-options-expected-quorum-votes" />
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       
> </cluster_property_set>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     </crm_config>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   </configuration>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" 
> cib-last-written="Wed Jul 10 13:25:58 2013" crm_feature_set="3.0.5" 
> epoch="77" have-quorum="1" num_updates="1" update-client="crmd" 
> update-origin="stratus17" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   <configuration >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     <crm_config >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       
> <cluster_property_set id="cib-bootstrap-options" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +         <nvpair 
> id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" 
> value="2" />
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       
> </cluster_property_set>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     </crm_config>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   </configuration>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section crm_config (origin=local/crmd/65, 
> version=0.77.1): ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting 
> expected votes to 3
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State 
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 
> cluster nodes responded to the join offer.
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_finalize: join-1: 
> Syncing the CIB from stratus18 to the rest of the cluster
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" 
> epoch="77" num_updates="1" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   <configuration >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     <crm_config >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       
> <cluster_property_set id="cib-bootstrap-options" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -         <nvpair 
> value="2" id="cib-bootstrap-options-expected-quorum-votes" />
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       
> </cluster_property_set>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     </crm_config>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   </configuration>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" 
> cib-last-written="Wed Jul 10 13:42:55 2013" crm_feature_set="3.0.5" 
> epoch="78" have-quorum="1" num_updates="1" update-client="crmd" 
> update-origin="stratus18" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   <configuration >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     <crm_config >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       
> <cluster_property_set id="cib-bootstrap-options" >
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +         <nvpair 
> id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" 
> value="3" />
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       
> </cluster_property_set>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     </crm_config>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   </configuration>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section crm_config (origin=local/crmd/68, 
> version=0.78.1): ok (rc=0)
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_sync for section 'all' (origin=local/crmd/69, 
> version=0.78.1): ok (rc=0)
> Jul 10 13:42:55 stratus18 lrmd: [1278]: info: stonith_api_device_metadata: 
> looking up external/ipmi/heartbeat metadata
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section nodes (origin=local/crmd/70, 
> version=0.78.2): ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_ack: join-1: 
> Updating node state to member for stratus18
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_delete for section //node_state[@uname='stratus18']/lrm 
> (origin=local/crmd/71, version=0.78.3): ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: erase_xpath_callback: Deletion 
> of "//node_state[@uname='stratus18']/lrm": ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State 
> transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED 
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 
> cluster nodes are eligible to run resources.
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_final: Ensuring DC, 
> quorum and node attributes are up-to-date
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_quorum: Updating 
> quorum status to false (call=75)
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: 
> do_te_invoke:167 - Triggered transition abort (complete=1) : Peer Cancelled
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 76: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_local_callback: 
> Sending full refresh (origin=crmd)
> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_trigger_update: 
> Sending flush op to all hosts for: probe_complete (true)
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section nodes (origin=local/crmd/73, 
> version=0.78.5): ok (rc=0)
> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for 
> shutdown action on stratus17
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: 
> Stonith/shutdown of stratus17 not matched
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: 
> te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, 
> id=stratus17, magic=NA, cib=0.78.6) : Node failure
> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for 
> shutdown action on stratus20
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: 
> Stonith/shutdown of stratus20 not matched
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: 
> te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, 
> id=stratus20, magic=NA, cib=0.78.6) : Node failure
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 77: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 78: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation 
> complete: op cib_modify for section cib (origin=local/crmd/75, 
> version=0.78.7): ok (rc=0)
> Jul 10 13:42:56 stratus18 crmd: [1281]: info: do_pe_invoke_callback: Invoking 
> the PE: query=78, ref=pe_calc-dc-1373460176-49, seq=2728, quorate=0
> Jul 10 13:42:56 stratus18 attrd: [1279]: notice: attrd_trigger_update: 
> Sending flush op to all hosts for: master-drbd_tomtest:0 (10000)
> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: cluster_status: We do not 
> have quorum - fencing and resource management disabled
> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node 
> stratus17 will be fenced because it is un-expectedly down
> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: 
> Node stratus17 is unclean
> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node 
> stratus20 will be fenced because it is un-expectedly down
> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: 
> Node stratus20 is unclean
> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error 
> - drbd_tomtest:0_last_failure_0 failed with rc=5: Preventing ms_drbd_tomtest 
> from re-starting on stratus20
> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error 
> - tomtest_mysql_SERVICE_last_failure_0 failed with rc=5: Preventing 
> tomtest_mysql_SERVICE from re-starting on stratus20
>  

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
