Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-22 Thread Brian J. Murrell (brian)
On Mon, 2014-10-13 at 12:51 +1100, Andrew Beekhof wrote:
 
 Even the same address can be a problem. That brief window where things were 
 getting renewed can screw up corosync.

But as I proved, there was no renewal at all during the period of this
entire pacemaker run, so the use of DHCP here is a red herring and does
not explain the observed behaviour.

 Never ever use dhcp for a cluster node. Ever. Really, never.

Fair enough.  But since this was not the cause of the problem, it's
still unexplained.  Is it a bug in pacemaker that it doesn't handle this
mysterious third node's appearance/disappearance and that it fouls up
the cluster?

 Yes. That is what nodeid's are calculated from.
 Different nodeid == different address

So your theory is that corosync on one of the nodes momentarily decided
to change which interface it was binding to and ...

 localhost is the most common one

... bound to localhost?  If so, I guess I should take this to the
corosync list.

b.






Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-12 Thread Andrew Beekhof

On 11 Oct 2014, at 1:35 am, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 On Wed, 2014-10-08 at 12:39 +1100, Andrew Beekhof wrote:
 On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) 
 brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
 
 Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
 and node6 I saw an unknown third node being added to the cluster,
 but only on node5:
 
 Is either node using dhcp?
 
 Yes, they both are.  The server is the ISC DHCP server (on EL6) and the
 address pool is much larger than the number of nodes.  That is all just
 to say that the DHCP server serving these nodes abides by the DHCP RFC's
 recommendation to allow clients to continue to use addresses they have
 already been assigned when making a renewal request.  And indeed, it
 gives them the same address they had previously after a lease expiry, as
 long as the pool is not constrained and the address is not needed to
 satisfy a request from a different machine.
 
 I would guess node6 got a new IP address
 
 These nodes are using the ISC DHCP client.  That DHCP client logs in the
 same log (/var/log/messages) as was posted in my prior message when it
 renews a lease with messages such as:
 
 Oct 10 05:56:19 node6 dhclient[1026]: DHCPREQUEST on eth0 to 10.14.80.6 port 
 67 (xid=0x4f11c576)
 Oct 10 05:56:19 node6 dhclient[1026]: DHCPACK from 10.14.80.6 (xid=0x4f11c576)
 Oct 10 05:56:20 node6 dhclient[1026]: bound to 10.14.82.141 -- renewal in 
 8546 seconds.
 
 In the logs that I pasted the messages from in my previous message, such
 messages don't even exist because the nodes are not left up long enough
 to even get to a lease expiry.  These are test nodes and so are
 rebooted frequently.
 
 TL;DR: I am quite certain the node did not get a new/different address.

Even the same address can be a problem. That brief window where things were 
getting renewed can screw up corosync.
Never ever use dhcp for a cluster node. Ever. Really, never.

 
 (or that corosync decided to bind to a different one)
 
 Bind to a different what?  Address?

Yes. That is what nodeid's are calculated from.
Different nodeid == different address

  As in binding to an address that
 was not even configured on the machine?

localhost is the most common one
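
For what it's worth, with the corosync 1.x plugin and IPv4 the
auto-generated nodeid appears to be just the ring0 address bytes copied
into a host-order 32-bit integer, so on a little-endian (x86) host it can
be decoded back into an address.  A rough sketch in Python, under that
assumption, using the nodeids from the original post's pcmk_peer_update
lines:

    import socket, struct

    def nodeid_to_ipv4(nodeid):
        # Assumes corosync produced the nodeid by copying the IPv4 address
        # (network byte order) into an unsigned int on a little-endian
        # host; packing the id back as little-endian recovers those bytes.
        return socket.inet_ntoa(struct.pack("<I", nodeid))

    # nodeids from the pcmk_peer_update log lines: node6, node5, and the
    # unexplained third member
    for nodeid in (3713011210, 3729788426, 2085752330):
        print(nodeid, "->", nodeid_to_ipv4(nodeid))

If that derivation is right, 127.0.0.1 would correspond to nodeid
16777343 on such a host, so the output shows at a glance whether the
stray 2085752330 is localhost or some other address on the local network.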

 
 b.
 
 
 
 





Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-10 Thread Brian J. Murrell (brian)
On Wed, 2014-10-08 at 12:39 +1100, Andrew Beekhof wrote:
 On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) 
 brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
 
  Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
  and node6 I saw an unknown third node being added to the cluster,
  but only on node5:
 
 Is either node using dhcp?

Yes, they both are.  The server is the ISC DHCP server (on EL6) and the
address pool is much larger than the number of nodes.  That is all just
to say that the DHCP server serving these nodes abides by the DHCP RFC's
recommendation to allow clients to continue to use addresses they have
already been assigned when making a renewal request.  And indeed, it
gives them the same address they had previously after a lease expiry, as
long as the pool is not constrained and the address is not needed to
satisfy a request from a different machine.
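
For completeness, if DHCP-managed cluster nodes are kept, an ISC dhcpd
host reservation takes the dynamic pool out of the equation entirely; a
sketch with placeholder MAC and address values:

    host node5 {
        hardware ethernet 00:16:3e:00:00:05;   # placeholder MAC
        fixed-address 10.14.82.140;            # placeholder address outside the dynamic range
    }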

 I would guess node6 got a new IP address

These nodes are using the ISC DHCP client.  That DHCP client logs in the
same log (/var/log/messages) as was posted in my prior message when it
renews a lease with messages such as:

Oct 10 05:56:19 node6 dhclient[1026]: DHCPREQUEST on eth0 to 10.14.80.6 port 67 
(xid=0x4f11c576)
Oct 10 05:56:19 node6 dhclient[1026]: DHCPACK from 10.14.80.6 (xid=0x4f11c576)
Oct 10 05:56:20 node6 dhclient[1026]: bound to 10.14.82.141 -- renewal in 8546 
seconds.

In the logs that I pasted the messages from in my previous message, such
messages don't even exist because the nodes are not left up long enough
to even get to a lease expiry.  These are test nodes and so are
rebooted frequently.
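
One quick way to confirm the absence of renewals is to scan the syslog
for dhclient lease traffic around the incident; a rough sketch, assuming
the standard EL6 /var/log/messages location:

    import re

    # dhclient logs DHCPREQUEST/DHCPACK/"bound to" lines to
    # /var/log/messages on EL6; if none of them fall inside the incident
    # window, there was no renewal to blame.
    lease_event = re.compile(r'dhclient\[\d+\]: (DHCPREQUEST|DHCPACK|bound to)')

    with open('/var/log/messages') as log:
        for line in log:
            if lease_event.search(line):
                print(line.rstrip())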

TL;DR: I am quite certain the node did not get a new/different address.

 (or that corosync decided to bind to a different one)

Bind to a different what?  Address?  As in binding to an address that
was not even configured on the machine?

b.






[Pacemaker] unknown third node added to a 2 node cluster?

2014-10-07 Thread Brian J. Murrell (brian)
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
and node6 I saw an unknown third node being added to the cluster,
but only on node5:

Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 12: memb=2, new=0, lost=0
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: memb: 
node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: memb: 
node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 12: memb=3, new=1, lost=0
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: update_member: Creating 
entry for node 2085752330 born on 12
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: update_member: Node 
2085752330/unknown is now: member
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: NEW:  
.pending. 2085752330
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
.pending. 2085752330

Above is where this third node seems to appear.

Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: 
send_member_notification: Sending membership update 12 to 2 children
Sep 18 22:52:16 node5 corosync[17321]:   [TOTEM ] A processor joined or left 
the membership and a new membership was formed.
Sep 18 22:52:16 node5 cib[17371]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node (null)[2085752330] - state is now member (was 
(null))
Sep 18 22:52:16 node5 crmd[17376]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node (null)[2085752330] - state is now member (was 
(null))
Sep 18 22:52:16 node5 crmd[17376]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 18 22:52:16 node5 crmd[17376]:error: join_make_offer: No recipient for 
welcome message
Sep 18 22:52:16 node5 crmd[17376]:  warning: do_state_transition: Only 2 of 3 
cluster nodes are eligible to run resources - continue 0
Sep 18 22:52:16 node5 attrd[17374]:   notice: attrd_local_callback: Sending 
full refresh (origin=crmd)
Sep 18 22:52:16 node5 attrd[17374]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: probe_complete (true)
Sep 18 22:52:16 node5 stonith-ng[17372]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: Diff: --- 0.31.2
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: Diff: +++ 0.32.1 
4a679012144955c802557a39707247a2
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: --   <nvpair 
value="Stopped" id="res1-meta_attributes-target-role"/>
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: ++   <nvpair 
name="target-role" id="res1-meta_attributes-target-role" value="Started"/>
Sep 18 22:52:16 node5 pengine[17375]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Sep 18 22:52:16 node5 pengine[17375]:   notice: LogActions: Start   
res1#011(node5)
Sep 18 22:52:16 node5 crmd[17376]:   notice: te_rsc_command: Initiating action 
7: start res1_start_0 on node5 (local)
Sep 18 22:52:16 node5 pengine[17375]:   notice: process_pe_message: Calculated 
Transition 22: /var/lib/pacemaker/pengine/pe-input-165.bz2
Sep 18 22:52:16 node5 stonith-ng[17372]:   notice: stonith_device_register: 
Device 'st-fencing' already existed in device list (1 active devices)

On node6 at the same time the following was in the log:

Sep 18 22:52:15 node6 corosync[11178]:   [TOTEM ] Incrementing problem counter 
for seqid 5 iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:16 node6 corosync[11178]:   [TOTEM ] Incrementing problem counter 
for seqid 8 iface 10.128.0.221 to [2 of 10]
Sep 18 22:52:17 node6 corosync[11178]:   [TOTEM ] Decrementing problem counter 
for iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:19 node6 corosync[11178]:   [TOTEM ] ring 1 active with no faults

Any idea what's going on here?

Cheers,
b.






Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-07 Thread Andrew Beekhof

On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
 and node6 I saw an unknown third node being added to the cluster,
 but only on node5:

Is either node using dhcp?
I would guess node6 got a new IP address (or that corosync decided to bind to a 
different one)

 
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] notice: pcmk_peer_update: 
 Transitional membership event on ring 12: memb=2, new=0, lost=0
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: 
 memb: node6 3713011210
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: 
 memb: node5 3729788426
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] notice: pcmk_peer_update: 
 Stable membership event on ring 12: memb=3, new=1, lost=0
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: update_member: 
 Creating entry for node 2085752330 born on 12
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: update_member: Node 
 2085752330/unknown is now: member
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: 
 NEW:  .pending. 2085752330
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: 
 MEMB: node6 3713011210
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: 
 MEMB: node5 3729788426
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: 
 MEMB: .pending. 2085752330
 
 Above is where this third node seems to appear.
 
 Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: 
 send_member_notification: Sending membership update 12 to 2 children
 Sep 18 22:52:16 node5 corosync[17321]:   [TOTEM ] A processor joined or left 
 the membership and a new membership was formed.
 Sep 18 22:52:16 node5 cib[17371]:   notice: crm_update_peer_state: 
 plugin_handle_membership: Node (null)[2085752330] - state is now member (was 
 (null))
 Sep 18 22:52:16 node5 crmd[17376]:   notice: crm_update_peer_state: 
 plugin_handle_membership: Node (null)[2085752330] - state is now member (was 
 (null))
 Sep 18 22:52:16 node5 crmd[17376]:   notice: do_state_transition: State 
 transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
 origin=abort_transition_graph ]
 Sep 18 22:52:16 node5 crmd[17376]:error: join_make_offer: No recipient 
 for welcome message
 Sep 18 22:52:16 node5 crmd[17376]:  warning: do_state_transition: Only 2 of 3 
 cluster nodes are eligible to run resources - continue 0
 Sep 18 22:52:16 node5 attrd[17374]:   notice: attrd_local_callback: Sending 
 full refresh (origin=crmd)
 Sep 18 22:52:16 node5 attrd[17374]:   notice: attrd_trigger_update: Sending 
 flush op to all hosts for: probe_complete (true)
 Sep 18 22:52:16 node5 stonith-ng[17372]:   notice: unpack_config: On loss of 
 CCM Quorum: Ignore
 Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: Diff: --- 0.31.2
 Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: Diff: +++ 0.32.1 
 4a679012144955c802557a39707247a2
 Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: --   <nvpair 
 value="Stopped" id="res1-meta_attributes-target-role"/>
 Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: ++   <nvpair 
 name="target-role" id="res1-meta_attributes-target-role" value="Started"/>
 Sep 18 22:52:16 node5 pengine[17375]:   notice: unpack_config: On loss of CCM 
 Quorum: Ignore
 Sep 18 22:52:16 node5 pengine[17375]:   notice: LogActions: Start   
 res1#011(node5)
 Sep 18 22:52:16 node5 crmd[17376]:   notice: te_rsc_command: Initiating 
 action 7: start res1_start_0 on node5 (local)
 Sep 18 22:52:16 node5 pengine[17375]:   notice: process_pe_message: 
 Calculated Transition 22: /var/lib/pacemaker/pengine/pe-input-165.bz2
 Sep 18 22:52:16 node5 stonith-ng[17372]:   notice: stonith_device_register: 
 Device 'st-fencing' already existed in device list (1 active devices)
 
 On node6 at the same time the following was in the log:
 
 Sep 18 22:52:15 node6 corosync[11178]:   [TOTEM ] Incrementing problem 
 counter for seqid 5 iface 10.128.0.221 to [1 of 10]
 Sep 18 22:52:16 node6 corosync[11178]:   [TOTEM ] Incrementing problem 
 counter for seqid 8 iface 10.128.0.221 to [2 of 10]
 Sep 18 22:52:17 node6 corosync[11178]:   [TOTEM ] Decrementing problem 
 counter for iface 10.128.0.221 to [1 of 10]
 Sep 18 22:52:19 node6 corosync[11178]:   [TOTEM ] ring 1 active with no faults
 
 Any idea what's going on here?
 
 Cheers,
 b.
 
 
 
 


