Re: [Pacemaker] unknown third node added to a 2 node cluster?
On Mon, 2014-10-13 at 12:51 +1100, Andrew Beekhof wrote:
> Even the same address can be a problem. That brief window where things
> were getting renewed can screw up corosync.

But as I proved, there was no renewal at all during the period of this
entire pacemaker run, so the use of DHCP here is a red herring and does
not explain the observed behaviour.

> Never ever use dhcp for a cluster node. Ever. Really, never.

Fair enough. But since this was not the cause of this problem, it's still
unexplained. Is it a bug in pacemaker that it doesn't handle this
mysterious third node's appearance/disappearance and fouls up the cluster?

> Yes. That is what nodeids are calculated from.
> Different nodeid == different address

So your theory is that corosync on one of the nodes momentarily decided
to change which interface it was binding to and ...

> localhost is the most common one

... bound to localhost?

If so, I guess I should take this to the corosync list.

b.
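Since an auto-assigned corosync nodeid is calculated from an address, the nodeids in the membership log further down the thread can be decoded back into the addresses corosync believed the members had. The snippet below is only a sketch of that decoding and assumes corosync 1.x's usual IPv4 derivation when no explicit nodeid is configured (the ring-0 in_addr read as a host-order integer on a little-endian machine); the helper names are illustrative, not part of any corosync tooling.

import socket
import struct

def nodeid_to_ipv4(nodeid):
    # Assumption: the nodeid is the ring-0 IPv4 address stored as a 32-bit
    # in_addr and logged in host byte order on a little-endian machine.
    return socket.inet_ntoa(struct.pack("<I", nodeid))

def ipv4_to_nodeid(addr):
    # Inverse direction: the nodeid a given bound address would produce.
    return struct.unpack("<I", socket.inet_aton(addr))[0]

# nodeids taken from the node5 membership log in the original report
for nid in (3713011210, 3729788426, 2085752330):
    print(nid, "->", nodeid_to_ipv4(nid))
# 3713011210 -> 10.14.80.221
# 3729788426 -> 10.14.80.222
# 2085752330 -> 10.14.82.124

# and what a bind to localhost would have looked like
print("127.0.0.1 ->", ipv4_to_nodeid("127.0.0.1"))   # 16777343

Under that assumption, the two known members decode to adjacent 10.14.80.x addresses and the pending 2085752330 entry to another address in the same DHCP-served range, rather than to 127.0.0.1.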
Re: [Pacemaker] unknown third node added to a 2 node cluster?
On 11 Oct 2014, at 1:35 am, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote:
> On Wed, 2014-10-08 at 12:39 +1100, Andrew Beekhof wrote:
>> On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian)
>> brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
>>> Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
>>> and node6, I saw an unknown third node being added to the cluster,
>>> but only on node5:
>>
>> Is either node using dhcp?
>
> Yes, they both are. The server is the ISC DHCP server (on EL6) and the
> address pool is much more plentiful than the node count. That is all
> just to say that the DHCP server serving these nodes abides by the DHCP
> RFC's recommendation to allow clients to continue to use addresses they
> have already been assigned when making a renewal request. And indeed,
> it gives them the same address they had previously after a lease
> expiry, as long as the pool is not constrained and the address is not
> needed to satisfy a request from a different machine.
>
>> I would guess node6 got a new IP address
>
> These nodes are using the ISC DHCP client. That DHCP client logs in the
> same log (/var/log/messages) as was posted in my prior message when it
> renews a lease, with messages such as:
>
> Oct 10 05:56:19 node6 dhclient[1026]: DHCPREQUEST on eth0 to 10.14.80.6 port 67 (xid=0x4f11c576)
> Oct 10 05:56:19 node6 dhclient[1026]: DHCPACK from 10.14.80.6 (xid=0x4f11c576)
> Oct 10 05:56:20 node6 dhclient[1026]: bound to 10.14.82.141 -- renewal in 8546 seconds.
>
> In the logs that I pasted the messages from in my previous message,
> such messages don't even exist because the nodes are not left up long
> enough to even get to a lease expiry. These are test nodes and so are
> rebooted frequently.
>
> TL;DR: I am quite certain the node did not get a new/different address.

Even the same address can be a problem. That brief window where things
were getting renewed can screw up corosync.

Never ever use dhcp for a cluster node. Ever. Really, never.

>> (or that corosync decided to bind to a different one)
>
> Bind to a different what? Address?

Yes. That is what nodeids are calculated from.
Different nodeid == different address

> As in binding to an address that was not even configured on the machine?

localhost is the most common one

> b.
Re: [Pacemaker] unknown third node added to a 2 node cluster?
On Wed, 2014-10-08 at 12:39 +1100, Andrew Beekhof wrote:
> On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian)
> brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
>> Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
>> and node6, I saw an unknown third node being added to the cluster,
>> but only on node5:
>
> Is either node using dhcp?

Yes, they both are. The server is the ISC DHCP server (on EL6) and the
address pool is much more plentiful than the node count. That is all
just to say that the DHCP server serving these nodes abides by the DHCP
RFC's recommendation to allow clients to continue to use addresses they
have already been assigned when making a renewal request. And indeed,
it gives them the same address they had previously after a lease expiry,
as long as the pool is not constrained and the address is not needed to
satisfy a request from a different machine.

> I would guess node6 got a new IP address

These nodes are using the ISC DHCP client. That DHCP client logs in the
same log (/var/log/messages) as was posted in my prior message when it
renews a lease, with messages such as:

Oct 10 05:56:19 node6 dhclient[1026]: DHCPREQUEST on eth0 to 10.14.80.6 port 67 (xid=0x4f11c576)
Oct 10 05:56:19 node6 dhclient[1026]: DHCPACK from 10.14.80.6 (xid=0x4f11c576)
Oct 10 05:56:20 node6 dhclient[1026]: bound to 10.14.82.141 -- renewal in 8546 seconds.

In the logs that I pasted the messages from in my previous message, such
messages don't even exist because the nodes are not left up long enough
to even get to a lease expiry. These are test nodes and so are rebooted
frequently.

TL;DR: I am quite certain the node did not get a new/different address.

> (or that corosync decided to bind to a different one)

Bind to a different what? Address?

As in binding to an address that was not even configured on the machine?

b.
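For what it's worth, the same /var/log/messages can be scanned mechanically for any dhclient lease traffic between boot and the corosync membership event, to confirm there was no renewal in that window. The script below is a rough sketch only, assuming the standard ISC dhclient syslog lines quoted above; the default path and the patterns are illustrative, not a packaged tool.

import re
import sys

# Lease-related lines ISC dhclient writes to syslog, e.g.
#   dhclient[1026]: DHCPREQUEST on eth0 to 10.14.80.6 port 67 ...
#   dhclient[1026]: bound to 10.14.82.141 -- renewal in 8546 seconds.
LEASE_EVENT = re.compile(
    r"dhclient\[\d+\]: (DHCPDISCOVER|DHCPREQUEST|DHCPACK|DHCPNAK|bound to)")

def lease_events(path):
    # Yield every lease event line so its timestamp can be compared
    # against the corosync membership messages.
    with open(path, errors="replace") as log:
        for line in log:
            if LEASE_EVENT.search(line):
                yield line.rstrip()

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    for event in lease_events(path):
        print(event)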
[Pacemaker] unknown third node added to a 2 node cluster?
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5 and
node6, I saw an unknown third node being added to the cluster, but only
on node5:

Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: memb: node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: memb: node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: update_member: Creating entry for node 2085752330 born on 12
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: update_member: Node 2085752330/unknown is now: member
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2085752330
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2085752330

Above is where this third node seems to appear.

Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
Sep 18 22:52:16 node5 corosync[17321]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 18 22:52:16 node5 cib[17371]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[2085752330] - state is now member (was (null))
Sep 18 22:52:16 node5 crmd[17376]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[2085752330] - state is now member (was (null))
Sep 18 22:52:16 node5 crmd[17376]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 18 22:52:16 node5 crmd[17376]: error: join_make_offer: No recipient for welcome message
Sep 18 22:52:16 node5 crmd[17376]: warning: do_state_transition: Only 2 of 3 cluster nodes are eligible to run resources - continue 0
Sep 18 22:52:16 node5 attrd[17374]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Sep 18 22:52:16 node5 attrd[17374]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Sep 18 22:52:16 node5 stonith-ng[17372]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: Diff: --- 0.31.2
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: Diff: +++ 0.32.1 4a679012144955c802557a39707247a2
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: -- <nvpair value="Stopped" id="res1-meta_attributes-target-role"/>
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: ++ <nvpair name="target-role" id="res1-meta_attributes-target-role" value="Started"/>
Sep 18 22:52:16 node5 pengine[17375]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 18 22:52:16 node5 pengine[17375]: notice: LogActions: Start res1#011(node5)
Sep 18 22:52:16 node5 crmd[17376]: notice: te_rsc_command: Initiating action 7: start res1_start_0 on node5 (local)
Sep 18 22:52:16 node5 pengine[17375]: notice: process_pe_message: Calculated Transition 22: /var/lib/pacemaker/pengine/pe-input-165.bz2
Sep 18 22:52:16 node5 stonith-ng[17372]: notice: stonith_device_register: Device 'st-fencing' already existed in device list (1 active devices)

On node6 at the same time the following was in the log:

Sep 18 22:52:15 node6 corosync[11178]: [TOTEM ] Incrementing problem counter for seqid 5 iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:16 node6 corosync[11178]: [TOTEM ] Incrementing problem counter for seqid 8 iface 10.128.0.221 to [2 of 10]
Sep 18 22:52:17 node6 corosync[11178]: [TOTEM ] Decrementing problem counter for iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:19 node6 corosync[11178]: [TOTEM ] ring 1 active with no faults

Any idea what's going on here?

Cheers,
b.
Re: [Pacemaker] unknown third node added to a 2 node cluster?
On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote:
> Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5 and
> node6, I saw an unknown third node being added to the cluster, but only
> on node5:

Is either node using dhcp?
I would guess node6 got a new IP address (or that corosync decided to
bind to a different one)

> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: memb: node6 3713011210
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: memb: node5 3729788426
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: update_member: Creating entry for node 2085752330 born on 12
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: update_member: Node 2085752330/unknown is now: member
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2085752330
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: node6 3713011210
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: node5 3729788426
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2085752330
>
> Above is where this third node seems to appear.
>
> Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
> Sep 18 22:52:16 node5 corosync[17321]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 18 22:52:16 node5 cib[17371]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[2085752330] - state is now member (was (null))
> Sep 18 22:52:16 node5 crmd[17376]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[2085752330] - state is now member (was (null))
> Sep 18 22:52:16 node5 crmd[17376]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Sep 18 22:52:16 node5 crmd[17376]: error: join_make_offer: No recipient for welcome message
> Sep 18 22:52:16 node5 crmd[17376]: warning: do_state_transition: Only 2 of 3 cluster nodes are eligible to run resources - continue 0
> Sep 18 22:52:16 node5 attrd[17374]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Sep 18 22:52:16 node5 attrd[17374]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Sep 18 22:52:16 node5 stonith-ng[17372]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: Diff: --- 0.31.2
> Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: Diff: +++ 0.32.1 4a679012144955c802557a39707247a2
> Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: -- <nvpair value="Stopped" id="res1-meta_attributes-target-role"/>
> Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: ++ <nvpair name="target-role" id="res1-meta_attributes-target-role" value="Started"/>
> Sep 18 22:52:16 node5 pengine[17375]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Sep 18 22:52:16 node5 pengine[17375]: notice: LogActions: Start res1#011(node5)
> Sep 18 22:52:16 node5 crmd[17376]: notice: te_rsc_command: Initiating action 7: start res1_start_0 on node5 (local)
> Sep 18 22:52:16 node5 pengine[17375]: notice: process_pe_message: Calculated Transition 22: /var/lib/pacemaker/pengine/pe-input-165.bz2
> Sep 18 22:52:16 node5 stonith-ng[17372]: notice: stonith_device_register: Device 'st-fencing' already existed in device list (1 active devices)
>
> On node6 at the same time the following was in the log:
>
> Sep 18 22:52:15 node6 corosync[11178]: [TOTEM ] Incrementing problem counter for seqid 5 iface 10.128.0.221 to [1 of 10]
> Sep 18 22:52:16 node6 corosync[11178]: [TOTEM ] Incrementing problem counter for seqid 8 iface 10.128.0.221 to [2 of 10]
> Sep 18 22:52:17 node6 corosync[11178]: [TOTEM ] Decrementing problem counter for iface 10.128.0.221 to [1 of 10]
> Sep 18 22:52:19 node6 corosync[11178]: [TOTEM ] ring 1 active with no faults
>
> Any idea what's going on here?
>
> Cheers,
> b.