Digimer wrote:
Hi all,
Starting a new thread from the "Clustered LVM with iptables issue"
thread...
I've decided to completely review how I do networking in my cluster. I
make zero claims to being great at networks, so I would love some feedback.
I've got three active/passive bonded interfaces: the Back-Channel (BCN),
Storage (SN) and Internet-Facing (IFN) networks. The IFN is "off limits" to
the cluster, as it is dedicated to hosted server traffic only.
Previously, I used only the BCN for cman/corosync multicast traffic, with
no RRP. A couple of months ago, I had a cluster partition when VM live
migration (also on the BCN) congested the network. So I decided to enable
RRP using the SN as the backup ring, which has been marginally successful.
Now I want to switch to unicast (<cman transport="udpu"/>), keep RRP with
the BCN as the primary ring and the SN as the backup, and put a proper
iptables firewall in place. Is this sane?
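For reference, the relevant cluster.conf pieces I'm planning look roughly
like this. The cluster name, config_version, node names and the *.sn
alternate names are placeholders; the node names resolve to the BCN
addresses and the altnames to the SN addresses:
====] cluster.conf (sketch)
<cluster name="my-cluster" config_version="x">
  <!-- Use UDP unicast instead of multicast for totem -->
  <cman transport="udpu"/>
  <!-- Redundant ring protocol mode -->
  <totem rrp_mode="active"/>
  <clusternodes>
    <!-- Ring 0 (primary) is the name each node resolves to: the BCN -->
    <clusternode name="node1.bcn" nodeid="1">
      <!-- altname adds ring 1 (backup) on the Storage Network -->
      <altname name="node1.sn"/>
    </clusternode>
    <clusternode name="node2.bcn" nodeid="2">
      <altname name="node2.sn"/>
    </clusternode>
  </clusternodes>
</cluster>
====
(Fencing and the rest of the config are left out of the sketch.)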
When I stopped iptables entirely and started cman with unicast + RRP,
I saw this:
====] Node 1
Sep 11 17:31:24 node1 kernel: DLM (built Aug 10 2015 09:45:36) installed
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Corosync Cluster Engine
('1.4.7'): started and ready to provide service.
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Corosync built-in
features: nss dbus rdma snmp
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Successfully read
config from /etc/cluster/cluster.conf
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Successfully parsed
cman config
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing transport
(UDP/IP Unicast).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing transport
(UDP/IP Unicast).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] The network interface
[10.20.10.1] is now up.
Sep 11 17:31:24 node1 corosync[2523]: [QUORUM] Using quorum provider
quorum_cman
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync cluster quorum service v0.1
Sep 11 17:31:24 node1 corosync[2523]: [CMAN ] CMAN 3.0.12.1 (built
Jul 6 2015 05:30:35) started
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync CMAN membership service 2.90
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
openais checkpoint service B.01.01
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync extended virtual synchrony service
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync configuration service
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync cluster closed process group service v1.01
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync cluster config database access v1.01
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync profile loading service
Sep 11 17:31:24 node1 corosync[2523]: [QUORUM] Using quorum provider
quorum_cman
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded:
corosync cluster quorum service v0.1
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Compatibility mode set
to whitetank. Using V1 and V2 of the synchronization engine.
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member
{10.20.10.1}
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member
{10.20.10.2}
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] The network interface
[10.10.10.1] is now up.
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member
{10.10.10.1}
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member
{10.10.10.2}
Sep 11 17:31:27 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 1 iface 10.10.10.1 to [1 of 3]
Sep 11 17:31:27 node1 corosync[2523]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Sep 11 17:31:27 node1 corosync[2523]: [CMAN ] quorum regained,
resuming activity
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] This node is within the
primary component and will provide service.
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[1]: 1
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[1]: 1
Sep 11 17:31:27 node1 corosync[2523]: [CPG ] chosen downlist: sender
r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:0 left:0)
Sep 11 17:31:27 node1 corosync[2523]: [MAIN ] Completed service
synchronization, ready to provide service.
Sep 11 17:31:27 node1 corosync[2523]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node1 corosync[2523]: [CPG ] chosen downlist: sender
r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
Sep 11 17:31:27 node1 corosync[2523]: [MAIN ] Completed service
synchronization, ready to provide service.
Sep 11 17:31:29 node1 corosync[2523]: [TOTEM ] ring 1 active with no
faults
Sep 11 17:31:29 node1 fenced[2678]: fenced 3.0.12.1 started
Sep 11 17:31:29 node1 dlm_controld[2691]: dlm_controld 3.0.12.1 started
Sep 11 17:31:30 node1 gfs_controld[2755]: gfs_controld 3.0.12.1 started
====
====] Node 2
Sep 11 17:31:23 node2 kernel: DLM (built Aug 10 2015 09:45:36) installed
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Corosync Cluster Engine
('1.4.7'): started and ready to provide service.
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Corosync built-in
features: nss dbus rdma snmp
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Successfully read
config from /etc/cluster/cluster.conf
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Successfully parsed
cman config
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing transport
(UDP/IP Unicast).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing transport
(UDP/IP Unicast).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] The network interface
[10.20.10.2] is now up.
Sep 11 17:31:23 node2 corosync[2271]: [QUORUM] Using quorum provider
quorum_cman
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync cluster quorum service v0.1
Sep 11 17:31:23 node2 corosync[2271]: [CMAN ] CMAN 3.0.12.1 (built
Jul 6 2015 05:30:35) started
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync CMAN membership service 2.90
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
openais checkpoint service B.01.01
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync extended virtual synchrony service
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync configuration service
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync cluster closed process group service v1.01
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync cluster config database access v1.01
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync profile loading service
Sep 11 17:31:23 node2 corosync[2271]: [QUORUM] Using quorum provider
quorum_cman
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded:
corosync cluster quorum service v0.1
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Compatibility mode set
to whitetank. Using V1 and V2 of the synchronization engine.
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member
{10.20.10.1}
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member
{10.20.10.2}
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] The network interface
[10.10.10.2] is now up.
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member
{10.10.10.1}
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member
{10.10.10.2}
Sep 11 17:31:26 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 1 iface 10.10.10.2 to [1 of 3]
Sep 11 17:31:26 node2 corosync[2271]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Sep 11 17:31:26 node2 corosync[2271]: [CMAN ] quorum regained,
resuming activity
Sep 11 17:31:26 node2 corosync[2271]: [QUORUM] This node is within the
primary component and will provide service.
Sep 11 17:31:26 node2 corosync[2271]: [QUORUM] Members[1]: 2
Sep 11 17:31:26 node2 corosync[2271]: [QUORUM] Members[1]: 2
Sep 11 17:31:26 node2 corosync[2271]: [CPG ] chosen downlist: sender
r(0) ip(10.20.10.2) r(1) ip(10.10.10.2) ; members(old:0 left:0)
Sep 11 17:31:26 node2 corosync[2271]: [MAIN ] Completed service
synchronization, ready to provide service.
Sep 11 17:31:27 node2 corosync[2271]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Sep 11 17:31:27 node2 corosync[2271]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node2 corosync[2271]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node2 corosync[2271]: [CPG ] chosen downlist: sender
r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
Sep 11 17:31:27 node2 corosync[2271]: [MAIN ] Completed service
synchronization, ready to provide service.
Sep 11 17:31:28 node2 corosync[2271]: [TOTEM ] ring 1 active with no
faults
Sep 11 17:31:28 node2 fenced[2359]: fenced 3.0.12.1 started
Sep 11 17:31:28 node2 dlm_controld[2390]: dlm_controld 3.0.12.1 started
Sep 11 17:31:29 node2 gfs_controld[2442]: gfs_controld 3.0.12.1 started
====
This looked good to me. So I wanted to test RRP by ifdown'ing bcn_bond1
on node 1 only, leaving bcn_bond1 up on node 2. The cluster survived and
seemed to fail over to the SN, but I saw these messages printed repeatedly:
====] Node 1
Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link1
Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface
bcn_link1
Sep 11 17:31:46 node1 kernel: bcn_bond1: the permanent HWaddr of
bcn_link1 - 52:54:00:b0:e4:c8 - is still in use by bcn_bond1 - set the
HWaddr of bcn_link1 to a different address to avoid conflicts
Sep 11 17:31:46 node1 kernel: bcn_bond1: making interface bcn_link2 the
new active one
Sep 11 17:31:46 node1 kernel: ICMPv6 NA: someone advertises our address
fe80:0000:0000:0000:5054:00ff:feb0:e4c8 on bcn_link1!
Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link2
Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface
bcn_link2
Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #7 bcn_link1,
fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0,
dropped=0, active_time=48987 secs
Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #6 bcn_bond1,
fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0,
dropped=0, active_time=48987 secs
Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #3 bcn_bond1,
10.20.10.1#123, interface stats: received=0, sent=0, dropped=0,
active_time=48987 secs
Sep 11 17:31:51 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 677 iface 10.20.10.1 to [1 of 3]
Sep 11 17:31:53 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:31:57 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 679 iface 10.20.10.1 to [1 of 3]
Sep 11 17:31:59 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:04 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 681 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:06 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:11 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 683 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:13 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:17 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 685 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:19 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:24 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 687 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:26 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:31 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 689 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:33 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:37 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 691 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:39 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:44 node1 corosync[2523]: [TOTEM ] Incrementing problem
counter for seqid 693 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:46 node1 corosync[2523]: [TOTEM ] ring 0 active with no
faults
====
====] Node 2
Sep 11 17:31:48 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 676 iface 10.20.10.2 to [1 of 3]
Sep 11 17:31:50 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:31:54 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 678 iface 10.20.10.2 to [1 of 3]
Sep 11 17:31:56 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:01 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 680 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:03 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:08 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 682 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:10 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:14 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 684 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:16 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:21 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 686 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:23 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:28 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 688 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:30 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:35 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 690 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:37 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:41 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 692 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:43 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
Sep 11 17:32:48 node2 corosync[2271]: [TOTEM ] Incrementing problem
counter for seqid 694 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:50 node2 corosync[2271]: [TOTEM ] ring 0 active with no
faults
====
When I ifup'ed bcn_bond1 on node1, the messages stopped printing. So
before I even start on iptables, I am curious if I am doing something
incorrect here.
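For what it's worth, the iptables rules I have in mind for the cluster
traffic are roughly these (assuming /24 subnets on the BCN and SN and the
default ports: UDP 5404-5405 for corosync totem and TCP 21064 for DLM;
they would go ahead of the final REJECT rule):
====] iptables (sketch)
# Corosync totem, ring 0 (BCN) and ring 1 (SN)
iptables -A INPUT -s 10.20.10.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT
iptables -A INPUT -s 10.10.10.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT
# DLM (used by clvmd/gfs2), which follows ring 0 on the BCN
iptables -A INPUT -s 10.20.10.0/24 -p tcp --dport 21064 -j ACCEPT
====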
Advice?
Don't do ifdown. Corosync reacts very badly to ifdown (a long-known
issue, and one of the reasons for knet in a future version).
Also, active RRP is not as well tested as passive, so give passive a try.
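Something like this in cluster.conf (just a sketch; the mode applies to
the redundant ring protocol as a whole):
====] cluster.conf (totem)
<totem rrp_mode="passive"/>
====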
Honza
Thanks!
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org