Re: [ClusterLabs] Problems with corosync and pacemaker with error scenarios
On 01/16/2017 11:18 AM, Gerhard Wiesinger wrote:
> Hello Ken,
>
> thank you for the answers.
>
> On 16.01.2017 16:43, Ken Gaillot wrote:
>> On 01/16/2017 08:56 AM, Gerhard Wiesinger wrote:
>>> Hello,
>>>
>>> I'm new to corosync and pacemaker and I want to set up an nginx
>>> cluster with quorum.
>>>
>>> Requirements:
>>> - 3 Linux machines
>>> - On 2 machines a floating IP should be handled and nginx run as a
>>> load balancing proxy
>>> - The 3rd machine is for quorum only; no services must run there
>>>
>>> corosync/pacemaker is installed on all 3 nodes; firewall ports opened
>>> are: 5404, 5405, 5406 for UDP in both directions
>> If you're using firewalld, the easiest configuration is:
>>
>> firewall-cmd --permanent --add-service=high-availability
>>
>> If not, depending on what you're running, you may also want to open TCP
>> ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM).
>
> I'm using shorewall on the lb01/lb02 nodes and firewalld on kvm01.
>
> pcs status
> Cluster name: lbcluster
> Stack: corosync
> Current DC: lb01 (version 1.1.16-1.fc25-94ff4df) - partition with quorum
> Last updated: Mon Jan 16 16:46:52 2017
> Last change: Mon Jan 16 15:07:59 2017 by root via cibadmin on lb01
>
> 3 nodes configured
> 40 resources configured
>
> Online: [ kvm01 lb01 lb02 ]
>
> Full list of resources:
> ...
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: inactive/disabled
>
> BTW: I'm not running pcsd; as far as I know it is for UI configuration
> only. So ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM)
> are closed. Shouldn't be a problem, right?

pcs uses pcsd for most of its commands, so if you want to use pcs, it
should be enabled and allowed between nodes. You don't have Pacemaker
Remote nodes, so you can leave that port closed. DLM is only necessary
for certain resource types (such as clvmd).

>>> OS: Fedora 25
>>>
>>> Configuration of corosync (only the bindnetaddr is different on every
>>> machine) and pacemaker below.
>> FYI you don't need a different bindnetaddr. You can (and generally
>> should) use the *network* address, which is the same on all hosts.
>
> Only lb01 and lb02 are on the same network; kvm01 is in a different
> location and therefore on a different network.

I'm not familiar with corosync nodes on the same ring using different
networks, but I suppose it's OK since you're using udpu, with ring0_addr
specified for each node.

>>> Configuration works so far, but error test scenarios don't work as
>>> expected:
>>> 1.) I had cases in testing, without quorum and then with quorum
>>> again, where the cluster stayed in the Stopped state.
>>> I had to restart the whole stack to get it online again (killall -9
>>> corosync;systemctl restart corosync;systemctl restart pacemaker)
>>> Any ideas?
>> It will be next to impossible to say without logs. It's definitely not
>> expected behavior. Stopping is the correct response to losing quorum;
>> perhaps quorum is not being properly restored for some reason. What is
>> your test methodology?
>
> I had it when I rebooted just one node.
>
> Testing scenarios are:
> *) Rebooting
> *) Starting/stopping corosync
> *) Network down simulation on lb01/lb02
> *) Putting an interface down with "ifconfig eth1:1 down" (simulation of
> losing an IP address)
> *) See also below
>
> Tested now again with all nodes up (I've configured 13 IP addresses;
> for the sake of a faster overview I posted only the config for 2 IP
> addresses):
> No automatic recovery happens, e.g.:
>
> ifconfig eth1:1 down
>
> Resource Group: ClusterNetworking
>     ClusterIP_01      (ocf::heartbeat:IPaddr2):  FAILED lb02
>     ClusterIPRoute_01 (ocf::heartbeat:Route):    FAILED lb02
>     ClusterIPRule_01  (ocf::heartbeat:Iprule):   Started lb02
>     ClusterIP_02      (ocf::heartbeat:IPaddr2):  FAILED lb02
>     ClusterIPRoute_02 (ocf::heartbeat:Route):    FAILED lb02 (blocked)
>     ClusterIPRule_02  (ocf::heartbeat:Iprule):   Stopped
>     ClusterIP_03      (ocf::heartbeat:IPaddr2):  Stopped
>     ClusterIPRoute_03 (ocf::heartbeat:Route):    Stopped
>     ClusterIPRule_03  (ocf::heartbeat:Iprule):   Stopped
>     ...
>     ClusterIP_13      (ocf::heartbeat:IPaddr2):  Stopped
>     ClusterIPRoute_13 (ocf::heartbeat:Route):    Stopped
>     ClusterIPRule_13  (ocf::heartbeat:Iprule):   Stopped
> webserver (ocf::heartbeat:nginx): Stopped
>
> Failed Actions:
> * ClusterIP_01_monitor_1 on lb02 'not running' (7): call=176,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Jan 16 16:53:49 2017', queued=0ms, exec=0ms
> * ClusterIP_02_monitor_1 on lb02 'not running' (7): call=182,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Jan 16 16:54:01 2017', queued=0ms, exec=0ms
>
> Only this helps:
> killall -9 corosync;systemctl restart corosync;systemctl restart pacemaker

The question is why ClusterIPRoute_02 is blocked. As long as it's
blocked, the cluster can't ...
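A hedged recovery sketch for the "blocked" state above (assuming pcs
0.9 / Pacemaker 1.1 syntax): FAILED (blocked) normally means a stop
operation failed, and since stonith is disabled the cluster cannot
fence the node to recover, so the resource stays blocked until the
failure is cleared. Clearing it manually avoids restarting the whole
stack:

    crm_mon -1 --failcounts                   # one-shot status including per-resource fail counts
    pcs resource failcount show ClusterIPRoute_02
    pcs resource cleanup ClusterIPRoute_02    # clear the failure; the cluster re-probes and retries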
Re: [ClusterLabs] Problems with corosync and pacemaker with error scenarios
On 16.1.2017 18:18, Gerhard Wiesinger wrote:
> Hello Ken,
>
> thank you for the answers.
>
> On 16.01.2017 16:43, Ken Gaillot wrote:
>> On 01/16/2017 08:56 AM, Gerhard Wiesinger wrote:
>>> Hello,
>>>
>>> I'm new to corosync and pacemaker and I want to set up an nginx
>>> cluster with quorum.
>>>
>>> Requirements:
>>> - 3 Linux machines
>>> - On 2 machines a floating IP should be handled and nginx run as a
>>> load balancing proxy
>>> - The 3rd machine is for quorum only; no services must run there
>>>
>>> corosync/pacemaker is installed on all 3 nodes; firewall ports opened
>>> are: 5404, 5405, 5406 for UDP in both directions
>> If you're using firewalld, the easiest configuration is:
>>
>> firewall-cmd --permanent --add-service=high-availability
>>
>> If not, depending on what you're running, you may also want to open TCP
>> ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM).
>
> I'm using shorewall on the lb01/lb02 nodes and firewalld on kvm01.
>
> pcs status
> Cluster name: lbcluster
> Stack: corosync
> Current DC: lb01 (version 1.1.16-1.fc25-94ff4df) - partition with quorum
> Last updated: Mon Jan 16 16:46:52 2017
> Last change: Mon Jan 16 15:07:59 2017 by root via cibadmin on lb01
>
> 3 nodes configured
> 40 resources configured
>
> Online: [ kvm01 lb01 lb02 ]
>
> Full list of resources:
> ...
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: inactive/disabled
>
> BTW: I'm not running pcsd; as far as I know it is for UI configuration
> only. So ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM)
> are closed. Shouldn't be a problem, right?

Besides providing a GUI configuration tool, pcsd also serves as the
daemon pcs talks to when managing cluster nodes. Even basic commands
such as "pcs cluster setup", "pcs cluster start" and "pcs cluster stop"
depend on pcsd running on the nodes. It is possible to manage a cluster
without pcsd running, but I really do not recommend that.

Regards,
Tomas

>>> OS: Fedora 25
>>>
>>> Configuration of corosync (only the bindnetaddr is different on every
>>> machine) and pacemaker below.
>> FYI you don't need a different bindnetaddr. You can (and generally
>> should) use the *network* address, which is the same on all hosts.
>
> Only lb01 and lb02 are on the same network; kvm01 is in a different
> location and therefore on a different network.
>
>>> Configuration works so far, but error test scenarios don't work as
>>> expected:
>>> 1.) I had cases in testing, without quorum and then with quorum
>>> again, where the cluster stayed in the Stopped state.
>>> I had to restart the whole stack to get it online again (killall -9
>>> corosync;systemctl restart corosync;systemctl restart pacemaker)
>>> Any ideas?
>> It will be next to impossible to say without logs. It's definitely not
>> expected behavior. Stopping is the correct response to losing quorum;
>> perhaps quorum is not being properly restored for some reason. What is
>> your test methodology?
>
> I had it when I rebooted just one node.
>
> Testing scenarios are:
> *) Rebooting
> *) Starting/stopping corosync
> *) Network down simulation on lb01/lb02
> *) Putting an interface down with "ifconfig eth1:1 down" (simulation of
> losing an IP address)
> *) See also below
>
> Tested now again with all nodes up (I've configured 13 IP addresses;
> for the sake of a faster overview I posted only the config for 2 IP
> addresses):
> No automatic recovery happens, e.g.:
>
> ifconfig eth1:1 down
>
> Resource Group: ClusterNetworking
>     ClusterIP_01      (ocf::heartbeat:IPaddr2):  FAILED lb02
>     ClusterIPRoute_01 (ocf::heartbeat:Route):    FAILED lb02
>     ClusterIPRule_01  (ocf::heartbeat:Iprule):   Started lb02
>     ClusterIP_02      (ocf::heartbeat:IPaddr2):  FAILED lb02
>     ClusterIPRoute_02 (ocf::heartbeat:Route):    FAILED lb02 (blocked)
>     ClusterIPRule_02  (ocf::heartbeat:Iprule):   Stopped
>     ClusterIP_03      (ocf::heartbeat:IPaddr2):  Stopped
>     ClusterIPRoute_03 (ocf::heartbeat:Route):    Stopped
>     ClusterIPRule_03  (ocf::heartbeat:Iprule):   Stopped
>     ...
>     ClusterIP_13      (ocf::heartbeat:IPaddr2):  Stopped
>     ClusterIPRoute_13 (ocf::heartbeat:Route):    Stopped
>     ClusterIPRule_13  (ocf::heartbeat:Iprule):   Stopped
> webserver (ocf::heartbeat:nginx): Stopped
>
> Failed Actions:
> * ClusterIP_01_monitor_1 on lb02 'not running' (7): call=176,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Jan 16 16:53:49 2017', queued=0ms, exec=0ms
> * ClusterIP_02_monitor_1 on lb02 'not running' (7): call=182,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Jan 16 16:54:01 2017', queued=0ms, exec=0ms
>
> Only this helps:
> killall -9 corosync;systemctl restart corosync;systemctl restart pacemaker
>
> also had now on a test:
>
> Failed Actions:
> * webserver_start_0 on lb02 'not configured' (6): call=499,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Jan 16 17:04:13 2017', queued=0ms, exec=4120ms
>
> Why is it not configured now?
>
> killall -9 cor...
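Following up the note above that pcs depends on pcsd: a minimal sketch
of enabling pcsd and opening its port (firewalld shown, as on kvm01; on
the shorewall nodes the equivalent is an ACCEPT rule for TCP 2224
between the cluster nodes). Note the high-availability firewalld
service already includes port 2224:

    systemctl enable --now pcsd                    # start pcsd and enable it at boot
    firewall-cmd --permanent --add-port=2224/tcp   # or: --add-service=high-availability
    firewall-cmd --reload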
Re: [ClusterLabs] Problems with corosync and pacemaker with error scenarios
Hello Ken,

thank you for the answers.

On 16.01.2017 16:43, Ken Gaillot wrote:
> On 01/16/2017 08:56 AM, Gerhard Wiesinger wrote:
>> Hello,
>>
>> I'm new to corosync and pacemaker and I want to set up an nginx
>> cluster with quorum.
>>
>> Requirements:
>> - 3 Linux machines
>> - On 2 machines a floating IP should be handled and nginx run as a
>> load balancing proxy
>> - The 3rd machine is for quorum only; no services must run there
>>
>> corosync/pacemaker is installed on all 3 nodes; firewall ports opened
>> are: 5404, 5405, 5406 for UDP in both directions
> If you're using firewalld, the easiest configuration is:
>
> firewall-cmd --permanent --add-service=high-availability
>
> If not, depending on what you're running, you may also want to open TCP
> ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM).

I'm using shorewall on the lb01/lb02 nodes and firewalld on kvm01.

pcs status
Cluster name: lbcluster
Stack: corosync
Current DC: lb01 (version 1.1.16-1.fc25-94ff4df) - partition with quorum
Last updated: Mon Jan 16 16:46:52 2017
Last change: Mon Jan 16 15:07:59 2017 by root via cibadmin on lb01

3 nodes configured
40 resources configured

Online: [ kvm01 lb01 lb02 ]

Full list of resources:
...

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: inactive/disabled

BTW: I'm not running pcsd; as far as I know it is for UI configuration
only. So ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM)
are closed. Shouldn't be a problem, right?

>> OS: Fedora 25
>>
>> Configuration of corosync (only the bindnetaddr is different on every
>> machine) and pacemaker below.
> FYI you don't need a different bindnetaddr. You can (and generally
> should) use the *network* address, which is the same on all hosts.

Only lb01 and lb02 are on the same network; kvm01 is in a different
location and therefore on a different network.

>> Configuration works so far, but error test scenarios don't work as
>> expected:
>> 1.) I had cases in testing, without quorum and then with quorum again,
>> where the cluster stayed in the Stopped state.
>> I had to restart the whole stack to get it online again (killall -9
>> corosync;systemctl restart corosync;systemctl restart pacemaker)
>> Any ideas?
> It will be next to impossible to say without logs. It's definitely not
> expected behavior. Stopping is the correct response to losing quorum;
> perhaps quorum is not being properly restored for some reason. What is
> your test methodology?

I had it when I rebooted just one node.

Testing scenarios are:
*) Rebooting
*) Starting/stopping corosync
*) Network down simulation on lb01/lb02
*) Putting an interface down with "ifconfig eth1:1 down" (simulation of
losing an IP address)
*) See also below

Tested now again with all nodes up (I've configured 13 IP addresses;
for the sake of a faster overview I posted only the config for 2 IP
addresses):
No automatic recovery happens, e.g.:

ifconfig eth1:1 down

Resource Group: ClusterNetworking
    ClusterIP_01      (ocf::heartbeat:IPaddr2):  FAILED lb02
    ClusterIPRoute_01 (ocf::heartbeat:Route):    FAILED lb02
    ClusterIPRule_01  (ocf::heartbeat:Iprule):   Started lb02
    ClusterIP_02      (ocf::heartbeat:IPaddr2):  FAILED lb02
    ClusterIPRoute_02 (ocf::heartbeat:Route):    FAILED lb02 (blocked)
    ClusterIPRule_02  (ocf::heartbeat:Iprule):   Stopped
    ClusterIP_03      (ocf::heartbeat:IPaddr2):  Stopped
    ClusterIPRoute_03 (ocf::heartbeat:Route):    Stopped
    ClusterIPRule_03  (ocf::heartbeat:Iprule):   Stopped
    ...
    ClusterIP_13      (ocf::heartbeat:IPaddr2):  Stopped
    ClusterIPRoute_13 (ocf::heartbeat:Route):    Stopped
    ClusterIPRule_13  (ocf::heartbeat:Iprule):   Stopped
webserver (ocf::heartbeat:nginx): Stopped

Failed Actions:
* ClusterIP_01_monitor_1 on lb02 'not running' (7): call=176,
  status=complete, exitreason='none',
  last-rc-change='Mon Jan 16 16:53:49 2017', queued=0ms, exec=0ms
* ClusterIP_02_monitor_1 on lb02 'not running' (7): call=182,
  status=complete, exitreason='none',
  last-rc-change='Mon Jan 16 16:54:01 2017', queued=0ms, exec=0ms

Only this helps:
killall -9 corosync;systemctl restart corosync;systemctl restart pacemaker

also had now on a test:

Failed Actions:
* webserver_start_0 on lb02 'not configured' (6): call=499,
  status=complete, exitreason='none',
  last-rc-change='Mon Jan 16 17:04:13 2017', queued=0ms, exec=4120ms

Why is it not configured now?

killall -9 corosync;systemctl restart corosync;systemctl restart pacemaker
got me to the next situation, below, where some resources are FAILED and
blocked on lb02 and the cluster didn't start up (why is it started on
different nodes anyway???)

ClusterIP_05      (ocf::heartbeat:IPaddr2): Started lb01
ClusterIPRoute_05 (ocf::heartbeat:Route):   Started lb01
ClusterIPRule_05  (ocf::heartbeat:Iprule):  ...
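One hedged explanation for "No automatic recovery happens": by default
Pacemaker never expires resource failures, so a failed resource can
stay stopped until a manual "pcs resource cleanup". Setting a
failure-timeout lets the cluster retry after transient failures such as
a briefly missing interface. A sketch (pcs 0.9 syntax assumed; the
values are illustrative, not recommendations):

    pcs resource defaults failure-timeout=60s     # expire failures after 60s, allowing automatic retry
    pcs resource defaults migration-threshold=3   # after 3 failures, move the resource off the node

Note this cannot unblock a failed stop: with stonith disabled, a failed
stop blocks the resource until it is cleaned up.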
Re: [ClusterLabs] Problems with corosync and pacemaker with error scenarios
On 01/16/2017 08:56 AM, Gerhard Wiesinger wrote:
> Hello,
>
> I'm new to corosync and pacemaker and I want to set up an nginx cluster
> with quorum.
>
> Requirements:
> - 3 Linux machines
> - On 2 machines a floating IP should be handled and nginx run as a load
> balancing proxy
> - The 3rd machine is for quorum only; no services must run there
>
> corosync/pacemaker is installed on all 3 nodes; firewall ports opened
> are: 5404, 5405, 5406 for UDP in both directions

If you're using firewalld, the easiest configuration is:

firewall-cmd --permanent --add-service=high-availability

If not, depending on what you're running, you may also want to open TCP
ports 2224 (pcsd), 3121 (Pacemaker Remote), and 21064 (DLM).

> OS: Fedora 25
>
> Configuration of corosync (only the bindnetaddr is different on every
> machine) and pacemaker below.

FYI you don't need a different bindnetaddr. You can (and generally
should) use the *network* address, which is the same on all hosts.

> Configuration works so far, but error test scenarios don't work as
> expected:
> 1.) I had cases in testing, without quorum and then with quorum again,
> where the cluster stayed in the Stopped state.
> I had to restart the whole stack to get it online again (killall -9
> corosync;systemctl restart corosync;systemctl restart pacemaker)
> Any ideas?

It will be next to impossible to say without logs. It's definitely not
expected behavior. Stopping is the correct response to losing quorum;
perhaps quorum is not being properly restored for some reason. What is
your test methodology?

> 2.) Restarting pacemaker on an inactive node also restarts resources on
> the other, active node:
> a.) Everything up & OK
> b.) lb01 handles all resources
> c.) On lb02, which handles no resources: systemctl restart pacemaker:
> all resources will also be restarted, with a short outage on lb01 (state
> is Stopped, Started [ lb01 lb02 ] and then Started lb02)
> How can this be avoided?

This is not expected behavior, except with clones, which I don't see
you using.

> 3.) Stopping and starting corosync doesn't bring the node up again:
> systemctl stop corosync;sleep 10;systemctl restart corosync
> Online: [ kvm01 lb01 ]
> OFFLINE: [ lb02 ]
> Stays in that state until pacemaker is restarted: systemctl restart
> pacemaker
> Bug?

No, pacemaker should always restart if corosync restarts. That is
specified in the systemd units, so I'm not sure why pacemaker didn't
automatically restart in your case.

> 4.) "systemctl restart corosync" hangs sometimes (waiting 2 min); it
> needs a
> killall -9 corosync;systemctl restart corosync;systemctl restart
> pacemaker
> sequence to get it up again
>
> 5.) Simulation of split brain: disabling/re-enabling the local firewall
> for the corosync ports (5404, 5405, 5406) on nodes lb01 and lb02

FYI for an accurate simulation, be sure to block both incoming and
outgoing traffic on the corosync ports.

> doesn't bring corosync up again after re-enabling the lb02 firewall:
> partition WITHOUT quorum
> Online: [ kvm01 ]
> OFFLINE: [ lb01 lb02 ]
> NOK: restart on lb02: systemctl restart corosync;systemctl restart
> pacemaker
> OK: restart on lb02 and kvm01 (quorum host): systemctl restart
> corosync;systemctl restart pacemaker
> I also see that resources are tried to be started on the non-enabled
> quorum host kvm01:
> Started [ kvm01 lb02 ]
> Started lb02
> Any ideas?
>
> I've also written a new ocf:heartbeat:Iprule resource agent to modify
> "ip rule" accordingly.
>
> Versions are:
> corosync: 2.4.2
> pacemaker: 1.1.16
> Kernel: 4.9.3-200.fc25.x86_64
>
> Thnx.
>
> Ciao,
> Gerhard
>
> Corosync config:
>
> totem {
>     version: 2
>     cluster_name: lbcluster
>     crypto_cipher: aes256
>     crypto_hash: sha512
>     interface {
>         ringnumber: 0
>         bindnetaddr: 1.2.3.35
>         mcastport: 5405
>     }
>     transport: udpu
> }
> logging {
>     fileline: off
>     to_logfile: yes
>     to_syslog: yes
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
> nodelist {
>     node {
>         ring0_addr: lb01
>         nodeid: 1
>     }
>     node {
>         ring0_addr: lb02
>         nodeid: 2
>     }
>     node {
>         ring0_addr: kvm01
>         nodeid: 3
>     }
> }
> quorum {
>     # Enable and configure quorum subsystem (default: off)
>     # see also corosync.conf.5 and votequorum.5
>     #provider: corosync_votequorum
>     provider: corosync_votequorum
>     # Only for 2 node setup!
>     # two_node: 1
> }
> ==...
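Regarding the split-brain simulation discussed above: a minimal sketch
of blocking corosync traffic in both directions with iptables (port
range taken from the firewall list above; shorewall or firewalld rules
would need to cover both directions the same way):

    # simulate a partition: drop corosync traffic both ways
    iptables -A INPUT  -p udp --dport 5404:5406 -j DROP
    iptables -A OUTPUT -p udp --dport 5404:5406 -j DROP

    # heal the partition: remove the rules again
    iptables -D INPUT  -p udp --dport 5404:5406 -j DROP
    iptables -D OUTPUT -p udp --dport 5404:5406 -j DROP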
[ClusterLabs] Problems with corosync and pacemaker with error scenarios
Hello,

I'm new to corosync and pacemaker and I want to set up an nginx cluster
with quorum.

Requirements:
- 3 Linux machines
- On 2 machines a floating IP should be handled and nginx run as a load
balancing proxy
- The 3rd machine is for quorum only; no services must run there

corosync/pacemaker is installed on all 3 nodes; firewall ports opened
are: 5404, 5405, 5406 for UDP in both directions

OS: Fedora 25

Configuration of corosync (only the bindnetaddr is different on every
machine) and pacemaker below.

Configuration works so far, but error test scenarios don't work as
expected:

1.) I had cases in testing, without quorum and then with quorum again,
where the cluster stayed in the Stopped state.
I had to restart the whole stack to get it online again (killall -9
corosync;systemctl restart corosync;systemctl restart pacemaker)
Any ideas?

2.) Restarting pacemaker on an inactive node also restarts resources on
the other, active node:
a.) Everything up & OK
b.) lb01 handles all resources
c.) On lb02, which handles no resources: systemctl restart pacemaker:
all resources will also be restarted, with a short outage on lb01 (state
is Stopped, Started [ lb01 lb02 ] and then Started lb02)
How can this be avoided?

3.) Stopping and starting corosync doesn't bring the node up again:
systemctl stop corosync;sleep 10;systemctl restart corosync
Online: [ kvm01 lb01 ]
OFFLINE: [ lb02 ]
Stays in that state until pacemaker is restarted: systemctl restart
pacemaker
Bug?

4.) "systemctl restart corosync" hangs sometimes (waiting 2 min); it
needs a
killall -9 corosync;systemctl restart corosync;systemctl restart pacemaker
sequence to get it up again

5.) Simulation of split brain: disabling/re-enabling the local firewall
for the corosync ports (5404, 5405, 5406) on nodes lb01 and lb02
doesn't bring corosync up again after re-enabling the lb02 firewall:
partition WITHOUT quorum
Online: [ kvm01 ]
OFFLINE: [ lb01 lb02 ]
NOK: restart on lb02: systemctl restart corosync;systemctl restart pacemaker
OK: restart on lb02 and kvm01 (quorum host): systemctl restart
corosync;systemctl restart pacemaker
I also see that resources are tried to be started on the non-enabled
quorum host kvm01:
Started [ kvm01 lb02 ]
Started lb02
Any ideas?

I've also written a new ocf:heartbeat:Iprule resource agent to modify
"ip rule" accordingly.

Versions are:
corosync: 2.4.2
pacemaker: 1.1.16
Kernel: 4.9.3-200.fc25.x86_64

Thnx.

Ciao,
Gerhard

Corosync config:

totem {
    version: 2
    cluster_name: lbcluster
    crypto_cipher: aes256
    crypto_hash: sha512
    interface {
        ringnumber: 0
        bindnetaddr: 1.2.3.35
        mcastport: 5405
    }
    transport: udpu
}
logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
nodelist {
    node {
        ring0_addr: lb01
        nodeid: 1
    }
    node {
        ring0_addr: lb02
        nodeid: 2
    }
    node {
        ring0_addr: kvm01
        nodeid: 3
    }
}
quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    #provider: corosync_votequorum
    provider: corosync_votequorum
    # Only for 2 node setup!
    # two_node: 1
}

# Default properties
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=stop
pcs property set default-resource-stickiness=100
pcs property set symmetric-cluster=false

# Delete & cleanup resources
pcs resource delete webserver
pcs resource cleanup webserver
pcs resource delete ClusterIP_01
pcs resource cleanup ClusterIP_01
pcs resource delete ClusterIPRoute_01
pcs resource cleanup ClusterIPRoute_01
pcs resource delete ClusterIPRule_01
pcs resource cleanup ClusterIPRule_01
pcs resou...
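Since symmetric-cluster=false above makes this an opt-in cluster,
resources run only on nodes that a location constraint explicitly
allows, which is also what keeps them off the quorum-only node kvm01.
A hedged sketch of what the missing part of such a configuration could
look like (resource and group names are taken from the status output
earlier in the thread; the IP address, netmask and scores are made-up
placeholders):

    # example resource in the group (parameters are placeholders)
    pcs resource create ClusterIP_01 ocf:heartbeat:IPaddr2 \
        ip=1.2.3.36 cidr_netmask=24 --group ClusterNetworking

    # opt-in: allow the group and the webserver on lb01/lb02 only
    pcs constraint location ClusterNetworking prefers lb01=200
    pcs constraint location ClusterNetworking prefers lb02=100
    pcs constraint location webserver prefers lb01=200
    pcs constraint location webserver prefers lb02=100

With no constraint naming kvm01, nothing can ever run there, even when
kvm01 is the only node left with quorum.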