[Pacemaker] How to group resources without ordering them
Dear Developers & Users,

I have 4 resources: p_eth0, p_conntrackd, p_openvpn1, p_openvpn2.

Right now I use a group and a colocation constraint so that p_eth0 and p_conntrackd start in the right order (first eth0, then conntrackd). I now want to also include p_openvpn1 and p_openvpn2, but without any ordering between them. That means: running on the same cluster node, but independent of each other. I want p_openvpn2 to not depend on p_openvpn1 starting (that is the default behavior, IIRC, without groups/orders).

Any help is greatly appreciated.

Best regards
Stefan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
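If it helps, one way to sketch this with the crm shell is a colocation constraint using an unordered resource set (parenthesized, i.e. sequential="false"), while keeping the existing group for the ordered pair. The group and constraint names here (g_net, col_vpn) are made up for illustration, and this is an untested sketch:

```
# p_eth0 and p_conntrackd keep their ordered start (a group implies order):
group g_net p_eth0 p_conntrackd

# Colocate both OpenVPN resources with the group. The parentheses create
# an unordered set (sequential="false"), so p_openvpn1 and p_openvpn2 run
# on the same node as g_net but start independently of each other:
colocation col_vpn inf: ( p_openvpn1 p_openvpn2 ) g_net
```

With this, a failure to start p_openvpn1 should not block p_openvpn2, since no order constraint links the two.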
Re: [Pacemaker] Beginner Question: not able to shutdown 2nd node
On Mon, Nov 25, 2013 at 8:44 PM, Digimer wrote:
> On 25/11/13 21:18, T.J. Yang wrote:
>> Hi
>>
>> I need help here. Looks like I missed a step to start up the two nodes to
>> listen on port 2224?
>>
>> [root@ilclpm01 ~]# pcs --version
>> 0.9.90
>> [root@ilclpm01 ~]# pcs --debug cluster stop ilclpm02
>> Sending HTTP Request to: https://ilclpm02:2224/remote/cluster_stop
>> Data: None
>> Response Reason: [Errno 111] Connection refused
>> Error: unable to stop all nodes
>> Unable to connect to ilclpm02 ([Errno 111] Connection refused)
>
> Is pcsd running?
>
> If this is RHEL / CentOS 6, then I do not believe pcsd works.

Hi Digimer,

Thanks for responding to my question. I can't find a pcsd binary in the three packages I installed:

[root@ilclpm01 ~]# rpm -qil pcs cman pacemaker | grep pcsd
[root@ilclpm01 ~]#

Following are more details about my test cluster:

 3617 ?        SLsl   0:07 corosync -f
 3674 ?        Ssl    0:00 fenced
 3690 ?        Ssl    0:00 dlm_controld
 3749 ?        Ssl    0:00 gfs_controld
 3832 pts/0    S      0:01 pacemakerd
 3838 ?        Ss     0:01  \_ /usr/libexec/pacemaker/cib
 3839 ?        Ss     0:01  \_ /usr/libexec/pacemaker/stonithd
 3840 ?        Ss     0:02  \_ /usr/libexec/pacemaker/lrmd
 3841 ?        Ss     0:01  \_ /usr/libexec/pacemaker/attrd
 3842 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pengine
 3843 ?        Ss     0:01  \_ /usr/libexec/pacemaker/crmd

[root@ilclpm01 ~]# rpm -q cman
cman-3.0.12.1-59.el6.x86_64
[root@ilclpm01 ~]# rpm -q pacemaker
pacemaker-1.1.10-14.el6.x86_64
[root@ilclpm01 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)

> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

--
T.J.
Yang
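Since pcsd is not shipped on RHEL 6 (as Digimer noted), stopping the cman/pacemaker stack there is normally done per node with the init scripts, rather than via pcs contacting port 2224 on the peer. A sketch, assuming the RHEL 6.5 service names shown in the thread:

```
# Run this on the node to be stopped (e.g. ilclpm02):
# stop Pacemaker first so its resources are stopped or moved cleanly,
# then tear down the cman/corosync membership layer.
service pacemaker stop
service cman stop
```

Running `pcs cluster stop` with no node argument should do essentially the same thing for the local node, without needing to reach pcsd anywhere.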
Re: [Pacemaker] Beginner Question: not able to shutdown 2nd node
On 25/11/13 21:18, T.J. Yang wrote:
> Hi
>
> I need help here. Looks like I missed a step to start up the two nodes to
> listen on port 2224?
>
> [root@ilclpm01 ~]# pcs --version
> 0.9.90
> [root@ilclpm01 ~]# pcs --debug cluster stop ilclpm02
> Sending HTTP Request to: https://ilclpm02:2224/remote/cluster_stop
> Data: None
> Response Reason: [Errno 111] Connection refused
> Error: unable to stop all nodes
> Unable to connect to ilclpm02 ([Errno 111] Connection refused)

Is pcsd running?

If this is RHEL / CentOS 6, then I do not believe pcsd works.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
[Pacemaker] Beginner Question: not able to shutdown 2nd node
Hi,

I need help here. Looks like I missed a step to start up the two nodes to listen on port 2224?

[root@ilclpm01 ~]# pcs --version
0.9.90
[root@ilclpm01 ~]# pcs --debug cluster stop ilclpm02
Sending HTTP Request to: https://ilclpm02:2224/remote/cluster_stop
Data: None
Response Reason: [Errno 111] Connection refused
Error: unable to stop all nodes
Unable to connect to ilclpm02 ([Errno 111] Connection refused)

[root@ilclpm01 ~]# pcs status
Cluster name: pacemaker1
Last updated: Mon Nov 25 20:12:49 2013
Last change: Mon Nov 25 19:07:52 2013 via cibadmin on ilclpm01
Stack: cman
Current DC: ilclpm02 - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
2 Resources configured

Online: [ ilclpm01 ilclpm02 ]

Full list of resources:

 my_first_svc (ocf::pacemaker:Dummy): Started ilclpm02
 ClusterIP (ocf::heartbeat:IPaddr2): Started ilclpm01

[root@ilclpm01 ~]# nmap ilclpm02

Starting Nmap 5.51 ( http://nmap.org ) at 2013-11-25 20:16 CST
Nmap scan report for ilclpm02 (100.64.16.102)
Host is up (0.82s latency).
rDNS record for 10.64.16.102: ilclpm02.test.net
Not shown: 998 closed ports
PORT    STATE SERVICE
22/tcp  open  ssh
111/tcp open  rpcbind
MAC Address: 00:50:56:xx:xx:CA (VMware)

Nmap done: 1 IP address (1 host up) scanned in 5.66 seconds

--
T.J. Yang
Re: [Pacemaker] Pacemaker very often STONITHs other node
On 25.11.2013 18:25, Digimer wrote:
> I'd like to see the full logs, starting from a little before the issue
> started.

Here are the logs from Nov 17 through Nov 24 (my pastebin is too small to handle them):

Node A - https://www.dropbox.com/sh/dj08fbckj9zo104/Ew1QpdRq9A/A.log
Node B - https://www.dropbox.com/sh/dj08fbckj9zo104/p9ldlBkGkG/B.log

> It looks though like, for whatever reason, a stop was called, failed, so
> the node was fenced. This would mean that congestion, as you suggested,
> is not the likely cause.
>
> Out of curiosity though; what bonding mode are you using? My testing
> showed that only mode=1 was reliable. Since I tested, corosync added
> support for mode=0 and mode=2, but I've not re-tested them. When I was
> doing my bonding tests, I found all other modes to break communications
> in some manner of use or failure/recovery testing.

I use 802.3ad mode (so it is mode 4):

auto bond0
iface bond0 inet static
    slaves eth4 eth5
    bond-mode 802.3ad
    bond-lacp_rate fast
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    address 10.0.0.1
    netmask 255.255.255.0
    broadcast 10.0.0.255

Do you think that could be the reason - I mean, the wrong mode causing some communication issues? Thank you once more!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"In life, only moments are beautiful" [Ryszard Riedel]
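For comparison, Digimer's recommended mode=1 (active-backup) in the same Debian-style /etc/network/interfaces form would look roughly like this (slave names and addresses copied from the config above; an untested sketch, not this cluster's actual config):

```
auto bond0
iface bond0 inet static
    slaves eth4 eth5
    bond-mode active-backup   # mode 1: one active slave, the other on standby
    bond-miimon 100           # MII link check every 100 ms
    bond-downdelay 200
    bond-updelay 200
    address 10.0.0.1
    netmask 255.255.255.0
    broadcast 10.0.0.255
```

Unlike 802.3ad, active-backup needs no switch/LACP cooperation, which is part of why it tends to behave predictably for corosync traffic.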
Re: [Pacemaker] Pacemaker very often STONITHs other node
On 25/11/13 10:39, Michał Margula wrote:
> On 25.11.2013 15:44, Digimer wrote:
>> My first thought is that the network is congested. That is a lot of
>> servers to have on the system. Do you or can you isolate the corosync
>> traffic from the drbd traffic?
>>
>> Personally, I always set up a dedicated network for corosync, another
>> for drbd and a third for all traffic to/from the servers. With this, I
>> have never had a congestion-based problem.
>>
>> If possible, please paste all logs from both nodes, starting just
>> before the stonith occurred until recovery completed.
>
> Hello,
>
> DRBD and CRM go over a dedicated link (two gigabit links bonded into
> one). It is never saturated nor congested; it barely reaches 300 Mbps at
> its highest points. I have a separate link for traffic from/to the
> virtual machines and also a separate link to manage the nodes (just for
> SSH and SNMP). I can isolate corosync onto a separate link, but it would
> take some time to do.
>
> Now, the logs...
>
> Trouble started on November 23 at 15:14.
> Here is a log from the "A" node: http://pastebin.com/yM1fqvQ6
> Node B: http://pastebin.com/nwbctcgg
>
> Node B is the one that got hit by STONITH. It got killed at 15:18:50. I
> have some trouble understanding the reasons for that.
>
> Is the reason for the STONITH that these operations took a long time to
> finish?
>
> Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[114] on XEN-piaskownica for client 9529 stayed in
> operation list for 24760 ms (longer than 1 ms)
> Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
> operation list for 25760 ms (longer than 1 ms)
> Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[116] on XEN-frodo for client 9529 stayed in operation
> list for 50760 ms (longer than 1 ms)
>
> But I wonder what made it stop those virtual machines in the first place?
> Another clue is here:
>
> Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
> reduce operation contention either by increasing lrmd max_children or by
> increasing intervals of monitor operations
>
> And here:
>
> coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
> rivendell-B: not running (7)
>
> But why "not running"? It is not really true. Also some trouble with
> fencing:
>
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
> rivendell-A: unknown error (1)
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
> after 100 failures (max=100)
>
> Thank you!

I'd like to see the full logs, starting from a little before the issue started.

It looks, though, like for whatever reason a stop was called and failed, so the node was fenced. This would mean that congestion, as you suggested, is not the likely cause.

Out of curiosity, though: what bonding mode are you using? My testing showed that only mode=1 was reliable. Since I tested, corosync added support for mode=0 and mode=2, but I've not re-tested them. When I was doing my bonding tests, I found all other modes broke communications in some manner of use or failure/recovery testing.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
Re: [Pacemaker] Pacemaker very often STONITHs other node
On 25.11.2013 15:44, Digimer wrote:
> My first thought is that the network is congested. That is a lot of
> servers to have on the system. Do you or can you isolate the corosync
> traffic from the drbd traffic?
>
> Personally, I always set up a dedicated network for corosync, another
> for drbd and a third for all traffic to/from the servers. With this, I
> have never had a congestion-based problem.
>
> If possible, please paste all logs from both nodes, starting just before
> the stonith occurred until recovery completed.

Hello,

DRBD and CRM go over a dedicated link (two gigabit links bonded into one). It is never saturated nor congested; it barely reaches 300 Mbps at its highest points. I have a separate link for traffic from/to the virtual machines and also a separate link to manage the nodes (just for SSH and SNMP). I can isolate corosync onto a separate link, but it would take some time to do.

Now, the logs...

Trouble started on November 23 at 15:14.
Here is a log from the "A" node: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg

Node B is the one that got hit by STONITH. It got killed at 15:18:50. I have some trouble understanding the reasons for that.

Is the reason for the STONITH that these operations took a long time to finish?

Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the operation stop[114] on XEN-piaskownica for client 9529 stayed in operation list for 24760 ms (longer than 1 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the operation stop[115] on XEN-acsystemy01 for client 9529 stayed in operation list for 25760 ms (longer than 1 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the operation stop[116] on XEN-frodo for client 9529 stayed in operation list for 50760 ms (longer than 1 ms)

But I wonder what made it stop those virtual machines in the first place?
Another clue is here:

Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice: reduce operation contention either by increasing lrmd max_children or by increasing intervals of monitor operations

And here:

coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN: unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on rivendell-B: not running (7)

But why "not running"? It is not really true. Also some trouble with fencing:

coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN: unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN: common_apply_stickiness: Forcing fencing-of-B away from rivendell-A after 100 failures (max=100)

Thank you!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"In life, only moments are beautiful" [Ryszard Riedel]
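The "configuration advice" warning quoted above points at two knobs. A rough sketch of both, assuming the cluster-glue lrmd (which the lrmd log prefix suggests) and the crm shell; the values 8/30s are only examples, not recommendations from this thread:

```
# Allow lrmd to run more operations concurrently
# (runtime change; not persistent across lrmd restarts):
lrmadmin -p max-children 8

# And/or reduce contention by lengthening monitor intervals, e.g. in
# 'crm configure edit', change:
#   op monitor interval="10s"
# to something like:
#   op monitor interval="30s"
```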
Re: [Pacemaker] Pacemaker very often STONITHs other node
On 25/11/13 06:40, Michał Margula wrote:
> Hello!
>
> I wanted to ask for your help because we are having much trouble with a
> cluster based on Pacemaker.
>
> We have two identical nodes - PowerEdge R510 with 2x Xeon X5650, 64 GB
> of RAM, MegaRAID SAS 2108 RAID (PERC H700) - system disk on RAID 1 on
> SSDs (SSDSC2CW060A3) and two volumes - one RAID 1 with WD3000FYYZ and
> one RAID 1 with WD1002FBYS - both Western Digital disks. Both nodes are
> linked with two gigabit direct fiber links (no switch in between).
>
> We have two DRBD volumes - /dev/drbd1 (1 TB on WD1002FBYS disks) and
> /dev/drbd2 (3 TB on WD3000FYYZ disks). On top of DRBD (used as PVs) we
> have LVM with LVs for virtual machines which run under Xen.
>
> Here is our CRM configuration - http://pastebin.com/raqsvRTA
>
> We previously used fast USB drives instead of SSDs for the root
> filesystem, and it caused some trouble - it was lagging on I/O, and one
> node "thought" the other was having trouble and performed STONITH on it.
> After replacing them with SSDs we had no more trouble with that issue.
>
> But now, from time to time, one of the nodes gets STONITHed, and the
> reason is unclear to us. For example, last time we found this in the
> logs:
>
> Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM
> operation primitive-LVM:1_monitor_12 (call=54, rc=7, cib-update=124,
> confirmed=false) not running
>
> And after that, node rivendell-B got STONITHed. Previously we had
> trouble with DRBD - a node stopped DRBD for no apparent reason and,
> again - STONITH. Unfortunately we did not check the logs that time.
>
> Also, doing some tasks on one of the nodes (for example "crm resource
> migrate" of a few Xen virtual machines) can cause a STONITH too.
>
> Could you give us some hints? Maybe our configuration is wrong? To be
> honest, we had no previous experience with HA clusters, so we created it
> based on example configurations.
> It has been working for over a year now, but it is giving us headaches,
> and we are wondering if we should drop Pacemaker and use something else
> (even manual stopping and starting of virtual machines comes to mind).
>
> Thank you in advance!

My first thought is that the network is congested. That is a lot of servers to have on the system. Do you or can you isolate the corosync traffic from the drbd traffic?

Personally, I always set up a dedicated network for corosync, another for drbd and a third for all traffic to/from the servers. With this, I have never had a congestion-based problem.

If possible, please paste all logs from both nodes, starting just before the stonith occurred until recovery completed.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
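Isolating corosync onto its own network, as suggested above, comes down to pointing the totem interface stanza at the dedicated subnet. An illustrative corosync.conf fragment (the addresses are made up, not from this cluster):

```
totem {
    version: 2
    interface {
        ringnumber: 0
        # Network dedicated to cluster traffic, separate from DRBD and VMs:
        bindnetaddr: 192.168.100.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}
```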
Re: [Pacemaker] some questions about STONITH
>...snip...
>> Make the next test:
>> # stonith_admin --reboot=dev-cluster2-node2
>> The node reboots, but the resource doesn't start.
>> In crm_mon the status is: Node dev-cluster2-node2 (172793105): pending.
>> And it hangs there.
>
> That is *probably* a race - the node reboots too fast, or still
> communicates for a bit after the fence has supposedly completed (if it's
> not a reboot -nf, but a mere reboot). We have had problems here in the
> past.
>
> You may want to file a proper bug report with crm_report included, and
> preferably corosync/pacemaker debugging enabled.

It turns out it does not hang forever - a timeout fires after 20 minutes.

crm_report archive - http://send2me.ru/pen2.tar.bz2

Of course, the logs contain many entries of this type:

pgsql:1: Breaking dependency loop at msPostgresql

But where this dependency goes after the timeout, I do not understand.
[Pacemaker] Pacemaker very often STONITHs other node
Hello!

I wanted to ask for your help because we are having much trouble with a cluster based on Pacemaker.

We have two identical nodes - PowerEdge R510 with 2x Xeon X5650, 64 GB of RAM, MegaRAID SAS 2108 RAID (PERC H700) - system disk on RAID 1 on SSDs (SSDSC2CW060A3) and two volumes - one RAID 1 with WD3000FYYZ and one RAID 1 with WD1002FBYS - both Western Digital disks. Both nodes are linked with two gigabit direct fiber links (no switch in between).

We have two DRBD volumes - /dev/drbd1 (1 TB on WD1002FBYS disks) and /dev/drbd2 (3 TB on WD3000FYYZ disks). On top of DRBD (used as PVs) we have LVM with LVs for virtual machines which run under Xen.

Here is our CRM configuration - http://pastebin.com/raqsvRTA

We previously used fast USB drives instead of SSDs for the root filesystem, and it caused some trouble - it was lagging on I/O, and one node "thought" the other was having trouble and performed STONITH on it. After replacing them with SSDs we had no more trouble with that issue.

But now, from time to time, one of the nodes gets STONITHed, and the reason is unclear to us. For example, last time we found this in the logs:

Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM operation primitive-LVM:1_monitor_12 (call=54, rc=7, cib-update=124, confirmed=false) not running

And after that, node rivendell-B got STONITHed. Previously we had trouble with DRBD - a node stopped DRBD for no apparent reason and, again - STONITH. Unfortunately we did not check the logs that time.

Also, doing some tasks on one of the nodes (for example "crm resource migrate" of a few Xen virtual machines) can cause a STONITH too.

Could you give us some hints? Maybe our configuration is wrong? To be honest, we had no previous experience with HA clusters, so we created it based on example configurations.
It has been working for over a year now, but it is giving us headaches, and we are wondering if we should drop Pacemaker and use something else (even manual stopping and starting of virtual machines comes to mind).

Thank you in advance!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"In life, only moments are beautiful" [Ryszard Riedel]
Re: [Pacemaker] Need to relax corosync due to backup of VM through snapshot
On Sun, Nov 24, 2013 at 4:47 PM, Steven Dake wrote:
> Using a real-world example:
>
> token: 1
> retrans_before_loss_const: 10
>
> token will be retransmitted roughly every 1000 msec and the token will
> be determined lost after 1msec.

OK, thank you very much for clarifying this. I also took the time to post a comment on the openstack ha manual page with your words.

Gianluca
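For readers landing on this thread: the two values discussed live in the totem section of corosync.conf, and the retransmit interval is derived as roughly token divided by token_retransmits_before_loss_const. The numbers below are illustrative, not Steven's exact example:

```
totem {
    version: 2
    # Time (in ms) without seeing the token before it is declared lost:
    token: 10000
    # How many retransmit attempts fit in that window; the token is then
    # retransmitted about every token/token_retransmits_before_loss_const ms
    # (here roughly 10000 / 10 = 1000 ms):
    token_retransmits_before_loss_const: 10
}
```

Raising `token` is the usual way to "relax" corosync so a brief freeze (such as a VM snapshot) does not trigger fencing.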