Re: [ClusterLabs] After Startup, Pacemaker Gasps and Dies
Thanks for the suggestions. I basically uninstalled pacemaker and ripped every single file out of the system that had "pacemaker" in its name, then reinstalled and it is now working fine. I had already uninstalled and re-installed a few times, but the uninstall left some orphaned files. -- Eric Robinson -Original Message- From: Ken Gaillot [mailto:kgail...@redhat.com] Sent: Monday, July 25, 2016 7:52 AM To: users@clusterlabs.org Cc: Eric Robinson <eric.robin...@psmnv.com> Subject: Re: [ClusterLabs] After Startup, Pacemaker Gasps and Dies On 07/23/2016 05:30 PM, Eric Robinson wrote: > I've created a 15 or so Corosync+Pacemaker clusters and never had this > kind of issue. > > > > These servers are running the following software > > > > RHEL 6.3 > > pacemaker-libs-1.1.12-8.el6_7.2.x86_64 > > pacemaker-1.1.12-8.el6_7.2.x86_64 > > corosync-1.4.7-5.el6.x86_64 > > pacemaker-cluster-libs-1.1.12-8.el6_7.2.x86_64 > > pacemaker-cli-1.1.12-8.el6_7.2.x86_64 > > corosynclib-1.4.7-5.el6.x86_64 > > crmsh-2.0-1.el6.x86_64 > > > > Corosync starts fine and both nodes join the cluster. > > Pacemaker appears to start fine, but 'crm configure show' produces the > error... > > > > [root@ha14b ~]# crm configure show > > ERROR: running cibadmin -Ql: Could not establish cib_rw connection: > Connection refused (111) > > Signon to CIB failed: Transport endpoint is not connected > > Init failed, could not perform requested operations > > ERROR: configure: Missing requirements There have been many fixes since RHEL 6.3 -- I'd recommend upgrading to 6.8 if possible. (FYI, RHEL switched from crm to pcs in that time frame, so the command-line syntax is a little different.) The logs show that the cluster is using the "legacy plugin" to connect pacemaker to corosync. On RHEL 6, it's preferred to use CMAN instead of the plugin, so again, I'd recommend trying that first if possible. It would involve installing the cman package and reconfiguring corosync.conf. 
The immediate cause of the problem is that the CIB daemon is dumping core. Unfortunately, there's no indication of why in the logs. If you have the debuginfo versions of the packages available, you could try looking at the stack trace. However, no one is familiar with the code base in 6.3 anymore, so that might not be terribly useful. FYI to subscribe to this list, see http://clusterlabs.org/mailman/listinfo/users > > > After a short while Pacemaker dies... > > > > [root@ha14b ~]# service pacemaker status > > pacemakerd dead but pid file exists > > > > The Pacemaker log shows the following... > > > > [root@ha14a log]# cat pacemaker.log > > Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: crm_ipc_connect: > Could not establish pacemakerd connection: Connection refused (111) > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: config_find_next: > Processing additional service options... > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found 'pacemaker' for option: name > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found '1' for option: ver > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_cluster_type: > Detected an active 'classic openais (with plugin)' cluster > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: mcp_read_config: > Reading configure for stack: classic openais (with plugin) > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: config_find_next: > Processing additional service options... 
> > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found 'pacemaker' for option: name > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found '1' for option: ver > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Defaulting to 'no' for option: use_logd > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Defaulting to 'no' for option: use_mgmtd > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: config_find_next: > Processing additional logging options... > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found 'off' for option: debug > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found 'yes' for option: to_logfile > > Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: > Found '/var/log/corosync.log' for option: logfile > > Jul 22 23:29:45 [4616] ha14a pacemakerd: notice: crm_add_logfile: >
[ClusterLabs] subscribe
-- Eric Robinson Chief Information Officer Physician Select Management, LLC 775.885.2211 x 112 ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Re: Antw: Colocations and Orders Syntax Changed?
Indeed. My mistake. -- Eric Robinson

-Original Message-
From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
Sent: Friday, January 20, 2017 4:25 AM
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: Re: Antw: Colocations and Orders Syntax Changed?

>>> Eric Robinson <eric.robin...@psmnv.com> wrote on 20.01.2017 at 12:56 in message <dm5pr03mb2729d5003219b644b4e0bc7cfa...@dm5pr03mb2729.namprd03.prod.outlook.com>:
> Thanks for the input. I usually just do a 'crm config show > myfile.xml.date_time' and then read it back in if I need to.

I guess you mean 'crm configure show xml > myfile.xml.date_time', because here I get "ERROR: config: No such command" and no XML... ;-) Actually I'm using "cibadmin -Q -o configuration", because I think it's faster...

Regards,
Ulrich
Re: [ClusterLabs] Antw: Colocations and Orders Syntax Changed?
Thanks for the input. I usually just do a 'crm config show > myfile.xml.date_time' and then read it back in if I need to. -- Eric Robinson

> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Thursday, January 19, 2017 12:04 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Colocations and Orders Syntax Changed?
>
> Hi!
>
> This might not help, but messing up your cluster is a part of life. I decided to track cluster changes automatically (more human-usable than the diffs logged in pacemaker's log). So if the current configuration stops working, I look at the changes and try to undo them until things work again (hoping it's not the current pacemaker patch) ;-)
>
> So I have a script run by cron that saves the current configuration (readable and XML), then checks whether it differs from the one saved last. If so, the new configuration is saved. For convenience I add a hash to the files, and link a timestamp to the hashed files (so if you cycle between configurations, you'll save some space ;-)). So I can diff between any of the configurations saved (a kind of time machine)...
>
> Regards,
> Ulrich
>
> >>> Eric Robinson <eric.robin...@psmnv.com> wrote on 19.01.2017 at 05:08 in message <DM5PR03MB2729B7480FDF490F0DA81AC0FA7E0@DM5PR03MB2729.namprd03.prod.outlook.com>:
> >
> > Greetings!
> >
> > I have a lot of pacemaker clusters, each running multiple instances of mysql. I configure it so that the mysql resources are all dependent on an underlying stack of supporting resources which consists of a virtual IP address (p_vip), a filesystem (p_fs), often an LVM resource (p_lvm), and a drbd resource (p_drbd). If any resource in the underlying stack moves, then all of them move together and the mysql resources follow. However, each of the mysql resources can be stopped and started independently without impacting any other resources. I accomplish that with a configuration such as the following:
> >
> > colocation c_clust10 inf: ( p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 ) p_vip_clust10 p_fs_clust10 ms_drbd0:Master
> > order o_clust10 inf: ms_drbd0:promote p_fs_clust10 p_vip_clust10 ( p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 )
> >
> > This has suddenly stopped working. On my newest cluster I have the following. When I try to use the same approach, the configuration gets rearranged on me automatically. The parentheses get moved. Often each of the underlying resources is changed to the same thing with ":Master" following. Sometimes the whole colocation stanza gets replaced with raw xml. I have messed around with it, and the following is the best I can come up with, but when I stop a mysql resource everything else stops!
> >
> > colocation c_clust19 inf: ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_vip_clust19 p_mysql_702 p_mysql_743 p_fs_clust19 p_lv_on_drbd0 ) ( ms_drbd0:Master )
> > order o_clust19 inf: ms_drbd0:promote ( p_lv_on_drbd0:start ) ( p_fs_clust19 p_vip_clust19 ) ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_mysql_702 p_mysql_743 )
> >
> > The old cluster is running Pacemaker 1.1.10. The new one is running 1.1.12.
> >
> > What can I do to get it running right again? I want all the underlying resources (vip, fs, lvm, drbd) to move together. I want the mysql instances to be colocated with the underlying resources, but I want them to be independent of each other so they can each be started and stopped without hurting anything.
> >
> > --
> > Eric Robinson
[ClusterLabs] Colocations and Orders Syntax Changed?
Greetings!

I have a lot of pacemaker clusters, each running multiple instances of mysql. I configure it so that the mysql resources are all dependent on an underlying stack of supporting resources which consists of a virtual IP address (p_vip), a filesystem (p_fs), often an LVM resource (p_lvm), and a drbd resource (p_drbd). If any resource in the underlying stack moves, then all of them move together and the mysql resources follow. However, each of the mysql resources can be stopped and started independently without impacting any other resources. I accomplish that with a configuration such as the following:

colocation c_clust10 inf: ( p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 ) p_vip_clust10 p_fs_clust10 ms_drbd0:Master
order o_clust10 inf: ms_drbd0:promote p_fs_clust10 p_vip_clust10 ( p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 )

This has suddenly stopped working. On my newest cluster I have the following. When I try to use the same approach, the configuration gets rearranged on me automatically. The parentheses get moved. Often each of the underlying resources is changed to the same thing with ":Master" following. Sometimes the whole colocation stanza gets replaced with raw xml. I have messed around with it, and the following is the best I can come up with, but when I stop a mysql resource everything else stops!

colocation c_clust19 inf: ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_vip_clust19 p_mysql_702 p_mysql_743 p_fs_clust19 p_lv_on_drbd0 ) ( ms_drbd0:Master )
order o_clust19 inf: ms_drbd0:promote ( p_lv_on_drbd0:start ) ( p_fs_clust19 p_vip_clust19 ) ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_mysql_702 p_mysql_743 )

The old cluster is running Pacemaker 1.1.10. The new one is running 1.1.12.

What can I do to get it running right again? I want all the underlying resources (vip, fs, lvm, drbd) to move together. I want the mysql instances to be colocated with the underlying resources, but I want them to be independent of each other so they can each be started and stopped without hurting anything.

-- Eric Robinson
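[Editor's note: a sketch of what the poster appears to want, using crmsh's bracketed set syntax. The `[ ... ]` form (which the poster later tries elsewhere in this archive) marks the set members as unordered and independent of one another, unlike the parenthesized sets that crmsh kept rewriting. Untested here; resource names are taken from the message above.]

colocation c_clust10 inf: [ p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 ] p_vip_clust10 p_fs_clust10 ms_drbd0:Master
order o_clust10 inf: ms_drbd0:promote p_fs_clust10 p_vip_clust10 [ p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 ]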
Re: [ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.
Digimer, thanks for your thoughts. Booth is one of the solutions I looked at, but I don't like it because it is complex and difficult to implement (and perhaps costly in terms of AWS services or something similar). As I read through your comments, I returned again and again to the feeling that the troubles you described do not apply to the deaddrop scenario. Your observations are correct in that you cannot make assumptions about the state of the other node when all coms are down. You cannot count on the other node being in a predictable state. That is certainly true, and it is the very problem that I hope to address with DeadDrop. It provides a last-resort "back channel" for coms between the cluster nodes when all other coms are down, removing the element of assumption. Consider a few scenarios.

1. Data center A is primary, B is secondary. Coms are lost between A and B, but both of them can still reach the Internet. Node A notices loss of coms with B, but it is already primary so it cares not. Node B sees loss of normal cluster communication, and it might normally think of switching to primary, but first it checks the DeadDrop and it sees a note from A saying, "I'm fine and serving pages for customers." B aborts its plan to become primary. Later, after normal links are restored, B rejoins the cluster still as secondary. There is no element of assumption here.

2. Data center A is primary, B is secondary. A loses communication with the Internet, but not with B. B can still talk to the Internet. B initiates a graceful failover. Again, no assumptions.

3. Data center A is primary, B is secondary. Data center A goes completely dark. No communication to anything, not to B, and not to the outside world. B wants to go primary, but first it checks DeadDrop, and it finds that A is not leaving messages there either. It therefore KNOWS that A cannot reach the Internet and is not reachable by customers. No assumptions there. B assumes the primary role and customers are happy.
When A comes back online, it detects split-brain and refuses to join the cluster, notifying operators. Later, operators manually resolve the split brain. There is no perfect solution, of course, but it seems to me that this simple approach provides a level of availability beyond what you would normally get with a 2-node cluster. What am I missing?

-- Eric Robinson

-Original Message-
From: Digimer [mailto:li...@alteeve.ca]
Sent: Sunday, October 09, 2016 2:05 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.

On 09/10/16 04:33 PM, Eric Robinson wrote:
> I've been working on a script for preventing split-brain in 2-node clusters and I would appreciate comments from everyone. If someone already has a solution like this, let me know!
>
> Most of my database clusters are 2-nodes, with each node in a geographically separate data center. Our layout looks like the following diagram. Each server node has three physical connections to the world. LANs A, B, C, D are all physically separate cable plants and cross-connects between the data centers (using different switches, routers, power, fiber paths, etc.). This is to ensure maximum cluster communication intelligence. LANs A and B (Corosync ring 0) are bonded at the NICs, as are LANs C and D (Corosync ring 1).
>
> Hopefully this diagram will come through intact...
>
> [ASCII-art network diagram, garbled in the archive: both data centers connect through the Internet to a third-party web host, which serves as the DeadDrop]
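[Editor's note: the decision rule in the three scenarios above can be written down in a few lines. This is a toy model of my own, not Eric's actual script; the function name, parameters, and the 60-second freshness threshold are all invented for illustration.]

```python
def secondary_should_promote(last_primary_note_age_s, cluster_coms_up,
                             max_note_age_s=60):
    """Toy DeadDrop check: should the secondary take over the primary role?

    `last_primary_note_age_s` is how old the primary's last "I'm alive"
    note at the shared drop point (the third-party web host) is.
    """
    if cluster_coms_up:
        # Normal cluster channels are working; they decide, not the drop.
        return False
    # Coms are down: promote only if the primary has also stopped leaving
    # notes, i.e. it can no longer reach the Internet either (scenario 3).
    return last_primary_note_age_s > max_note_age_s

# Scenario 1: coms down, but primary still posting notes -> stay secondary.
print(secondary_should_promote(5, cluster_coms_up=False))    # False
# Scenario 3: data center A completely dark -> promote.
print(secondary_should_promote(300, cluster_coms_up=False))  # True
```

The interesting property is that the drop point converts "I can't hear A" (an assumption-laden observation) into "A can't reach the Internet" (a customer-relevant fact), which is exactly the distinction the scenarios above rely on.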
Re: [ClusterLabs] Establishing Timeouts
I'm mostly interested in preventing false-positive cluster failovers that might occur during manual network maintenance (for example, testing switch and link outages).

>> Thanks for the clarification. So what's the easiest way to ensure that the cluster waits a desired timeout before deciding that a re-convergence is necessary?

> By raising the token (lost) timeout I would say.
> Please correct me (Chrissie) but I see the token (lost) timeout somehow as resilience against static delays + jitter on top, and the token_retransmits_before_loss_const as resilience against packet loss.
Re: [ClusterLabs] Establishing Timeouts
Basically, when we turn off a switch, I want to keep the cluster from failing over before Linux bonding has had a chance to recover. I'm mostly interested in preventing false-positive cluster failovers that might occur during manual network maintenance (for example, testing switch and link outages).

>> Thanks for the clarification. So what's the easiest way to ensure that the cluster waits a desired timeout before deciding that a re-convergence is necessary?

> By raising the token (lost) timeout I would say.
> Please correct me (Chrissie) but I see the token (lost) timeout somehow as resilience against static delays + jitter on top, and the token_retransmits_before_loss_const as resilience against packet loss.
[ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.
role in arbitration. Thoughts?

-- Eric Robinson
[ClusterLabs] Establishing Timeouts
I have about a dozen corosync+pacemaker clusters and I am just now getting around to understanding timeouts. Most of my corosync.conf files look something like this:

version: 2
token: 5000
token_retransmits_before_loss_const: 10
join: 1000
consensus: 7500
vsftype: none
max_messages: 20
secauth: off
threads: 0
clear_node_high_bit: yes
rrp_mode: active

If I understand this correctly, this means the node will wait 50 seconds (5000ms x 10) before deciding that a cluster reconfig is necessary (perhaps after a link failure). Is that correct? I'm trying to understand how this works together with my bonded NICs' arp_interval settings. I normally set arp_interval=1000. My question is, how many arp losses are required before the bonding driver decides to fail over to the other link? If arp_interval=1000, how many times does the driver send an arp and fail to receive a reply before it decides that the link is dead? I think I need to know this so I can set my corosync.conf settings correctly to avoid "false positive" cluster failovers. In other words, if there is a link or switch failure, I want to make sure that the cluster allows plenty of time for link communication to recover before deciding that a node has actually died.

-- Eric Robinson
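[Editor's note: a quick sanity check of the arithmetic above. Hedged: my reading of corosync's totem documentation is that `token` is the whole loss timeout, with the retransmit attempts spread inside that window, so the detection time is closer to token + consensus than to token x retransmits (i.e. about 12.5s with the posted values, not 50s). The "2 missed intervals" for bonding is a purely hypothetical placeholder, since the driver's actual threshold is the open question in this thread.]

```python
def corosync_detection_ms(token_ms, consensus_ms):
    """Rough upper bound from link failure to a new membership forming:
    token loss is declared after `token` ms, then `consensus` must expire."""
    return token_ms + consensus_ms

def bonding_failover_ms(arp_interval_ms, missed_intervals):
    """Time for the bonding driver to declare the active slave down, if it
    tolerates `missed_intervals` missed ARP replies (hypothetical count)."""
    return arp_interval_ms * missed_intervals

# Values from the corosync.conf above; 2 missed intervals is a guess.
print(corosync_detection_ms(5000, 7500))  # 12500
print(bonding_failover_ms(1000, 2))       # 2000
# Bonding fails over well inside the corosync window, which is the
# "no false-positive failover" property the poster is after.
print(bonding_failover_ms(1000, 2) < corosync_detection_ms(5000, 7500))  # True
```

If the arithmetic comes out the other way (bonding slower than the token window), raising `token` is the lever the later replies in this thread suggest.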
[ClusterLabs] Easy Linux Bonding Question?
Short version: How many missed arp_intervals does the bonding driver wait before removing the PASSIVE slave from the bond?

Long version: I'm confused about this because I know the passive slave watches for the active slave's arp broadcast as a way of knowing that the passive slave's link is good. However, if the switch to which the active slave is connected fails, then NEITHER the active slave nor the passive slave will see a packet. (The active slave won't get a reply from the target, and the passive slave won't see the active's request.) So how does the bonding driver decide if it should deactivate the passive slave too? I'm assuming it goes active immediately and begins sending its own arp requests to the target, and if it still does not get a response, then it is removed from the bond too, resulting in NO active slaves. This means that if you have two servers that are looking at each other as arp_targets, you could end up with race conditions where there are no active interfaces at either end. That would be bad. To prevent that, I want to configure different timeouts at each end, but the downdelay parameter only works with miimon. How do I control the delay with arp_ip_target?

-- Eric Robinson
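[Editor's note: the failure mode described in the "long version" can be laid out as a small truth table. This is a toy model of the poster's reasoning, not the kernel bonding driver; the function name and inputs are invented for illustration.]

```python
def surviving_slaves(active_sees_reply, backup_sees_active_arp,
                     backup_probe_gets_reply):
    """Which slaves remain usable, per the poster's mental model:
    the active slave needs ARP replies from the target; the backup is
    happy if it either overhears the active's ARP requests or (after
    going active itself) gets replies to its own probes."""
    slaves = set()
    if active_sees_reply:
        slaves.add("active")
    if backup_sees_active_arp or backup_probe_gets_reply:
        slaves.add("backup")
    return slaves

# Active slave's switch fails: no replies, and the backup no longer
# overhears the active's requests. If the backup's own probes also fail,
# the bond is left with no usable slave, which is the race the poster
# worries about when two servers use each other as arp_ip_target.
print(surviving_slaves(False, False, False))  # set()
# If the backup's path is still good, it takes over and the bond survives.
print(surviving_slaves(False, False, True))
```

The two-server race follows directly: if each server's only arp_ip_target is the other server, both ends can land in the first (empty-set) row at the same time.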
[ClusterLabs] Trying this question again re: arp_interval
Does anyone know how many arp_intervals must pass without a reply before the bonding driver downs the primary NIC? Just one?

-- Eric Robinson
[ClusterLabs] Can Bonding Cause a Broadcast Storm?
If a Linux server with bonded interfaces attached to different switches is rebooted, is it possible that a bridge loop could result for a brief period? We noticed that one of our 100 Linux servers became unresponsive and appears to have rebooted. (The cause has not been determined.) A couple of minutes afterwards, we saw a gigantic spike in traffic on all switches in the network that lasted for about 7 minutes, causing latency and packet loss on the network. Everything was still reachable, but slowly. The condition stopped as soon as the Linux server in question became reachable again.

-- Eric Robinson
Re: [ClusterLabs] Can Bonding Cause a Broadcast Storm?
mode 1. No special switch configuration. spanning tree not enabled. I have 100+ Linux servers, all of which use bonding. The network has been stable for 10 years. No changes recently. However, this is the second time that we have seen high latency and traced it down to the behavior of one particular server. I'm wondering if there is something about bonding that could result in a temporary bridge loop. From: Jeremy Voorhis <jvoor...@gmail.com> Sent: Tuesday, November 15, 2016 2:13:59 PM To: Cluster Labs - All topics related to open-source clustering welcomed Subject: Re: [ClusterLabs] Can Bonding Cause a Broadcast Storm? What bonding mode are you using? Some modes require additional configuration from the switch to avoid flooding. Also, is spanning tree enabled on the switches? On Tue, Nov 15, 2016 at 1:26 PM Eric Robinson <eric.robin...@psmnv.com<mailto:eric.robin...@psmnv.com>> wrote: If a Linux server with bonded interfaces attached to different switches is rebooted, is it possible that a bridge loop could result for a brief period? We noticed that one of our 100 Linux servers became unresponsive and appears to have rebooted. (The cause has not been determined.) A couple of minutes afterwards, we saw a gigantic spike in traffic on all switches in the network that lasted for about 7 minutes, causing latency and packet loss on the network. Everything was still reachable, but slowly. The condition stopped as soon as the Linux server in question became reachable again. 
-- Eric Robinson
Re: [ClusterLabs] Antw: Establishing Timeouts
> AFAIK, if _all_ ARP targets did not respond _once_ the link will be considered down

It would be great if someone could confirm that.

> after "Down Delay". I guess you want to use multiple (and the correct ones) ARP IP targets...

Yes, I use multiple targets, and arp_all_targets=any. Down Delay only applies to miimon links, I'm afraid, not to arp_ip_target.

--Eric
Re: [ClusterLabs] Can't See Why This Cluster Failed Over
> crm configure show xml c_clust19

Here is what I am entering using crmsh (version 2.0-1):

colocation c_clust19 inf: [ p_mysql_057 p_mysql_092 p_mysql_187 ] p_vip_clust19 p_fs_clust19 p_lv_on_drbd0 ms_drbd0:Master
order o_clust19 inf: ms_drbd0:promote p_lv_on_drbd0 p_fs_clust19 p_vip_clust19 [ p_mysql_057 p_mysql_092 p_mysql_187 ]

After I save it, I get no errors, but it converts it to this...

colocation c_clust19 inf: [ p_mysql_057 p_mysql_092 p_mysql_187 ] ( p_vip_clust19:Master p_fs_clust19:Master p_lv_on_drbd0:Master ) ( ms_drbd0:Master )
order o_clust19 inf: ms_drbd0:promote ( p_lv_on_drbd0:start p_fs_clust19:start p_vip_clust19:start ) [ p_mysql_057 p_mysql_092 p_mysql_187 ]

This looks incorrect to me. Here is the xml that it generates. The resources in set c_clust19-1 should start sequentially, starting with p_lv_on_drbd0 and ending with p_vip_clust19. I also don't understand why p_lv_on_drbd0 and p_vip_clust19 are getting the Master designation.

-- Eric Robinson
Re: [ClusterLabs] Fraud Detection Check?
> -Original Message-
> From: Dmitri Maziuk [mailto:dmitri.maz...@gmail.com]
> Sent: Thursday, April 13, 2017 8:30 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Fraud Detection Check?
>
> On 2017-04-13 01:39, Jan Pokorný wrote:
>
> > After a bit of a search, the best practice at the list server seems to be:
> >
> >> [...] if you change the message (eg, by adding a list signature or by adding the list name to the Subject field), you *should* DKIM sign.
>
> This is of course going entirely off-topic for the list, but DKIM's stated purpose is to sign mail coming from *.clusterlabs.org with a key from clusterlabs.org's dns zone file. DKIM is not mandatory, so you strip all existing dkim headers and then either sign or not, it's up to you.
>
> None of this is new. SourceForge's list manager, for example, adds SF footers *inside* the PGP-signed MIME part, resulting in the exact same "invalid signature" problem.
>
> Dima

Thanks for all the feedback, guys. Bottom line for me is that I'm only seeing it in messages that I send to the ClusterLabs list, but you guys are not seeing it at all, even in my messages. So if it isn't bothering anybody else, it is also not bothering me enough to do anything about it. I believe the problem is on my end (or rather, at Office 365), but if it is not being seen on the list then I don't care that much. I just want to make sure people on the list are not getting alerts that my mails are fraudulent.

-- Eric Robinson
Re: [ClusterLabs] 2-Node Cluster Pointless?
> -Original Message-
> From: Digimer [mailto:li...@alteeve.ca]
> Sent: Sunday, April 16, 2017 11:17 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>
> Subject: Re: [ClusterLabs] 2-Node Cluster Pointless?
>
> On 16/04/17 01:53 PM, Eric Robinson wrote:
> > I was reading in "Clusters from Scratch" where Beekhof states, "Some would argue that two-node clusters are always pointless, but that is an argument for another time." Is there a page or thread where this argument has been fleshed out? Most of my dozen clusters are 2 nodes. I hate to think they're pointless.
> >
> > --
> > Eric Robinson
>
> There is a belief that you can't build a reliable cluster without quorum. I am of the mind that you *can* build a very reliable 2-node cluster. In fact, every cluster our company has deployed, going back over five years, has been 2-node, and they have had exceptional uptimes.
>
> The confusion comes from the belief that quorum is required and stonith is optional. The reality is the opposite. I'll come back to this in a minute.
>
> In a two-node cluster, you have two concerns;
>
> 1. If communication between the nodes fails, but both nodes are alive, how do you avoid a split brain?
>
> 2. If you have a two-node cluster and enable cluster startup on boot, how do you avoid a fence loop?
>
> Many answer #1 by saying "you need a quorum node to break the tie". In some cases, this works, but only when all nodes are behaving in a predictable manner.
>
> Many answer #2 by saying "well, with three nodes, if a node boots and can't talk to either other node, it is inquorate and won't do anything". This is a valid mechanism, but it is not the only one.
>
> So let me answer these from a 2-node perspective;
>
> 1. You use stonith and the faster node lives, the slower node dies.
> From the moment of comms failure, the cluster blocks (needed with quorum, too) and doesn't restore operation until the (slower) peer is in a known state; Off. You can bias this by setting a fence delay against your preferred node. So say node 1 is the node that normally hosts your services; then you add 'delay="15"' to node 1's fence method. This tells node 2 to wait 15 seconds before fencing node 1. If both nodes are alive, node 2 will be fenced before the timer expires.
>
> 2. In Corosync v2+, there is a 'wait_for_all' option that tells a node not to do anything until it is able to talk to the peer node. So in the case of a fence after a comms break, the node that reboots will come up, fail to reach the survivor node, and do nothing more. Perfect.
>
> Now let me come back to quorum vs. stonith;
>
> Said simply: Quorum is a tool for when everything is working. Fencing is a tool for when things go wrong.
>
> Let's assume that your cluster is working fine, then for whatever reason, node 1 hangs hard. At the time of the freeze, it was hosting a virtual IP and an NFS service. Node 2 declares node 1 lost after a period of time and decides it needs to take over;
>
> In the 3-node scenario, without stonith, node 2 reforms a cluster with node 3 (quorum node), decides that it is quorate, starts its NFS server and takes over the virtual IP. So far, so good... Until node 1 comes out of its hang. At that moment, node 1 has no idea time has passed. It has no reason to think "am I still quorate? Are my locks still valid?" It just finishes whatever it was in the middle of doing and bam, split-brain. At the least, you have two nodes claiming the same IP at the same time. At worst, you had uncoordinated writes to shared storage and you've corrupted your data.
>
> In the 2-node scenario, with stonith, node 2 is always quorate, so after declaring node 1 lost, it moves to fence node 1.
Once node 1 is fenced, > *then* it starts NFS, takes over the virtual IP and restores services. > In this case, no split-brain is possible because node 1 has rebooted and > comes up with a fresh state (or it's on fire and never coming back anyway). > > This is why quorum is optional and stonith/fencing is not. > > Now, with this said, I won't say that 3+ node clusters are bad. They're fine > if > they suit your use-case, but even with 3+ nodes you still must use stonith. > > My *personal* arguments in favour of 2-node clusters over 3+ nodes is this; > > A cluster is not beautiful when there is nothing left to add. It is beautiful > when there is nothing left to take away. > > In availability clustering, nothing should ev
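Concretely, the two mechanisms described in that message map to a couple of small configuration changes. This is only an illustrative sketch — the fence agent, node names, and values are assumptions, not taken from the thread:

```
# corosync.conf (corosync 2.x) -- addresses concern #2 (fence loops):
# a booting node stays idle until it has seen its peer at least once
quorum {
    provider: corosync_votequorum
    two_node: 1        # in corosync 2.x this also turns on wait_for_all
    wait_for_all: 1
}
```

Concern #1 (the fence race) is then biased on the Pacemaker side by adding a delay to the preferred node's fence device — e.g. `delay=15` in the parameters of a `stonith:fence_ipmilan` primitive targeting node 1 — so node 2 must wait 15 seconds before it may fence node 1.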
Re: [ClusterLabs] 2-Node Cluster Pointless?
> In a shared-nothing cluster, "split brain" means whichever MAC address > is in the ARP cache of the border router is the one that gets the traffic. > How does the existing code figure this one out? I'm guessing the surviving node broadcasts a gratuitous ARP reply. -- Eric Robinson ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
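That is indeed the usual mechanism: the node taking over the address broadcasts gratuitous ARPs (the IPaddr2 resource agent does this itself via send_arp). The effect can be reproduced by hand with iputils arping — the interface and address below are hypothetical:

```sh
# Broadcast three gratuitous ARP replies for the service IP so every ARP
# cache on the segment (including the border router) re-learns which MAC
# now owns the address
arping -A -c 3 -I eth0 192.168.1.10   # -A: unsolicited ARP reply mode
```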
Re: [ClusterLabs] 2-Node Cluster Pointless?
> This isn't the first time this has come up, so I decided > to elaborate on this email by writing an article on the topic. > It's a first-draft so there are likely spelling/grammar > mistakes. However, the body is done. > https://www.alteeve.com/w/The_2-Node_Myth It looks like my question was well-timed, as it served as a catalyst for you to write the article. Thanks much, I am working through it now and will doubtless have some questions and comments. Before I say anything more, I want to do some testing in my lab to make sure I have my thoughts collected. -- Eric Robinson
Re: [ClusterLabs] Can't See Why This Cluster Failed Over
Somebody want to look at this log and tell me why the cluster failed over? All we did was add a new resource. We've done it many times before without any problems.
--
Apr 03 08:50:30 [22762] ha14a cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to master (origin=local/cibadmin/2)
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: Diff: --- 0.605.2 2
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: Diff: +++ 0.607.0 65654c97e62cd549f22f777a5290fe3a
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: + /cib: @epoch=607, @num_updates=0
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/configuration/resources:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/configuration/resources:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/configuration/constraints/rsc_colocation[@id='c_clust19']/resource_set[@id='c_clust19-0']:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/configuration/constraints/rsc_colocation[@id='c_clust19']/resource_set[@id='c_clust19-0']:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/configuration/constraints/rsc_order[@id='o_clust19']/resource_set[@id='o_clust19-3']:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/configuration/constraints/rsc_order[@id='o_clust19']/resource_set[@id='o_clust19-3']:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=ha14a/cibadmin/2, version=0.607.0)
Apr 03 08:50:30 [22762] ha14a cib: info: write_cib_contents: Archived previous version as /var/lib/pacemaker/cib/cib-36.raw
Apr 03 08:50:30 [22762] ha14a cib: info: write_cib_contents: Wrote version 0.607.0 of the CIB to disk (digest: 1afdb9e480f870a095aa9e39719d29c4)
Apr 03 08:50:30 [22762] ha14a cib: info: retrieveCib: Reading cluster configuration from: /var/lib/pacemaker/cib/cib.DkIgSs (digest: /var/lib/pacemaker/cib/cib.hPwa66)
Apr 03 08:50:30 [22764] ha14a lrmd: info: process_lrmd_get_rsc_info: Resource 'p_mysql_745' not found (17 active resources)
Apr 03 08:50:30 [22764] ha14a lrmd: info: process_lrmd_rsc_register: Added 'p_mysql_745' to the rsc list (18 active resources)
Apr 03 08:50:30 [22767] ha14a crmd: info: do_lrm_rsc_op: Performing key=10:7484:7:91ef4b03-8769-47a1-a364-060569c46e52 op=p_mysql_745_monitor_0
Apr 03 08:50:30 [22764] ha14a lrmd: info: process_lrmd_get_rsc_info: Resource 'p_mysql_746' not found (18 active resources)
Apr 03 08:50:30 [22764] ha14a lrmd: info: process_lrmd_rsc_register: Added 'p_mysql_746' to the rsc list (19 active resources)
Apr 03 08:50:30 [22767] ha14a crmd: info: do_lrm_rsc_op: Performing key=11:7484:7:91ef4b03-8769-47a1-a364-060569c46e52 op=p_mysql_746_monitor_0
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: Diff: --- 0.607.0 2
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: Diff: +++ 0.607.1 (null)
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: + /cib: @num_updates=1
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/status/node_state[@id='ha14b']/lrm[@id='ha14b']/lrm_resources:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++
Apr 03 08:50:30 [22762] ha14a cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=ha14b/crmd/7665, version=0.607.1)
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: Diff: --- 0.607.1 2
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: Diff: +++ 0.607.2 (null)
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: + /cib: @num_updates=2
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++ /cib/status/node_state[@id='ha14b']/lrm[@id='ha14b']/lrm_resources:
Apr 03 08:50:30 [22762] ha14a cib: info: cib_perform_op: ++
Apr 03 08:50:30 [22762] ha14a cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=ha14b/crmd/7666, version=0.607.2)
Apr 03 08:50:30 [22767] ha14a crmd: notice: process_lrm_event: Operation p_mysql_745_monitor_0: not running (node=ha14a, call=142, rc=7, cib-update=88, confirmed=true)
Apr 03 08:50:30 [22767] ha14a crmd: notice: process_lrm_event: ha14a-p_mysql_745_monitor_0:142 [ not started\n ]
Apr 03 08:50:30 [22762]
Re: [ClusterLabs] Fraud Detection Check?
> I've received your emails without any alteration or flagging as "fraud". > So I don't think we're doing anything to your emails. Good to know. -- Eric Robinson
Re: [ClusterLabs] Fraud Detection Check?
>> You guys got a thing against Office 365? > doesn't everybody? Fair enough. -- Eric Robinson
Re: [ClusterLabs] Fraud Detection Check?
> On a serious note, I too received your e-mails without any red flags attached. Thanks for the confirmation. I guess I'm the only one seeing those warnings. Maybe Office 365 has a problem with ClusterLabs. ;-) -- Eric Robinson
Re: [ClusterLabs] Antw: DRBD and SSD TRIM - Slow!
1) iotop did not show any significant io, just maybe 30k/second of drbd traffic. 2) okay. I've never done that before. I'll give it a shot. 3) I'm not sure what I'm looking at there. -- Eric Robinson > -Original Message- > From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > Sent: Tuesday, August 01, 2017 11:28 PM > To: users@clusterlabs.org > Subject: [ClusterLabs] Antw: DRBD and SSD TRIM - Slow! > > Hi! > > I know little about trim operations, but you could try one of these: > > 1) iotop to see whether some I/O is done during trimming (assuming > trimming itself is not considered to be I/O) > > 2) Try blocktrace on the affected devices to see what's going on. It's hard to > set up and to extract the info you are looking for, but it provides deep > insights > > 3) Watch /sys/block/$BDEV/stat for performance statistics. I don't know how > well DRBD supports these, however (e.g. MDRAID shows no wait times and > no busy operations, while a multipath map has it all). > > Regards, > Ulrich > > >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 02.08.2017 um > >>> 07:09 in > Nachricht > <DM5PR03MB27297014DF96DC01FE849A63FAB00@DM5PR03MB2729.nampr > d03.prod.outlook.com> > > > Does anyone know why trimming a filesystem mounted on a DRBD volume > > takes so long? I mean like three days to trim a 1.2TB filesystem. > > > > Here are some pertinent details: > > > > OS: SLES 12 SP2 > > Kernel: 4.4.74-92.29 > > Drives: 6 x Samsung SSD 840 Pro 512GB > > RAID: 0 (mdraid) > > DRBD: 9.0.8 > > Protocol: C > > Network: Gigabit > > Utilization: 10% > > Latency: < 1ms > > Loss: 0% > > Iperf test: 900 mbits/sec > > > > When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches). > > When I trim a non-DRBD partition, it completes fast. > > When I write to a DRBD volume, I get 80MB/sec. > > > > When I trim a DRBD volume, it takes bloody ages! 
> > > > -- > > Eric Robinson
[ClusterLabs] DRBD and SSD TRIM - Slow!
Does anyone know why trimming a filesystem mounted on a DRBD volume takes so long? I mean like three days to trim a 1.2TB filesystem.

Here are some pertinent details:

OS: SLES 12 SP2
Kernel: 4.4.74-92.29
Drives: 6 x Samsung SSD 840 Pro 512GB
RAID: 0 (mdraid)
DRBD: 9.0.8
Protocol: C
Network: Gigabit
Utilization: 10%
Latency: < 1ms
Loss: 0%
Iperf test: 900 mbits/sec

When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches). When I trim a non-DRBD partition, it completes fast. When I write to a DRBD volume, I get 80MB/sec. When I trim a DRBD volume, it takes bloody ages!

-- Eric Robinson
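For measuring this kind of thing, the trim can be timed explicitly instead of waiting on a mount-time discard — an illustrative session, with a hypothetical mount point:

```sh
# Time a one-shot trim; -v reports how many bytes were discarded
time fstrim -v /mnt/drbd0

# Show the discard limits each layer of the stack advertises
# (DISC-GRAN / DISC-MAX); a small DISC-MAX forces the trim to be split
# into many small requests
lsblk -D
```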
Re: [ClusterLabs] Antw: Re: Antw: DRBD and SSD TRIM - Slow! -- RESOLVED!
For anyone else who has this problem: we have reduced the time required to trim a 1.3TB volume from 3 days to 1.5 minutes.

Initially, we had used mdraid to build a raid0 array with a 32K chunk size. We initialized it as a drbd disk, synced it, built an lvm logical volume on it, and created an ext4 filesystem on the volume. Creating the filesystem and trimming it took 3 days (each time, every time, across multiple tests).

When running lsblk -D, we noticed that the DISC-MAX value for the array was only 32K, compared to 4GB for the SSD drive itself. We also noticed that the number matched the chunk size. We deleted the array and built a new one with a 4MB chunk size. The DISC-MAX value changed to 4MB, which is the max selectable chunk size (but still way below the other DISC-MAX values shown in lsblk -D). We realized that, when using mdadm, the DISC-MAX value ends up matching the array chunk size. We theorized that the small DISC-MAX value was responsible for the slow trim rate across the DRBD link.

Instead of using mdadm to build the array, we used LVM to create a striped logical volume and made that the backing device for DRBD. Then lsblk -D showed a DISC-MAX size of 128MB. Creating an ext4 filesystem on it and trimming it took only 1.5 minutes (across multiple tests).

Somebody knowledgeable may be able to explain how DISC-MAX affects the trim speed, and why the DISC-MAX value is different when creating the array with mdadm versus lvm.

-- Eric Robinson

> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Wednesday, August 02, 2017 11:36 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Antw: DRBD and SSD TRIM - Slow!
>
> >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 02.08.2017 um 23:20 in Nachricht <DM5PR03MB2729C66CEC1E3B8B9E297185FAB00@DM5PR03MB2729.namprd03.prod.outlook.com>
>
> > 1) iotop did not show any significant io, just maybe 30k/second of drbd traffic.
> >
> > 2) okay.
I've never done that before. I'll give it a shot. > > > > 3) I'm not sure what I'm looking at there. > > See /usr/src/linux/Documentation/block/stat.txt ;-) I wrote an NRPE plugin > to monitor those with performance data and verbose text output, e.g.: > CFS_VMs-xen: [delta 120s], 1.15086 IO/s read, 60.7789 IO/s write, 0 req/s > read merges, 0 req/s write merges, 4.53674 sec/s read, 486.231 sec/s write, > 2.36844 ms/s read wait, 2702.19 ms/s write wait, 0 req in_flight, 115.987 ms/s > active, 2704.53 ms/s wait > > Regards, > Ulrich > > > > > -- > > Eric Robinson > > > >> -Original Message- > >> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > >> Sent: Tuesday, August 01, 2017 11:28 PM > >> To: users@clusterlabs.org > >> Subject: [ClusterLabs] Antw: DRBD and SSD TRIM - Slow! > >> > >> Hi! > >> > >> I know little about trim operations, but you could try one of these: > >> > >> 1) iotop to see whether some I/O is done during trimming (assuming > >> trimming itself is not considered to be I/O) > >> > >> 2) Try blocktrace on the affected devices to see what's going on. > >> It's hard > > to > >> set up and to extract the info you are looking for, but it provides > >> deep insights > >> > >> 3) Watch /sys/block/$BDEV/stat for performance statistics. I don't > >> know how well DRBD supports these, however (e.g. MDRAID shows no > wait > >> times and no busy operations, while a multipath map has it all). > >> > >> Regards, > >> Ulrich > >> > >> >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 02.08.2017 um > >> >>> 07:09 in > >> Nachricht > >> > <DM5PR03MB27297014DF96DC01FE849A63FAB00@DM5PR03MB2729.nampr > >> d03.prod.outlook.com> > >> > >> > Does anyone know why trimming a filesystem mounted on a DRBD > volume > >> > takes so long? I mean like three days to trim a 1.2TB filesystem. 
> >> > > >> > Here are some pertinent details: > >> > > >> > OS: SLES 12 SP2 > >> > Kernel: 4.4.74-92.29 > >> > Drives: 6 x Samsung SSD 840 Pro 512GB > >> > RAID: 0 (mdraid) > >> > DRBD: 9.0.8 > >> > Protocol: C > >> > Network: Gigabit > >> > Utilization: 10% > >> > Latency: < 1ms > >> > Loss: 0% > >> > Iperf test: 900 mbits/sec > >> > > >> > When I write to a non-DRBD partition, I get 400MB/sec (bypassing > caches). > >&g
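The fix described above — striping with LVM instead of mdraid so the stack advertises a sane DISC-MAX — might look like the following. The device names and stripe size are hypothetical; the thread does not record the exact commands used:

```sh
# Striped logical volume across the six SSDs, used as DRBD's backing device
pvcreate /dev/sd[b-g]
vgcreate vg_drbd /dev/sd[b-g]
lvcreate --type striped -i 6 -I 4M -L 1.3T -n lv_drbd0 vg_drbd

# Confirm the advertised discard limits before layering DRBD on top;
# the post reports a DISC-MAX of 128MB for the LVM stripe, versus the
# chunk-sized value mdraid exposed
lsblk -D /dev/vg_drbd/lv_drbd0
```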
[ClusterLabs] verify status starts at 100% and stays there?
I have drbd 9.0.8. I started an online verify, and immediately checked status, and I see...

ha11a:/ha01_mysql/trimtester # drbdadm status
ha01_mysql role:Primary
  disk:UpToDate
  ha11b role:Secondary
    replication:VerifyT peer-disk:UpToDate done:100.00

...which looks like it is finished, but the tail of dmesg says...

[336704.851209] drbd ha01_mysql/0 drbd0 ha11b: repl( Established -> VerifyT )
[336704.851244] drbd ha01_mysql/0 drbd0: Online Verify start sector: 0

...which looks like the verify is still in progress. So is it done, or is it still in progress? Is this a drbd bug?

-- Eric Robinson
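One way to cross-check whether the verify is really still running is to watch the kernel log and the lower-level statistics alongside 'drbdadm status' — a sketch, assuming drbd-utils for DRBD 9 (the resource name is taken from the post):

```sh
# Per-peer statistics for the resource; the counters move while a
# verify is in flight
drbdsetup status ha01_mysql --verbose --statistics

# The kernel log marks both the start and the completion of an online verify
dmesg | grep -i 'online verify'
```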
Re: [ClusterLabs] Antw: verify status starts at 100% and stays there?
Yeah, UpToDate was not of concern to me. The part that threw me off was "done:100.00." It did eventually finish, though, and that was shown in the dmesg output. However, 'drbdadm status' said "done:100.00" the whole time, from start to finish, which seems weird. -- Eric Robinson > -Original Message- > From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > Sent: Thursday, August 03, 2017 11:25 PM > To: users@clusterlabs.org > Subject: [ClusterLabs] Antw: verify status starts at 100% and stays there? > > >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 04.08.2017 um > >>> 06:53 in > Nachricht > <DM5PR03MB2729739B8FC91B96F0CD3BE8FAB60@DM5PR03MB2729.namprd03.prod.outlook.com> > > > I have drbd 9.0.8. I started an online verify, and immediately checked > > status, and I see... > > > > ha11a:/ha01_mysql/trimtester # drbdadm status ha01_mysql role:Primary > > disk:UpToDate > > ha11b role:Secondary > > replication:VerifyT peer-disk:UpToDate done:100.00 > > > > ...which looks like it is finished, but the tail of dmesg says... > > > > [336704.851209] drbd ha01_mysql/0 drbd0 ha11b: repl( Established -> > > VerifyT ) [336704.851244] drbd ha01_mysql/0 drbd0: Online Verify start > > sector: 0 > > > > ...which looks like the verify is still in progress. > > > > So is it done, or is it still in progress? Is this a drbd bug? > > Not deep into DRBD, but I guess "disk:UpToDate" just indicates that up to > the present moment DRBD thinks the disks are up to date (unless verify > would detect otherwise). Maybe there should be an additional status like > "syncing, verifying, etc."
> > Regards, > Ulrich
[ClusterLabs] ClusterLabs.Org Documentation Problem?
The documentation located here... http://clusterlabs.org/doc/ ...is confusing because it offers two combinations:

Pacemaker 1.0 for Corosync 1.x
Pacemaker 1.1 for Corosync 2.x

According to the documentation, if you use Corosync 1.x you need Pacemaker 1.0, but if you use Corosync 2.x then you need Pacemaker 1.1. However, on my CentOS 6.9 system, when I do 'yum install pacemaker corosync' I get the following versions:

pacemaker-1.1.15-5.el6.x86_64
corosync-1.4.7-5.el6.x86_64

What's the correct answer? Does Pacemaker 1.1.15 work with Corosync 1.4.7? If so, is the documentation at ClusterLabs misleading?

-- Eric Robinson
Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?
Thanks for the reply. Yes, it's a bit confusing. I did end up using the documentation for Corosync 2.X since that seemed newer, but it also assumed CentOS/RHEL 7 and systemd-based commands. It also incorporates cman, pcsd, psmisc, and policycoreutils-python, which are all new to me. If there is anything I can do to assist with getting the documentation cleaned up, I'd be more than glad to help.

-- Eric Robinson

-Original Message-
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: Tuesday, August 22, 2017 2:08 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?

On Tue, 2017-08-22 at 19:40 +, Eric Robinson wrote:
> The documentation located here…
>
> http://clusterlabs.org/doc/
>
> …is confusing because it offers two combinations:
>
> Pacemaker 1.0 for Corosync 1.x
> Pacemaker 1.1 for Corosync 2.x
>
> According to the documentation, if you use Corosync 1.x you need Pacemaker 1.0, but if you use Corosync 2.x then you need Pacemaker 1.1.
>
> However, on my Centos 6.9 system, when I do 'yum install pacemaker corosync' I get the following versions:
>
> pacemaker-1.1.15-5.el6.x86_64
> corosync-1.4.7-5.el6.x86_64
>
> What's the correct answer? Does Pacemaker 1.1.15 work with Corosync 1.4.7? If so, is the documentation at ClusterLabs misleading?
>
> --
> Eric Robinson

The page actually offers a third option ... "Pacemaker 1.1 for CMAN or Corosync 1.x". That's the configuration used by CentOS 6.

However, that's still a bit misleading; the documentation set for "Pacemaker 1.1 for Corosync 2.x" is the only one that is updated, and it's mostly independent of the underlying layer, so you should prefer that set.

I plan to reorganize that page in the coming months, so I'll try to make it clearer.
-- Ken Gaillot <kgail...@redhat.com>
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> > Out of curiosity, what did I say that indicates that we're not using > > fencing? > > > > Same place you said you were new to HA and needed to learn corosync and > pacemaker to use OpenBSD. > I must have misspoken. I said I stopped using OpenBSD back around the year 2000 and switched to Linux (because of market pressure). I didn't mean to imply that I was new to HA. --Eric
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> > I must have misspoken. > > No, I had invisible tags all over my last two messages. Haha, okay. Thought I was going nuts for a moment. --Eric
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> > Out of curiosity, do the openSUSE Leap repos and packages work with SLES?
>
> I know that there are some base system differences that could cause problems, things like Leap using systemd/journald for logging while SLES is still logging via syslog-ng (IIRC)... so it's possible that you could get into problems if you mix versions. And adding the Leap repositories to SLES will probably mess things up since both deliver slightly different versions of the base system.

Good information.

> For SLES, there's now the Package Hub, which has open source packages taken from Leap and confirmed not to conflict with SLES, so you can mix a supported base system with unsupported open source packages with less risk of breaking anything:
>
> https://packagehub.suse.com/

That sounds like a possibility. We're such freaking cheapos, we have a bunch of RHEL servers, but only 1 of them has a subscription, which we keep active so we have access to the RHEL knowledge base. We never call for support. All the other servers are kept up to date using the CentOS repos, which works fine. I'm thinking of doing similar with SLES with the PackageHub repos, but maybe we'll just use Leap. I haven't installed Leap yet. The SLES installer is wonderful compared to Red Hat's. I like it a lot, especially the GUI disk partitioner with its device maps and mount maps and whatnot. If the Leap installer is like it, I will be tempted to go with it instead of SLES.

--Eric
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> Jokes (?) aside; Red Hat and SUSE both have paid teams that make sure the > HA software works well. So if you're new to HA, I strongly recommend > sticking with one of those two, and SUSE is what you mentioned. If you really > want to go to BSD or something else, I would recommend learning HA on > SUSE/RHEL and then, after you know what config works for you, migrate to > the target OS. That way you have only one set of variables at a time. > I don't know how "new" I am. I've been using HA for a decade or so. Started with heartbeat V1. Deploying a Corosync+Pacemaker+DRBD cluster is pretty much a slam dunk for me these days. However, there's certainly a lot more that I DON'T know than I DO know. --Eric
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> > Also, use fencing. Seriously, just do it. > > Yeah. Fencing is the only bit that's missing from this picture. > Out of curiosity, what did I say that indicates that we're not using fencing? --Eric
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> > I can understand how SUSE can charge for support, but not for the software itself. Corosync, Pacemaker, and DRBD are all open source.
>
> So why don't you download the open source and compile it yourself?

I've done that before, and I could if necessary. I'd rather go with the easiest option available.

--Eric
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
> You could test it for free, you just need to register
> at https://scc.suse.com/login
> After that, you have access to the SLES repos for 60 days.

What happens at the end of the trial? Does the software stop working?

I can understand how SUSE can charge for support, but not for the software itself. Corosync, Pacemaker, and DRBD are all open source.

-- Eric Robinson
[ClusterLabs] Installing on SLES 12 -- Where's the Repos?
We've been a Red Hat/CentOS shop for 10+ years and have installed Corosync+Pacemaker+DRBD dozens of times using the repositories, all for free. We are now trying out our first SLES 12 server, and I'm looking for the repos. Where the heck are they? I went looking, and all I can find is the SLES "High Availability Extension," which I must pay $700/year for? No freaking way! This is Linux we're talking about, right? There's got to be an easy way to install the cluster without paying for a subscription... right? Someone talk me off the ledge here. -- Eric Robinson
Re: [ClusterLabs] Azure Resource Agent
The license would be GPL, I suppose, whatever enthusiasts and community contributors usually do. And yes, it would be fun to know I contributed something to the repo. -- Eric Robinson > -Original Message- > From: Kristoffer Grönlund [mailto:kgronl...@suse.com] > Sent: Monday, September 18, 2017 3:10 AM > To: Eric Robinson <eric.robin...@psmnv.com>; Cluster Labs - All topics > related to open-source clustering welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Azure Resource Agent > > Eric Robinson <eric.robin...@psmnv.com> writes: > > > This is obviously beta as it currently only works with a manual failover. I > need to add some code to handle an actual node crash or power-plug test. > > > > Feedback, suggestions, improvements are welcome. If someone who > knows awk wants to clean up my azure client calls, that would be a good > place to start. > > Hi, > > Great to see an initial agent for managing IPs on Azure! First of all, I would > ask: What is your license for the code? Would you be interested in getting an > agent based on this version included in the upstream resource-agents > repository? > > Cheers, > Kristoffer > > > > > -- > > > > #!/bin/sh > > # > > # OCF parameters are as below > > # OCF_RESKEY_ip > > > > > ## > > > # > > # Initialization: > > > > : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat} > > . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs > > DEBUG_LEVEL=2 > > MY_HOSTNAME=$(hostname -s) > > SCRIPT_NAME=$(basename $0) > > > > > ## > > > # > > > > meta_data() { > > logIt "debug1: entered: meta_data()" > > cat < > > > > name="AZaddr2"> 1.0 > > > > > > Resource agent for managing IP configs in Azure. > > > > > > Short descrption/ > > > > > > > > > The > > IPv4 (dotted quad notation) example IPv4 "192.168.1.1". 
> > > > IPv4 address > default="" /> > > > > > > > > > > > > > > > > > timeout="20s" /> END > > logIt "leaving: exiting: meta_data()" > > return $OCF_SUCCESS > > } > > > > azip_query() { > > > > logIt "debug1: entered: azip_query()" > > logIt "debug1: checking to determine if an Azure ipconfig > > named > '$AZ_IPCONFIG_NAME' exists for the interface" > > logIt "debug1: executing: az network nic ip-config show > > --name > $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1" > > R=$(az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic- > name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1) > > logIt "debug2: $R" > > R2=$(echo "$R"|grep "does not exist") > > if [ -n "$R2" ]; then > > logIt "debug1: ipconfig named > > '$AZ_IPCONFIG_NAME' > does not exist" > > return $OCF_NOT_RUNNING > > else > > R2=$(echo "$R"|grep "Succeeded") > > if [ -n "$R2" ]; then > > logIt "debug1: ipconfig > > '$AZ_IPCONFIG_NAME' > exists" > > return $OCF_SUCCESS > > else > > logIt "debug1: not sure how > > this happens" > > return $OCF_ERR_GENERIC > > fi > > fi > > logIt "debug1: exiting: azip_query()" > > } > > > > azip_usage() { > > cat < > usage: $0 {start|stop|status|monitor|validate-all|meta-data} > > > > Expects to have a fully populated OCF RA-compliant environment set. > > END > > return $OCF_SUCCESS > > } > > > > azip_start() { > > > > logIt "debug1: entered: azip_start()" > > > > #--if a matching ipconfig alrea
Re: [ClusterLabs] Azure Resource Agent
Forgot to mention that it's called AZaddr and is intended to be dependent on IPaddr2 (or vice versa) and live in /usr/lib/ocf/resource.d/heartbeat. -- Eric Robinson From: Eric Robinson [mailto:eric.robin...@psmnv.com] Sent: Friday, September 15, 2017 3:56 PM To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org> Subject: [ClusterLabs] Azure Resource Agent Greetings, all -- If anyone's interested, I wrote a resource agent that works with Microsoft Azure. I'm no expert at shell scripting, so I'm certain it needs a great deal of improvement, but I've done some testing and it works with a 2-node cluster in my Azure environment. Offhand, I don't know any reason why it wouldn't work with larger clusters, too. My colocation stack looks like this: mysql -> azure_ip -> cluster_ip -> filesystem -> drbd Failover takes up to 4 minutes because it takes that long for the Azure IP address de-association and re-association to complete. None of the delay is the fault of the cluster itself. Right now the script burps a bunch of debug output to syslog, which is helpful if you feel like you're waiting forever for the cluster to failover, you can look at /var/log/messages and see that you're waiting for the Azure cloud to finish something. To eliminate the debug messages, set DEBUG_LEVEL to 0. The agent requires the Azure client to be installed and the nodes to have been logged into the cloud. It currently only works with one NIC per VM, and two ipconfigs per NIC (one of which is the floating cluster IP). This is obviously beta as it currently only works with a manual failover. I need to add some code to handle an actual node crash or power-plug test. Feedback, suggestions, improvements are welcome. 
If someone who knows awk wants to clean up my azure client calls, that would be a good place to start. [remainder of quoted script snipped; the full script is in the original post below]
[ClusterLabs] Azure Resource Agent
Greetings, all -- If anyone's interested, I wrote a resource agent that works with Microsoft Azure. I'm no expert at shell scripting, so I'm certain it needs a great deal of improvement, but I've done some testing and it works with a 2-node cluster in my Azure environment. Offhand, I don't know any reason why it wouldn't work with larger clusters, too. My colocation stack looks like this: mysql -> azure_ip -> cluster_ip -> filesystem -> drbd

Failover takes up to 4 minutes because it takes that long for the Azure IP address de-association and re-association to complete. None of the delay is the fault of the cluster itself. Right now the script burps a bunch of debug output to syslog, which is helpful: if you feel like you're waiting forever for the cluster to fail over, you can look at /var/log/messages and see that you're waiting for the Azure cloud to finish something. To eliminate the debug messages, set DEBUG_LEVEL to 0.

The agent requires the Azure client to be installed and the nodes to have been logged into the cloud. It currently only works with one NIC per VM, and two ipconfigs per NIC (one of which is the floating cluster IP). This is obviously beta, as it currently only works with a manual failover. I need to add some code to handle an actual node crash or power-plug test. Feedback, suggestions, improvements are welcome. If someone who knows awk wants to clean up my azure client calls, that would be a good place to start.

--

#!/bin/sh
#
# OCF parameters are as below
# OCF_RESKEY_ip

#######################################################################
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

DEBUG_LEVEL=2
MY_HOSTNAME=$(hostname -s)
SCRIPT_NAME=$(basename $0)

#######################################################################

meta_data() {
    logIt "debug1: entered: meta_data()"
    # NOTE: the list archive stripped the XML tags out of this heredoc;
    # it has been reconstructed here against the standard OCF metadata
    # layout from the surviving fragments.
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="AZaddr2">
<version>1.0</version>
<longdesc lang="en">
Resource agent for managing IP configs in Azure.
</longdesc>
<shortdesc lang="en">Short description</shortdesc>
<parameters>
<parameter name="ip" unique="1" required="1">
<longdesc lang="en">
The IPv4 address (dotted quad notation), for example "192.168.1.1".
</longdesc>
<shortdesc lang="en">IPv4 address</shortdesc>
<content type="string" default="" />
</parameter>
</parameters>
<actions>
<action name="start" timeout="20s" />
<action name="stop" timeout="20s" />
<action name="monitor" timeout="20s" />
<action name="meta-data" timeout="20s" />
</actions>
</resource-agent>
END
    logIt "debug1: exiting: meta_data()"
    return $OCF_SUCCESS
}

azip_query() {
    logIt "debug1: entered: azip_query()"
    logIt "debug1: checking to determine if an Azure ipconfig named '$AZ_IPCONFIG_NAME' exists for the interface"
    logIt "debug1: executing: az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1"
    R=$(az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1)
    logIt "debug2: $R"
    R2=$(echo "$R" | grep "does not exist")
    if [ -n "$R2" ]; then
        logIt "debug1: ipconfig named '$AZ_IPCONFIG_NAME' does not exist"
        return $OCF_NOT_RUNNING
    else
        R2=$(echo "$R" | grep "Succeeded")
        if [ -n "$R2" ]; then
            logIt "debug1: ipconfig '$AZ_IPCONFIG_NAME' exists"
            return $OCF_SUCCESS
        else
            logIt "debug1: not sure how this happens"
            return $OCF_ERR_GENERIC
        fi
    fi
    logIt "debug1: exiting: azip_query()"
}

azip_usage() {
    cat <<END
usage: $0 {start|stop|status|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
    return $OCF_SUCCESS
}

[remainder of script truncated by the list archive]
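On the awk question in the post above: a small hedged sketch of one possible cleanup. This helper is hypothetical (not part of the posted agent); it pulls provisioningState out of the pretty-printed, one-key-per-line JSON that `az network nic ip-config show` emits, replacing the chained greps in azip_query().

```shell
#!/bin/sh
# Hypothetical helper, not from the posted agent: extract the value of
# "provisioningState" from az CLI JSON output with a single awk pass.
az_provisioning_state() {
    # $1: raw JSON output captured from the az command
    printf '%s\n' "$1" | awk -F'"' '/"provisioningState"/ { print $4; exit }'
}

# Sample of the relevant part of the az output:
sample='{
  "name": "ipconfig2",
  "provisioningState": "Succeeded"
}'

az_provisioning_state "$sample"   # prints: Succeeded
```

azip_query() could then branch on [ "$(az_provisioning_state "$R")" = "Succeeded" ] instead of grepping the output twice.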
[ClusterLabs] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0
Problem: Under high write load, DRBD exhibits data corruption. In repeated tests over a month-long period, file corruption occurred after 700-900 GB of data had been written to the DRBD volume.

Testing Platform:
2 x Dell PowerEdge R610 servers
32GB RAM
6 x Samsung SSD 840 Pro 512GB (latest firmware)
Dell H200 JBOD Controller
SUSE Linux Enterprise Server 12 SP2 (kernel 4.4.74-92.32)
Gigabit network, 900 Mbps throughput, < 1ms latency, 0 packet loss

Initial Setup:
Create 2 RAID-0 software arrays using either mdadm or LVM
On Array 1 (sda5 through sdf5), create a DRBD replicated volume (drbd0) with an ext4 filesystem
On Array 2 (sda6 through sdf6), create an LVM logical volume with an ext4 filesystem

Procedure:
Download and build the TrimTester SSD burn-in and TRIM verification tool from Algolia (https://github.com/algolia/trimtester)
Run TrimTester against the filesystem on drbd0 and wait for corruption to occur
Run TrimTester against the non-DRBD-backed filesystem and wait for corruption to occur

Results: In multiple tests over a period of a month, TrimTester would report file corruption when run against the DRBD volume after 700-900 GB of data had been written. The error would usually appear within an hour or two. However, when running it against the non-DRBD volume on the same physical drives, no corruption would occur. We could let the burn-in run for 15+ hours and write 20+ TB of data without a problem. Results were the same with DRBD 8.4 and 9.0. We also tried disabling the TRIM-testing part of TrimTester and using it as a simple burn-in tool, just to make sure that SSD TRIM was not a factor.

Conclusion: We are aware of some controversy surrounding the Samsung SSD 8XX series drives; however, the issues related to that controversy were resolved and no longer exist as of kernel 4.2. The 840 Pro drives are confirmed to support RZAT. Also, the data corruption would only occur when writing through the DRBD layer. 
It never occurred when bypassing the DRBD layer and writing directly to the drives, so we must conclude that DRBD has a data corruption bug under high write load. However, we would be more than happy to be proved wrong. -- Eric Robinson
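The procedure above boils down to write-then-verify. The following is a stripped-down shell sketch of the same idea (this is not the Algolia tool; cksum stands in for its checksumming, and file sizes here are token) that could be pointed at both the DRBD-backed and the non-DRBD mount points:

```shell
#!/bin/sh
# Write-then-verify sketch: write files of pseudo-random data into the
# target directory, record their checksums, then re-read and compare.
# A non-zero exit means a re-read produced different bytes than were
# written -- the corruption signature described in the post.
burn_in() {
    dir=$1; count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # 16 KB of pseudo-random data per file
        dd if=/dev/urandom of="$dir/burn.$i" bs=4096 count=4 2>/dev/null
        i=$((i + 1))
    done
    # Record checksums, then re-read and compare the two manifests.
    cksum "$dir"/burn.* > "$dir/manifest"
    cksum "$dir"/burn.* > "$dir/verify"
    cmp -s "$dir/manifest" "$dir/verify"
}
```

For example, burn_in /mnt/drbd0 1000 would exercise the DRBD-backed filesystem; a real burn-in would loop this for hours and use much larger files.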
Re: [ClusterLabs] Antw: Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0
> I don't know the tool, but isn't the expectation a bit high that the tool will trim > the correct blocks through drbd->LVM/mdadm->device? Why not use the tool > on the affected devices directly? > I did, and the corruption did not occur. It only happened when writing through the DRBD layer. Also, I disabled the TRIM function of the tool and merely used it as a drive burn-in without triggering any trim commands. Same results. --Eric
[ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?
I created two nodes on Microsoft Azure, but I can't get them to join a cluster. Any thoughts?

OS: RHEL 6.9
Corosync version: 1.4.7-5.el6.x86_64
Node names: ha001a (172.28.0.4/23), ha001b (172.28.0.5/23)

The nodes are on the same subnet and can ping and ssh to each other just fine by either host name or IP address. I have configured corosync to use unicast. corosync-cfgtool looks fine...

[root@ha001b corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 172.28.0.5
status = ring 0 active with no faults

...but corosync-objctl only shows the local node...

[root@ha001b corosync]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined

...pcs status shows...

Cluster name: ha001
Stack: cman
Current DC: ha001b (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:04:33 2017
Last change: Wed Aug 23 17:51:07 2017 by root via cibadmin on ha001b
2 nodes and 0 resources configured
Online: [ ha001b ]
OFFLINE: [ ha001a ]
No resources
Daemon Status:
cman: active/disabled
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/disabled

...it shows the opposite on the other node... 
[root@ha001a ~]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined

[root@ha001a ~]# pcs status
Cluster name: ha001
Stack: cman
Current DC: ha001a (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:06:04 2017
Last change: Wed Aug 23 17:51:03 2017 by root via cibadmin on ha001a
2 nodes and 0 resources configured
Online: [ ha001a ]
OFFLINE: [ ha001b ]
No resources
Daemon Status:
cman: active/disabled
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/disabled

...here is my corosync.conf...

compatibility: whitetank
totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 172.28.0.4
        }
        member {
            memberaddr: 172.28.0.5
        }
        ringnumber: 0
        bindnetaddr: 172.28.0.0
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
}
logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

I used tcpdump and I see a lot of traffic between them on port 2224, but nothing else. Is there an issue because the bindnetaddr is 172.28.0.0 but the members have a /23 mask? -- Eric Robinson
Re: [ClusterLabs] Pacemaker in Azure
> Don't use Azure? ;) That would be my preference. But since I'm stuck with Azure (management decision) I need to come up with something. It appears there is an Azure API to make changes on-the-fly from a Linux box. Maybe I'll write a resource agent to change Azure and make IPaddr2 dependent on it. That might work? -- Eric Robinson
[ClusterLabs] Pacemaker in Azure
I deployed a couple of cluster nodes in Azure and found out right away that floating a virtual IP address between nodes does not work because Azure does not honor IP changes made from within the VMs. IP changes must be made to virtual NICs in the Azure portal itself. Anybody know of an easy way around this limitation? -- Eric Robinson
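The portal-side change can also be driven from a Linux box with the Azure CLI. A hedged, illustrative sketch (resource group, NIC, and ipconfig names are placeholders, not from any tested setup): delete the secondary ipconfig from the old node's NIC and recreate it on the new node's NIC.

```shell
#!/bin/sh
# Hedged sketch: move a "floating" secondary ipconfig between two VMs'
# NICs via the Azure CLI, i.e. the change Azure requires to be made on
# the virtual NIC rather than inside the VM. All names and the IP are
# illustrative placeholders.
move_floating_ip() {
    rg=$1; old_nic=$2; new_nic=$3; ipconfig=$4; ip=$5
    az network nic ip-config delete -g "$rg" --nic-name "$old_nic" -n "$ipconfig"
    az network nic ip-config create -g "$rg" --nic-name "$new_nic" -n "$ipconfig" \
        --private-ip-address "$ip"
}

# e.g.: move_floating_ip myRG node-a-nic node-b-nic floating-ip 172.28.0.100
```

Note that the az calls are asynchronous and slow, which is consistent with the multi-minute failover times reported later in this thread.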
Re: [ClusterLabs] Pacemaker in Azure
I agree completely. Are you offering to make those changes? Because they would expand the capability of the resource agent and would be a welcome addition. Also, full disclosure, I need to have something in place by the weekend, lol. From: Ken Gaillot <kgail...@redhat.com> Sent: Thursday, August 24, 2017 4:45:32 PM To: Cluster Labs - All topics related to open-source clustering welcomed Subject: Re: [ClusterLabs] Pacemaker in Azure That would definitely be of wider interest. I could see modifying the IPaddr2 RA to take some new arguments for AWS/Azure parameters, and if those are configured, it would do the appropriate API requests. On Thu, 2017-08-24 at 23:27 +, Eric Robinson wrote: > Leon -- I will pay you one trillion samolians for that resource agent! > Any way we can get our hands on a copy? > > > > -- > Eric Robinson > > > > From: Leon Steffens [mailto:l...@steffensonline.com] > Sent: Thursday, August 24, 2017 3:48 PM > To: Cluster Labs - All topics related to open-source clustering > welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Pacemaker in Azure > > > > That's what we did in AWS. The IPaddr2 resource agent does an arp > broadcast after changing the local IP, but this does not work in AWS > (probably for the same reasons as Azure). > > > > > We created our own OCF resource agent that uses the Amazon APIs to > move the IP in AWS land and made that dependent on the IPaddr2 > resource, and it worked fine. > > > > Leon Steffens > > > > > On Fri, Aug 25, 2017 at 8:34 AM, Eric Robinson > <eric.robin...@psmnv.com> wrote: > > > Don't use Azure? ;) > > That would be my preference. But since I'm stuck with Azure > (management decision) I need to come up with something. It > appears there is an Azure API to make changes on-the-fly from > a Linux box. Maybe I'll write a resource agent to change Azure > and make IPaddr2 dependent on it. That might work? 
> > -- > Eric Robinson -- Ken Gaillot <kgail...@redhat.com>
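The wiring discussed in this thread (a custom cloud agent with IPaddr2 dependent on it) can be expressed with ordinary constraints. An illustrative, untested pcs sketch; the resource names, agent name (AZaddr), and addresses are placeholders, not a published agent:

```shell
pcs resource create azure_ip ocf:heartbeat:AZaddr ip=172.28.0.100
pcs resource create cluster_ip ocf:heartbeat:IPaddr2 ip=172.28.0.100 cidr_netmask=23
pcs constraint colocation add cluster_ip with azure_ip INFINITY
pcs constraint order azure_ip then cluster_ip
```

The order constraint makes the cloud-side move complete before the local IP is brought up, which matches the approach Leon describes for AWS.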
Re: [ClusterLabs] Pacemaker in Azure
Oh, okay. I thought you meant some different ones. -- Eric Robinson Chief Information Officer Physician Select Management, LLC 775.885.2211 x 112 -Original Message- From: Kristoffer Grönlund [mailto:kgronl...@suse.com] Sent: Friday, August 25, 2017 9:56 AM To: Eric Robinson <eric.robin...@psmnv.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org> Subject: RE: [ClusterLabs] Pacemaker in Azure Eric Robinson <eric.robin...@psmnv.com> writes: > Hi Kristoffer -- > > If you would be willing to share your AWS ip control agent(s), I think those > would be very helpful to us and the community at large. I'll be happy to > share whatever we come up with in terms of an Azure agent when we're all done. I meant the agents that are in resource-agents already: https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/awsvip https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/awseip https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/aws-vpc-route53 You'll probably also be interested in fencing: There are agents for fencing both on AWS and Azure in the fence-agents repository. Cheers, Kristoffer > > -- > Eric Robinson > > -Original Message- > From: Kristoffer Grönlund [mailto:kgronl...@suse.com] > Sent: Friday, August 25, 2017 3:16 AM > To: Eric Robinson <eric.robin...@psmnv.com>; Cluster Labs - All topics > related to open-source clustering welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Pacemaker in Azure > > Eric Robinson <eric.robin...@psmnv.com> writes: > >> I deployed a couple of cluster nodes in Azure and found out right away that >> floating a virtual IP address between nodes does not work because Azure does >> not honor IP changes made from within the VMs. IP changes must be made to >> virtual NICs in the Azure portal itself. Anybody know of an easy way around >> this limitation? > > You will need a custom IP control agent for Azure. 
We have a series of agents > for controlling IP addresses and domain names in AWS, but there is no agent > for Azure IP control yet. (At least as far as I am aware.) > > Cheers, > Kristoffer > >> >> -- >> Eric Robinson > > -- > // Kristoffer Grönlund > // kgronl...@suse.com -- // Kristoffer Grönlund // kgronl...@suse.com
Re: [ClusterLabs] Pacemaker in Azure
Thanks. Leon sent me the same one earlier but I hadn't mentioned it yet (just got it a short while ago). I'll be able to use it as a template to build one for Azure. I have already installed the Azure CLI and it is working from my Linux cluster nodes, so I'm maybe a third of the way there. -- Eric Robinson > -Original Message- > From: Oyvind Albrigtsen [mailto:oalbr...@redhat.com] > Sent: Friday, August 25, 2017 12:17 AM > To: Cluster Labs - All topics related to open-source clustering welcomed > <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Pacemaker in Azure > > There's the awsvip agent that can handle secondary private IP addresses this > way (to be used with order/colocation constraints with IPaddr2). > > https://github.com/ClusterLabs/resource- > agents/blob/master/heartbeat/awsvip > > There's also the awseip for Elastic IPs that can assign your Elastic IP to > hosts or > secondary private IPs. > > On 25/08/17 10:13 +1000, Leon Steffens wrote: > >Unfortunately I can't post the full resource agent here. > > > >In our search for solutions we did find a resource agent for managing > >AWS Elastic IPs: > >https://github.com/moomindani/aws-eip-resource- > agent/blob/master/eip. > >This was not what we wanted, but it will give you an idea of how it can > work. > > > >Our script manages secondary private IPs by using: > > > >aws ec2 assign-private-ip-addresses > >aws ec2 unassign-private-ip-addresses > >aws ec2 describe-network-interfaces > > > > > >There are a few things to consider: > >* The AWS call to assign IPs to an EC2 instance is asynchronous (or it > >was the last time I checked), so you have to wait a bit (or poll > >AWS/Azure until the IP is ready). > >* The IP change is slower than a normal VIP change on the machine, so > >expect a slightly longer outage. 
> > > > > >Leon
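Leon's point about the asynchronous AWS call generalizes: after the cloud-side change, the agent should poll until the cloud reports the IP ready rather than assume it happened. A minimal retry helper (generic sketch, not taken from any posted agent):

```shell
#!/bin/sh
# Generic retry helper: run a check command until it succeeds or the
# attempts run out. Returns 0 on success, 1 on timeout.
wait_until() {
    tries=$1; shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# e.g.: wait_until 30 ip_is_assigned, where ip_is_assigned is a wrapper
# around "aws ec2 describe-network-interfaces" (or the Azure equivalent)
# that exits 0 once the address shows up.
```

A monitor or start action built this way fails cleanly after a bounded wait instead of reporting success before the cloud has caught up.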
Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?
I have a BIG correction. If you follow the instructions titled, "Pacemaker 1.1 for Corosync 2.x," and NOT the ones entitled, "Pacemaker 1.1 for CMAN or Corosync 1.x," guess what? It installs cman anyway, and you spend a couple of days wondering why none of your changes to corosync.conf seem to be working. -- Eric Robinson -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Tuesday, August 22, 2017 11:52 PM To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; kgail...@redhat.com Subject: Re: [ClusterLabs] ClusterLabs.Org Documentation Problem? > Thanks for the reply. Yes, it's a bit confusing. I did end up using the > documentation for Corosync 2.X since that seemed newer, but it also assumed > CentOS/RHEL7 and systemd-based commands. It also incorporates cman, pcsd, > psmisc, and policycoreutils-python, which are all new to me. If there is > anything I can do to assist with getting the documentation cleaned up, I'd be > more than glad to help. Just a small correction. Documentation shouldn't incorporate cman. Cman was used with corosync 1.x as a configuration layer and (more important) quorum provider. With Corosync 2.x the quorum provider is already in corosync, so no need for cman. > > -- > Eric Robinson > > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Tuesday, August 22, 2017 2:08 PM > To: Cluster Labs - All topics related to open-source clustering > welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] ClusterLabs.Org Documentation Problem? 
> > On Tue, 2017-08-22 at 19:40 +, Eric Robinson wrote: >> The documentation located here… >> >> >> >> http://clusterlabs.org/doc/ >> >> >> >> …is confusing because it offers two combinations: >> >> >> >> Pacemaker 1.0 for Corosync 1.x >> >> Pacemaker 1.1 for Corosync 2.x >> >> >> >> According to the documentation, if you use Corosync 1.x you need >> Pacemaker 1.0, but if you use Corosync 2.x then you need Pacemaker >> 1.1. >> >> >> >> However, on my Centos 6.9 system, when I do ‘yum install pacemaker >> corosync” I get the following versions: >> >> >> >> pacemaker-1.1.15-5.el6.x86_64 >> >> corosync-1.4.7-5.el6.x86_64 >> >> >> >> What’s the correct answer? Does Pacemaker 1.1.15 work with Corosync >> 1.4.7? If so, is the documentation at ClusterLabs misleading? >> >> >> >> -- >> Eric Robinson > > The page actually offers a third option ... "Pacemaker 1.1 for CMAN or > Corosync 1.x". That's the configuration used by CentOS 6. > > However, that's still a bit misleading; the documentation set for "Pacemaker > 1.1 for Corosync 2.x" is the only one that is updated, and it's mostly > independent of the underlying layer, so you should prefer that set. > > I plan to reorganize that page in the coming months, so I'll try to make it > clearer. 
> > -- > Ken Gaillot <kgail...@redhat.com>
Re: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?
I figured out the cause. CMAN got installed by yum, and so none of my changes to corosync.conf had any effect, including the udpu directive. Now I just have to figure out how to enable unicast in cman. -- Eric Robinson

From: Eric Robinson [mailto:eric.robin...@psmnv.com]
Sent: Wednesday, August 23, 2017 3:16 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?

I created two nodes on Microsoft Azure, but I can't get them to join a cluster. Any thoughts?

OS: RHEL 6.9
Corosync version: 1.4.7-5.el6.x86_64
Node names: ha001a (172.28.0.4/23), ha001b (172.28.0.5/23)

The nodes are on the same subnet and can ping and ssh to each other just fine by either host name or IP address. I have configured corosync to use unicast. corosync-cfgtool looks fine...

[root@ha001b corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 172.28.0.5
        status  = ring 0 active with no faults

...but corosync-objctl only shows the local node...

[root@ha001b corosync]# corosync-objctl | grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined

...pcs status shows...

Cluster name: ha001
Stack: cman
Current DC: ha001b (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:04:33 2017
Last change: Wed Aug 23 17:51:07 2017 by root via cibadmin on ha001b
2 nodes and 0 resources configured
Online: [ ha001b ]
OFFLINE: [ ha001a ]
No resources
Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...it shows the opposite on the other node...

[root@ha001a ~]# corosync-objctl | grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined

[root@ha001a ~]# pcs status
Cluster name: ha001
Stack: cman
Current DC: ha001a (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:06:04 2017
Last change: Wed Aug 23 17:51:03 2017 by root via cibadmin on ha001a
2 nodes and 0 resources configured
Online: [ ha001a ]
OFFLINE: [ ha001b ]
No resources
Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...here is my corosync.conf...

compatibility: whitetank
totem {
        version: 2
        secauth: off
        interface {
                member {
                        memberaddr: 172.28.0.4
                }
                member {
                        memberaddr: 172.28.0.5
                }
                ringnumber: 0
                bindnetaddr: 172.28.0.0
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
}
logging {
        fileline: off
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

I used tcpdump and I see a lot of traffic between them on port 2224, but nothing else. Is there an issue because the bindnetaddr is 172.28.0.0 but the members have a /23 mask? -- Eric Robinson

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
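Whether the /23 member addresses actually fall inside the network implied by bindnetaddr can be checked mechanically. A minimal sketch using the addresses from the config above (corosync derives the totem network from bindnetaddr and the interface's netmask, so with a /23 mask of 255.255.254.0 both members resolve to the 172.28.0.0 network):

```shell
# Confirm each totem member address lands in the same /23 network
# as bindnetaddr 172.28.0.0 (netmask 255.255.254.0).
ip_to_int() {
  # convert a dotted-quad IPv4 address to a 32-bit integer
  set -- $(echo "$1" | tr '.' ' ')
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

mask=$(( (0xFFFFFFFF << 9) & 0xFFFFFFFF ))   # /23 -> 255.255.254.0
net=$(( $(ip_to_int 172.28.0.0) & mask ))

for member in 172.28.0.4 172.28.0.5; do
  if [ $(( $(ip_to_int "$member") & mask )) -eq "$net" ]; then
    echo "$member: in bindnetaddr network"
  else
    echo "$member: OUTSIDE bindnetaddr network"
  fi
done
```

Both members land in the bindnetaddr network here, so the /23 mask by itself is not the problem; on Azure the usual suspect is that multicast is not carried at all, which is exactly what the udpu transport is meant to avoid.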
Re: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?
I got it.

From: Eric Robinson [mailto:eric.robin...@psmnv.com]
Sent: Wednesday, August 23, 2017 6:51 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?

I figured out the cause. CMAN got installed by yum, and so none of my changes to corosync.conf had any effect, including the udpu directive. Now I just have to figure out how to enable unicast in cman. -- Eric Robinson

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
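For reference, with the CMAN stack corosync.conf is ignored and transport selection lives in /etc/cluster/cluster.conf. A minimal sketch of the relevant fragment (node names taken from the thread, the rest are placeholder values; the transport attribute on the cman element is what switches totem to UDP unicast):

```xml
<cluster name="ha001" config_version="2">
  <!-- transport="udpu" makes cman configure corosync for UDP unicast -->
  <cman transport="udpu" expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="ha001a" nodeid="1"/>
    <clusternode name="ha001b" nodeid="2"/>
  </clusternodes>
</cluster>
```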
[ClusterLabs] Cannot connect to the drbdmanaged process using DBus
I'm sure someone has seen this before. What does it mean?

ha11a:~ # drbdmanage init 198.51.100.65
You are going to initialize a new drbdmanage cluster.
CAUTION! Note that:
  * Any previous drbdmanage cluster information may be removed
  * Any remaining resources managed by a previous drbdmanage installation that still exist on this system will no longer be managed by drbdmanage
Confirm: yes/no: yes
Empty drbdmanage control volume initialized on '/dev/drbd0'.
Empty drbdmanage control volume initialized on '/dev/drbd1'.
Error: Cannot connect to the drbdmanaged process using DBus
The DBus subsystem returned the following error description:
org.freedesktop.DBus.Error.Spawn.ChildExited: Launch helper exited with unknown return code 1

I'm using...

drbd-9.0.9+git.bffac0d9-72.1.x86_64
drbd-kmp-default-9.0.9+git.bffac0d9_k4.4.76_1-72.1.x86_64
drbdmanage-0.99.5-5.1.noarch
drbd-utils-9.0.0-56.1.x86_64

--Eric

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3
I installed corosync 2.4.3 and pacemaker 1.1.17 from the openSUSE Leap 42.3 repos, but I can't find pcs or pcsd. Anybody know where to download them from? --Eric ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3
Thanks much. I am experienced with crmsh because I have been using it for years, but I recently tried pcs and I really like the way it handles constraints. Would be nice if it worked on openSUSE. Oh well. --Eric

From: Eric Ren [mailto:z...@suse.com]
Sent: Monday, November 06, 2017 10:28 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>
Subject: Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3

Hi,

On 11/07/2017 05:35 AM, Eric Robinson wrote:
I installed corosync 2.4.3 and pacemaker 1.1.17 from the openSUSE Leap 42.3 repos, but I can't find pcs or pcsd. Anybody know where to download them from?

openSUSE/SUSE uses the CLI tool "crmsh" and the web UI "hawk" to manage HA clusters. Please see the "quick start" doc [1] and the other HA docs here [2].

[1] https://www.suse.com/documentation/sle-ha-12/install-quick/data/install-quick.html
[2] https://www.suse.com/documentation/sle-ha-12/index.html

Eric

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
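On openSUSE Leap the SUSE-side tooling mentioned above can be pulled in with zypper. A sketch, assuming the crmsh and hawk2 packages are available in the installed release's repos (package and service names are the ones used in the openSUSE repos, not confirmed in the thread):

```shell
# install the SUSE cluster shell and web UI
zypper install crmsh hawk2
# hawk's web UI listens on port 7630 once the service is up
systemctl enable --now hawk.service
```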
Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3
> Which aspects of its constraints handling do you like, and why? I'm curious,
> since I wasn't aware that it was significantly different from crmsh in this
> respect.
>

Well, to be fair, in the past I have always configured my clusters by using 'crm configure edit' and building the config in a full-screen editor, and then saving it. Using that method, I have often had trouble getting my colocations and orderings to work properly. You have to use parentheses and brackets to group things, and when you're done and save it, the cluster sometimes re-writes your statements for you. When you edit the config again, it looks different than what you typed and the resource dependencies are not what you wanted. It's very frustrating.

With pcs, the colocation syntax 'constraint add resource1 with resource2' and the order syntax 'resource2 then resource1' are very intuitive, and cumulative. I always get exactly what I want. The first time I configured a cluster with pcs I fell in love with it.

--Eric

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
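For readers unfamiliar with it, the pcs forms being described look roughly like this (p_app and p_vip are placeholder resource names; each command adds one constraint, which is what makes the approach cumulative):

```shell
# colocation: run the app where the VIP runs
pcs constraint colocation add p_app with p_vip INFINITY
# ordering: bring the VIP up first, then the app
pcs constraint order start p_vip then start p_app
# review the accumulated constraints with their generated ids
pcs constraint --full
```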
Re: [ClusterLabs] One volume is trimmable but the other is not?
> > I sent this to the drbd list too, but it's possible that someone here
> > may know.
> >
> > This is a WEIRD one.
> >
> > Why would one drbd volume be trimmable and the other one not?
>
> iirc drbd stores some of the config in the meta-data as well - like e.g. some
> block-size I remember in particular - and that doesn't just depend on the
> content of the current config-files but as well on the history (like already
> connected and to whom).
> Don't know if that helps in particular - just saying taking a look at differences
> on the replication-partners might be worth while.
>
> I know that it shows the maximum discard block-size 0 on one of the drbds
> but that might be a configuration passed down by the lvm layer as well.
> (provisioning_mode?) So searching for differences in the volume-groups or
> volumes might make sense as well.
>
> Regards,
> Klaus

Thanks for your reply, Klaus. However, I don't think it's possible that anything could be getting "passed down" from LVM because the drbd devices are built directly on top of the raid arrays, with no LVM layer between...

{
    on ha11a {
        device    /dev/drbd1;
        disk      /dev/md3;
        address   198.51.100.65:7789;
        meta-disk internal;
    }
    on ha11b {
        device    /dev/drbd1;
        disk      /dev/md3;
        address   198.51.100.66:7789;
        meta-disk internal;
    }
}

--Eric

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] One volume is trimmable but the other is not?
I sent this to the drbd list too, but it's possible that someone here may know.

This is a WEIRD one.

Why would one drbd volume be trimmable and the other one not? Here you can see me issuing the trim command against two different filesystems. It works on one but fails on the other.

ha11a:~ # fstrim -v /ha01_mysql
/ha01_mysql: 0 B (0 bytes) trimmed
ha11a:~ # fstrim -v /ha02_mysql
fstrim: /ha02_mysql: the discard operation is not supported

Both filesystems are on the same server, two different drbd devices on two different mdraid arrays, but the same underlying physical drives. Yet it can be seen that discard is enabled on drbd0 but not on drbd1...

NAME                            DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                                    0      512B       4G         1
├─sda1                                 0      512B       4G         1
│ └─md0                                0      128K     256M         0
├─sda2                                 0      512B       4G         1
│ └─md1                                0      128K     256M         0
├─sda3                                 0      512B       4G         1
├─sda4                                 0      512B       4G         1
├─sda5                                 0      512B       4G         1
│ └─md2                                0        1M     256M         0
│   └─drbd0                            0        1M     128M         0
│     └─vg_on_drbd0-lv_on_drbd0   393216        1M     128M         0
└─sda6                                 0      512B       4G         1
  └─md3                                0        1M     256M         0
    └─drbd1                            0        0B       0B         0
      └─vg_on_drbd1-lv_on_drbd1        0        0B       0B         0

The filesystems are set up the same. (Note that I do not want automatic discard so that option is not enabled on either filesystem, but the problem is not the filesystem, since that relies on drbd, and you can see from lsblk that the drbd volume is the problem.)

ha11a:~ # mount|grep drbd
/dev/mapper/vg_on_drbd1-lv_on_drbd1 on /ha02_mysql type ext4 (rw,relatime,stripe=160,data=ordered)
/dev/mapper/vg_on_drbd0-lv_on_drbd0 on /ha01_mysql type ext4 (rw,relatime,stripe=160,data=ordered)

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
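A quick way to spot the non-trimmable device in output like the above is to scan the DISC-MAX column: any device reporting 0B advertises no discard support. A sketch over a saved copy of the output (plain device names here; real `lsblk -D` output also carries tree-drawing characters, which this simple field split ignores):

```shell
# flag block devices whose DISC-MAX is 0B (discard unsupported)
cat > /tmp/lsblk-D.out <<'EOF'
NAME  DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
md2   0        1M        256M     0
drbd0 0        1M        128M     0
md3   0        1M        256M     0
drbd1 0        0B        0B       0
EOF
awk 'NR > 1 && $4 == "0B" { print $1 }' /tmp/lsblk-D.out   # prints: drbd1
```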
[ClusterLabs] Does CMAN Still Not Support Multiple Corosync Rings?
General question. I tried to set up a cman + corosync + pacemaker cluster using two corosync rings. When I start the cluster, everything works fine, except when I do a 'corosync-cfgtool -s' it only shows one ring. I tried manually editing the /etc/cluster/cluster.conf file adding two sections, but then cman complained that I didn't have a multicast address specified, even though I did. I tried editing the /etc/corosync/corosync.conf file, and then I could get two rings, but the nodes would not both join the cluster. Bah! I did some reading and saw that cman didn't support multiple rings years ago. Did it never get updated? ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Does CMAN Still Not Support Multiple Corosync Rings?
Thanks for the suggestion everyone. I'll give that a try. > -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Monday, February 12, 2018 8:49 AM > To: Cluster Labs - All topics related to open-source clustering welcomed > <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com> > Subject: Re: [ClusterLabs] Does CMAN Still Not Support Multipe CoroSync > Rings? > > Eric, > > > General question. I tried to set up a cman + corosync + pacemaker > > cluster using two corosync rings. When I start the cluster, everything > > works fine, except when I do a 'corosync-cfgtool -s' it only shows one > > ring. I tried manually editing the /etc/cluster/cluster.conf file > > adding two > > AFAIK cluster.conf should be edited so altname is used. So something like in > this example: > https://access.redhat.com/documentation/en- > us/red_hat_enterprise_linux/6/html/cluster_administration/s1-config-rrp- > cli-ca > > I don't think you have to add altmulticast. > > Honza > > sections, but then cman complained that I didn't have a multicast address > specified, even though I did. I tried editing the /etc/corosdync/corosync.conf > file, and then I could get two rings, but the nodes would not both join the > cluster. Bah! I did some reading and saw that cman didn't support multiple > rings years ago. Did it never get updated? > > > > [sig] > > > > > > > > > > ___ > > Users mailing list: Users@clusterlabs.org > > http://lists.clusterlabs.org/mailman/listinfo/users > > > > Project Home: http://www.clusterlabs.org Getting started: > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
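The approach in the Red Hat document Jan links boils down to adding an altname per node in cluster.conf. A sketch under the assumption of two nodes with a second NIC (host names are placeholders; each altname must resolve to the node's address on the second ring's network):

```xml
<clusternodes>
  <clusternode name="node1" nodeid="1">
    <!-- the second totem ring runs over this alternate name/interface -->
    <altname name="node1-ring1"/>
    <fence/>
  </clusternode>
  <clusternode name="node2" nodeid="2">
    <altname name="node2-ring1"/>
    <fence/>
  </clusternode>
</clusternodes>
```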
Re: [ClusterLabs] Does CMAN Still Not Support Multiple Corosync Rings?
> > Thanks for the suggestion everyone. I'll give that a try.
>
> Sorry, I'm late on this, but I wrote a quick start doc describing this (among other things) some time ago. See the following chapter:
>
> https://clusterlabs.github.io/PAF/Quick_Start-CentOS-6.html#cluster-creation

I scanned through that page but I did not see where it talks about setting up multiple corosync rings.

--Eric

___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Why Won't Resources Move?
I have what seems to be a healthy cluster, but I can't get resources to move.

Here's what's installed...

[root@001db01a cluster]# yum list installed|egrep "pacem|coro"
corosync.x86_64                  2.4.3-2.el7_5.1    @updates
corosynclib.x86_64               2.4.3-2.el7_5.1    @updates
pacemaker.x86_64                 1.1.18-11.el7_5.3  @updates
pacemaker-cli.x86_64             1.1.18-11.el7_5.3  @updates
pacemaker-cluster-libs.x86_64    1.1.18-11.el7_5.3  @updates
pacemaker-libs.x86_64            1.1.18-11.el7_5.3  @updates

Cluster status looks good...

[root@001db01b cluster]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Aug 1 03:44:47 2018
Last change: Wed Aug 1 03:22:18 2018 by root via cibadmin on 001db01a

2 nodes configured
11 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

 p_vip_clust01  (ocf::heartbeat:IPaddr2):       Started 001db01b
 p_azip_clust01 (ocf::heartbeat:AZaddr2):       Started 001db01b
 Master/Slave Set: ms_drbd0 [p_drbd0]
     Masters: [ 001db01b ]
     Slaves: [ 001db01a ]
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ 001db01b ]
     Slaves: [ 001db01a ]
 p_fs_clust01   (ocf::heartbeat:Filesystem):    Started 001db01b
 p_fs_clust02   (ocf::heartbeat:Filesystem):    Started 001db01b
 p_vip_clust02  (ocf::heartbeat:IPaddr2):       Started 001db01b
 p_azip_clust02 (ocf::heartbeat:AZaddr2):       Started 001db01b
 p_mysql_001    (lsb:mysql_001):        Started 001db01b

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Constraints look like this...

[root@001db01b cluster]# pcs constraint
Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
Colocation Constraints:
  p_azip_clust01 with p_vip_clust01 (score:INFINITY)
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_azip_clust02 with p_vip_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

But when I issue a move command, nothing at all happens.

I see this in the log on one node...

Aug 01 03:21:57 [16550] 001db01b cib: info: cib_perform_op: ++ /cib/configuration/constraints:
Aug 01 03:21:57 [16550] 001db01b cib: info: cib_process_request: Completed cib_modify operation for section constraints: OK (rc=0, origin=001db01a/crm_resource/4, version=0.138.0)
Aug 01 03:21:57 [16555] 001db01b crmd: info: abort_transition_graph: Transition aborted by rsc_location.cli-prefer-ms_drbd0 'create': Configuration change | cib=0.138.0 source=te_update_diff:456 path=/cib/configuration/constraints complete=true

And I see this in the log on the other node...

notice: p_drbd1_monitor_6:69196:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]

Any thoughts?

--Eric

___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
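For what it's worth, the cli-prefer constraint visible in the log is what `pcs resource move` creates under the hood; it can be inspected and removed like this (resource name taken from the log above):

```shell
# move creates a temporary location constraint named cli-prefer-<resource>
pcs resource move ms_drbd0 --master
# the generated constraint shows up under Location Constraints
pcs constraint --full
# drop it again so the resource is free to move back
pcs resource clear ms_drbd0
```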
Re: [ClusterLabs] Why Won't Resources Move?
> -Original Message- > From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken Gaillot > Sent: Wednesday, August 01, 2018 2:17 PM > To: Cluster Labs - All topics related to open-source clustering welcomed > > Subject: Re: [ClusterLabs] Why Won't Resources Move? > > On Wed, 2018-08-01 at 03:49 +, Eric Robinson wrote: > > I have what seems to be a healthy cluster, but I can’t get resources > > to move. > > > > Here’s what’s installed… > > > > [root@001db01a cluster]# yum list installed|egrep "pacem|coro" > > corosync.x86_64 2.4.3-2.el7_5.1 @updates > > corosynclib.x86_64 2.4.3-2.el7_5.1 @updates > > pacemaker.x86_64 1.1.18-11.el7_5.3 @updates > > pacemaker-cli.x86_64 1.1.18-11.el7_5.3 @updates > > pacemaker-cluster-libs.x86_64 1.1.18-11.el7_5.3 @updates > > pacemaker-libs.x86_64 1.1.18-11.el7_5.3 @updates > > > > Cluster status looks good… > > > > [root@001db01b cluster]# pcs status > > Cluster name: 001db01ab > > Stack: corosync > > Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) - > > partition with quorum Last updated: Wed Aug 1 03:44:47 2018 Last > > change: Wed Aug 1 03:22:18 2018 by root via cibadmin on 001db01a > > > > 2 nodes configured > > 11 resources configured > > > > Online: [ 001db01a 001db01b ] > > > > Full list of resources: > > > > p_vip_clust01 (ocf::heartbeat:IPaddr2): Started 001db01b > > p_azip_clust01 (ocf::heartbeat:AZaddr2): Started 001db01b > > Master/Slave Set: ms_drbd0 [p_drbd0] > > Masters: [ 001db01b ] > > Slaves: [ 001db01a ] > > Master/Slave Set: ms_drbd1 [p_drbd1] > > Masters: [ 001db01b ] > > Slaves: [ 001db01a ] > > p_fs_clust01 (ocf::heartbeat:Filesystem): Started 001db01b > > p_fs_clust02 (ocf::heartbeat:Filesystem): Started 001db01b > > p_vip_clust02 (ocf::heartbeat:IPaddr2): Started 001db01b > > p_azip_clust02 (ocf::heartbeat:AZaddr2): Started 001db01b > > p_mysql_001 (lsb:mysql_001): Started 001db01b > > > > Daemon Status: > > corosync: active/disabled > > pacemaker: active/disabled > > pcsd: 
active/enabled > > > > Constraints look like this… > > > > [root@001db01b cluster]# pcs constraint Location Constraints: > > Ordering Constraints: > > promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory) > > promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory) > > start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory) > > start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory) > > start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory) > > start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory) > > start p_vip_clust01 then start p_mysql_001 (kind:Mandatory) > > Colocation Constraints: > > p_azip_clust01 with p_vip_clust01 (score:INFINITY) > > p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master) > > p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master) > > p_vip_clust01 with p_fs_clust01 (score:INFINITY) > > p_vip_clust02 with p_fs_clust02 (score:INFINITY) > > p_azip_clust02 with p_vip_clust02 (score:INFINITY) > > p_mysql_001 with p_vip_clust01 (score:INFINITY) Ticket Constraints: > > > > But when I issue a move command, nothing at all happens. 
> > I see this in the log on one node...
> >
> > Aug 01 03:21:57 [16550] 001db01b cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="cli-prefer-ms_drbd0" rsc="ms_drbd0" role="Started" node="001db01a" score="INFINITY"/>
> > Aug 01 03:21:57 [16550] 001db01b cib: info: cib_process_request: Completed cib_modify operation for section constraints: OK (rc=0, origin=001db01a/crm_resource/4, version=0.138.0)
> > Aug 01 03:21:57 [16555] 001db01b crmd: info: abort_transition_graph: Transition aborted by rsc_location.cli-prefer-ms_drbd0 'create': Configuration change | cib=0.138.0 source=te_update_diff:456 path=/cib/configuration/constraints complete=true
> >
> > And I see this in the log on the other node...
> >
> > notice: p_drbd1_monitor_6:69196:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]

The message likely came from the resource agent calling crm_attribute to set a node attribute. That message usually means the cluster isn't running on that node, so it's highly suspect. The cib might have crashed, which should be in the log as well. I'd look into that first.
Re: [ClusterLabs] Why Won't Resources Move?
> > The message likely came from the resource agent calling crm_attribute
> > to set a node attribute. That message usually means the cluster isn't
> > running on that node, so it's highly suspect. The cib might have
> > crashed, which should be in the log as well. I'd look into that first.
>
> I rebooted the server and afterwards I'm still getting tons of these...
>
> Aug 2 01:43:40 001db01a drbd(p_drbd1)[18628]: ERROR: ha02_mysql: Called /usr/sbin/crm_master -Q -l reboot -v 1
> Aug 2 01:43:40 001db01a drbd(p_drbd0)[18627]: ERROR: ha01_mysql: Called /usr/sbin/crm_master -Q -l reboot -v 1
> Aug 2 01:43:40 001db01a drbd(p_drbd0)[18627]: ERROR: ha01_mysql: Exit code 107
> Aug 2 01:43:40 001db01a drbd(p_drbd1)[18628]: ERROR: ha02_mysql: Exit code 107
> Aug 2 01:43:40 001db01a drbd(p_drbd0)[18627]: ERROR: ha01_mysql: Command output:
> Aug 2 01:43:40 001db01a drbd(p_drbd1)[18628]: ERROR: ha02_mysql: Command output:
> Aug 2 01:43:40 001db01a lrmd[2025]: notice: p_drbd0_monitor_6:18627:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]
> Aug 2 01:43:40 001db01a lrmd[2025]: notice: p_drbd1_monitor_6:18628:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]

Ken, ironically, while researching this problem, I ran across the same question being asked back in November of 2017, and you made the same comment back then.

https://lists.clusterlabs.org/pipermail/users/2017-November/013975.html

And the solution turned out to be the same for me as it was for that guy. On the node where I was getting the errors, SELinux was enforcing. I set it to permissive and the errors went away.

--Eric

___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
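The fix described above amounts to the standard RHEL/CentOS SELinux commands (the sed edit makes the change persist across reboots):

```shell
getenforce      # Enforcing on the affected node
setenforce 0    # switch to permissive immediately
# make the change permanent
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
```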
Re: [ClusterLabs] Antw: Re: Why Won't Resources Move?
> Hi! > > I'm not familiar with Redhat, but is tis normal?: > > > > corosync: active/disabled > > > pacemaker: active/disabled > > Regards, > Ulrich That's the default after a new install. I had not enabled them to start automatically yet. > > >>> Eric Robinson schrieb am 02.08.2018 um > >>> 03:44 in > Nachricht > rd03.prod.outlook.com> > > >> -Original Message- > >> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken > Gaillot > >> Sent: Wednesday, August 01, 2018 2:17 PM > >> To: Cluster Labs - All topics related to open-source clustering > >> welcomed > >> Subject: Re: [ClusterLabs] Why Won't Resources Move? > >> > >> On Wed, 2018-08-01 at 03:49 +, Eric Robinson wrote: > >> > I have what seems to be a healthy cluster, but I can’t get > >> > resources to move. > >> > > >> > Here’s what’s installed… > >> > > >> > [root@001db01a cluster]# yum list installed|egrep "pacem|coro" > >> > corosync.x86_64 2.4.3-2.el7_5.1 @updates > >> > corosynclib.x86_64 2.4.3-2.el7_5.1 @updates > >> > pacemaker.x86_64 1.1.18-11.el7_5.3 @updates > >> > pacemaker-cli.x86_64 1.1.18-11.el7_5.3 @updates > >> > pacemaker-cluster-libs.x86_641.1.18-11.el7_5.3 @updates > >> > pacemaker-libs.x86_641.1.18-11.el7_5.3 @updates > >> > > >> > Cluster status looks good… > >> > > >> > [root@001db01b cluster]# pcs status Cluster name: 001db01ab > >> > Stack: corosync > >> > Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) - > >> > partition with quorum Last updated: Wed Aug 1 03:44:47 2018 Last > >> > change: Wed Aug 1 03:22:18 2018 by root via cibadmin on 001db01a > >> > > >> > 2 nodes configured > >> > 11 resources configured > >> > > >> > Online: [ 001db01a 001db01b ] > >> > > >> > Full list of resources: > >> > > >> > p_vip_clust01 (ocf::heartbeat:IPaddr2): Started 001db01b > >> > p_azip_clust01 (ocf::heartbeat:AZaddr2): Started 001db01b > >> > Master/Slave Set: ms_drbd0 [p_drbd0] > >> > Masters: [ 001db01b ] > >> > Slaves: [ 001db01a ] > >> > Master/Slave Set: ms_drbd1 
[p_drbd1] > >> > Masters: [ 001db01b ] > >> > Slaves: [ 001db01a ] > >> > p_fs_clust01 (ocf::heartbeat:Filesystem):Started 001db01b > >> > p_fs_clust02 (ocf::heartbeat:Filesystem):Started 001db01b > >> > p_vip_clust02 (ocf::heartbeat:IPaddr2): Started 001db01b > >> > p_azip_clust02 (ocf::heartbeat:AZaddr2): Started 001db01b > >> > p_mysql_001(lsb:mysql_001):Started 001db01b > >> > > >> > Daemon Status: > >> > corosync: active/disabled > >> > pacemaker: active/disabled > >> > pcsd: active/enabled > >> > > >> > Constraints look like this… > >> > > >> > [root@001db01b cluster]# pcs constraint Location Constraints: > >> > Ordering Constraints: > >> > promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory) > >> > promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory) > >> > start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory) > >> > start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory) > >> > start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory) > >> > start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory) > >> > start p_vip_clust01 then start p_mysql_001 (kind:Mandatory) > >> > Colocation Constraints: > >> > p_azip_clust01 with p_vip_clust01 (score:INFINITY) > >> > p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master) > >> > p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master) > >> > p_vip_clust01 with p_fs_clust01 (score:INFINITY) > >> > p_vip_clust02 with p_fs_clust02 (score:INFINITY) > >> > p_azip_clust02 with p_vip_clust02 (score:INFINITY) > >> > p_mysql_001 with p_vip_clust01 (score:INFINITY) Ticket Constraints: > >> > > >> > But when I issue a move command, nothing at all happens. > >> > > >> > I see this in the log on one node… > >> > > >> > Aug 01 03:21:57 [1655
[ClusterLabs] What am I Doing Wrong with Constraints?
I don't understand why a problem with a resource causes other resources above it in the dependency stack (or on the same level with it) to fail over. My dependency stack is:

drbd -> filesystem -> floating_ip -> Azure virtual IP
                                 |-> MySQL_instance_1
                                 |-> MySQL_instance_2

Note that the MySQL instances are dependent on the floating IP, but not on each other. However, if one of the MySQL instances has a problem that causes it to go into a FAIL status, the whole cluster fails over. Or if the Azure virtual IP resource has a problem and I need to run a cleanup, the whole cluster fails over.

Here's what my resources look like...

[root@001db01b mysql]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Mon Aug 6 12:52:44 2018
Last change: Mon Aug 6 12:18:38 2018 by root via cibadmin on 001db01a

2 nodes configured
11 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

 p_vip_clust01  (ocf::heartbeat:IPaddr2):       Started 001db01b
 p_azip_clust01 (ocf::heartbeat:AZaddr2):       Started 001db01b
 Master/Slave Set: ms_drbd0 [p_drbd0]
     Masters: [ 001db01b ]
     Slaves: [ 001db01a ]
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ 001db01a ]
     Slaves: [ 001db01b ]
 p_fs_clust01   (ocf::heartbeat:Filesystem):    Started 001db01b
 p_fs_clust02   (ocf::heartbeat:Filesystem):    Started 001db01a
 p_vip_clust02  (ocf::heartbeat:IPaddr2):       Started 001db01a
 p_azip_clust02 (ocf::heartbeat:AZaddr2):       Started 001db01a
 p_mysql_001    (lsb:mysql_001):        Started 001db01b

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled

Here's what my constraints look like...

[root@001db01b mysql]# pcs constraint --full
Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory) (id:order-ms_drbd0-p_fs_clust01-mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory) (id:order-ms_drbd1-p_fs_clust02-mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory) (id:order-p_fs_clust01-p_vip_clust01-mandatory)
  start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory) (id:order-p_vip_clust01-p_azip_clust01-mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory) (id:order-p_fs_clust02-p_vip_clust02-mandatory)
  start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory) (id:order-p_vip_clust02-p_azip_clust02-mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory) (id:order-p_vip_clust01-p_mysql_001-mandatory)
Colocation Constraints:
  p_azip_clust01 with p_vip_clust01 (score:INFINITY) (id:colocation-p_azip_clust01-p_vip_clust01-INFINITY)
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master) (id:colocation-p_fs_clust01-ms_drbd0-INFINITY)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master) (id:colocation-p_fs_clust02-ms_drbd1-INFINITY)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY) (id:colocation-p_vip_clust01-p_fs_clust01-INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY) (id:colocation-p_vip_clust02-p_fs_clust02-INFINITY)
  p_azip_clust02 with p_vip_clust02 (score:INFINITY) (id:colocation-p_azip_clust02-p_vip_clust02-INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY) (id:colocation-p_mysql_001-p_vip_clust01-INFINITY)
Ticket Constraints:

___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Different Times in the Corosync Log?
The corosync log shows different times for lrmd messages than for cib or crmd messages. Note the 4 hour difference. What?

Aug 20 13:08:27 [107884] 001store01a cib:     info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_replicator']/lrm_rsc_op[@id='p_replicator_monitor_6']: @transition-magic=0:0;9:251:0:283f3d6c-2e91-4f61-95dd-306d3e1eb052, @call-id=361, @rc-code=0, @op-status=0, @exec-time=5
Aug 20 13:08:27 [107884] 001store01a cib:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=001store01a/crmd/2451, version=0.30.5)
Aug 20 13:08:32 [107884] 001store01a cib:     info: cib_process_ping: Reporting our current digest to 001store01b: f33ca1999ceeb68d22f3171f03be2638 for 0.30.5 (0x55b38bd96280 0)
Aug 20 17:08:45 [107886] 001store01a lrmd: warning: child_timeout_callback: p_azip_ftpclust01_monitor_0 process (PID 52488) timed out
Aug 20 17:08:45 [107886] 001store01a lrmd: warning: operation_finished: p_azip_ftpclust01_monitor_0:52488 - timed out after 2ms
Aug 20 13:08:45 [107889] 001store01a crmd:    error: process_lrm_event: Result of probe operation for p_azip_ftpclust01 on 001store01a: Timed Out | call=359 key=p_azip_ftpclust01_monitor_0 timeout=2ms
Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/2452)
Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: Diff: --- 0.30.5 2
Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: Diff: +++ 0.30.6 (null)
Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: + /cib: @num_updates=6
Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_azip_ftpclust01']/lrm_rsc_op[@id='p_azip_ftpclust01_last_0']: @transition-magic=2:1;3:251:7:283f3d6c-2e91-4f61-95dd-306d3e1eb052, @call-id=359, @rc-code=1, @op-status=2, @exec-time=20002
Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_azip_ftpclust01']:
Re: [ClusterLabs] Antw: Different Times in the Corosync Log?
> Hi!
>
> I could guess that the processes run with different timezone settings (for
> whatever reason).
>
> Regards,
> Ulrich

That would be my guess, too, but I cannot imagine how they ended up in that condition.

> >>> Eric Robinson wrote on 21.08.2018 at 02:43 in message
> > The corosync log show different times for lrmd messages than for cib or crmd
> > messages. Note the 4 hour difference. What?
> >
> > Aug 20 13:08:27 [107884] 001store01a cib:     info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_replicator']/lrm_rsc_op[@id='p_replicator_monitor_6']: @transition-magic=0:0;9:251:0:283f3d6c-2e91-4f61-95dd-306d3e1eb052, @call-id=361, @rc-code=0, @op-status=0, @exec-time=5
> > Aug 20 13:08:27 [107884] 001store01a cib:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=001store01a/crmd/2451, version=0.30.5)
> > Aug 20 13:08:32 [107884] 001store01a cib:     info: cib_process_ping: Reporting our current digest to 001store01b: f33ca1999ceeb68d22f3171f03be2638 for 0.30.5 (0x55b38bd96280 0)
> > Aug 20 17:08:45 [107886] 001store01a lrmd: warning: child_timeout_callback: p_azip_ftpclust01_monitor_0 process (PID 52488) timed out
> > Aug 20 17:08:45 [107886] 001store01a lrmd: warning: operation_finished: p_azip_ftpclust01_monitor_0:52488 - timed out after 2ms
> > Aug 20 13:08:45 [107889] 001store01a crmd:    error: process_lrm_event: Result of probe operation for p_azip_ftpclust01 on 001store01a: Timed Out | call=359 key=p_azip_ftpclust01_monitor_0 timeout=2ms
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/2452)
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: Diff: --- 0.30.5 2
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: Diff: +++ 0.30.6 (null)
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: + /cib: @num_updates=6
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_azip_ftpclust01']/lrm_rsc_op[@id='p_azip_ftpclust01_last_0']: @transition-magic=2:1;3:251:7:283f3d6c-2e91-4f61-95dd-306d3e1eb052, @call-id=359, @rc-code=1, @op-status=2, @exec-time=20002
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_azip_ftpclust01']: operation_key="p_azip_ftpclust01_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="3:251:7:283f3d6c-2e91-4f61-95dd-306d3e1eb052" transition-magic="2:1;3:251:7:283f3d6c-2e91-4f61-95dd-306d3e1eb052" exit-reason="" on_node="001
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=001store01a/crmd/2452, version=0.30.6)
> > Aug 20 13:08:45 [107889] 001store01a crmd:    info: do_lrm_rsc_op: Performing key=3:252:0:283f3d6c-2e91-4f61-95dd-306d3e1eb052 op=p_azip_ftpclust01_stop_0
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/2453)
> > Aug 20 17:08:45 [107886] 001store01a lrmd:    info: log_execute: executing - rsc:p_azip_ftpclust01 action:stop call_id:362
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: Diff: --- 0.30.6 2
> > Aug 20 13:08:45 [107884] 001store01a cib:     info: cib_perform_op: Diff: +++ 0.30.7 (null)
> >
> > [sig]
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Different Times in the Corosync Log?
> -Original Message-
> From: Users On Behalf Of Jan Pokorný
> Sent: Tuesday, August 21, 2018 2:45 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Different Times in the Corosync Log?
>
> On 21/08/18 08:43 +, Eric Robinson wrote:
> >> I could guess that the processes run with different timezone settings
> >> (for whatever reason).
> >
> > That would be my guess, too, but I cannot imagine how they ended up in
> > that condition.
>
> Hard to guess, the PIDs indicate the expected state of covering a very short
> interval sequentially (i.e. no intermittent failure recovered with a restart of
> lrmd, AFAICT). In case it can have any bearing, how do you start pacemaker --
> systemd, initscript, as a corosync plugin, something else?

Depends on how new the cluster is. With these, I start it with 'pcs cluster start'.

> --
> Nazdar,
> Jan (Poki)
Re: [ClusterLabs] Different Times in the Corosync Log?
> > Whoa, I think you win some sort of fubar prize. :-) It's always nice to feel special. > > AFAIK, any OS-level time or timezone change affects all processes equally. (I > occasionally deal with cluster logs where the OS time jumped backward or > forward, and all logs system-wide are equally > affected.) > Except when you're visiting insane-world, which I seem to be. > Some applications have their own timezone setting that can override the > system default, but pacemaker isn't one of them. It's even more bizarre when > you consider that the daemons here are the children of the same process > (pacemakerd), and thus have an identical set of environment variables and so > forth. (And as Jan pointed out, they appear to have been started within a > fraction of a second of each other.) > > Apparently there is a dateshift kernel module that can put particular > processes > in different apparent times, but I assume you'd know if you did that on > purpose. > :-) It does occur to me that the module would be a great prank to play on > someone (especially combined with a cron job that randomly altered the > configuration). > > If you figure this out, I'd love to hear what it was. Gremlins ... You'll be the second to know after me! > > On Tue, 2018-08-21 at 11:45 +0200, Jan Pokorný wrote: > > On 21/08/18 08:43 +, Eric Robinson wrote: > > > > I could guess that the processes run with different timezone > > > > settings (for whatever reason). > > > > > > That would be my guess, too, but I cannot imagine how they ended up > > > in that condition. > > > > Hard to guess, the PIDs indicate the expected state of covering a very > > short interval sequentially (i.e. no intermittent failure recovered > > with a restart of lrmd, AFAICT). In case it can have any bearing, how > > do you start pacemaker -- systemd, initscript, as a corosync plugin, > > something else? 
> --
> Ken Gaillot
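[Editorial note: the mechanism behind a 4-hour skew like the one in this thread can be demonstrated without a cluster. This is an illustration, not a diagnosis: "America/New_York" is an assumed zone, chosen only because US Eastern daylight time is exactly four hours behind UTC, matching the offset in the log; the PIDs in the commented check are the ones from the log.]

```shell
# Render one instant (2018-08-20 17:08:27 UTC) under two timezones.
LC_ALL=C TZ=America/New_York date -d @1534784907 '+%b %d %H:%M:%S'  # Aug 20 13:08:27
LC_ALL=C TZ=UTC date -d @1534784907 '+%b %d %H:%M:%S'               # Aug 20 17:08:27

# To compare the actual environment of two running daemons (requires root):
#   tr '\0' '\n' < /proc/107884/environ | grep '^TZ='   # cib
#   tr '\0' '\n' < /proc/107886/environ | grep '^TZ='   # lrmd
```

If one daemon inherited a `TZ` variable the others did not, its log timestamps would diverge exactly like this while the underlying clock stays identical.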
[ClusterLabs] Increasing Token Timeout Safe By Itself?
I have a few corosync+pacemaker clusters in Azure. Occasionally, cluster nodes fail over, possibly because of intermittent connectivity loss, but more likely because one or more nodes experiences high load and is not able to respond in a timely fashion. I want to make the clusters a little more resilient to such conditions (i.e., allow clusters more time to recover naturally before failing over). Is it a simple matter of increasing the totem.token timeout from the default value? Or are there other things that should be changed as well? And once the value is increased, how do I make it active without restarting the cluster?

--Eric
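[Editorial note: for reference, a hedged corosync.conf sketch. The values are illustrative; only the token line is the point, and the rest of the section must match your existing file.]

```
totem {
    version: 2
    cluster_name: 001db01ab
    # corosync 2.x defaults to token: 1000 (milliseconds). Raising it gives
    # a heavily loaded node more time to respond before the membership
    # protocol declares it dead and triggers failover.
    token: 10000
}
```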
Re: [ClusterLabs] Increasing Token Timeout Safe By Itself?
> -Original Message-
> From: Jan Friesse
> Sent: Sunday, January 20, 2019 11:57 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson
> Subject: Re: [ClusterLabs] Increasing Token Timeout Safe By Itself?
>
> Eric Robinson wrote:
> > I have a few corosync+pacemaker clusters in Azure. Occasionally,
> > cluster nodes failover, possibly because of intermittent connectivity
> > loss, but more likely because one or more nodes experiences high load
> > and is not able to respond in a timely fashion. I want to make the
> > clusters a little more resilient to such conditions (i.e., allow
> > clusters more time to recover naturally before failing over). Is it a
> > simple matter of increasing the totem.token timeout from the default
> > value? Or are there other things that should be changes as well? And
> > once the value is increased, how do I make it active without
> > restarting the cluster?
>
> Usually it is really enough to increase totem.token. Used token timeout is
> computed based on this value (see corosync.conf man page for more
> details). It's possible to get used value by executing "corosync-cmapctl
> -g runtime.config.totem.token" command.
>
> You can either edit the config file (ideally on all nodes) and exec "corosync-cfgtool
> -R" (just on one node) or you can use "corosync-cmapctl -s totem.token u32
> $REQUIRED_VALUE" (ideally on all nodes). Also pcs/crmshell may also
> support this functionality.
>
> Honza

Thank you very much for the feedback!

--Eric
Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?
> -Original Message-
> From: Users On Behalf Of Andrei Borzenkov
> Sent: Wednesday, February 20, 2019 8:51 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
>
> 20.02.2019 21:51, Eric Robinson wrote:
> >
> > The following should show OK in a fixed font like Consolas, but the
> > following setup is supposed to be possible, and is even referenced in the
> > ClusterLabs documentation.
> >
> > +----------+
> > | mysql001 +--+
> > +----------+  |
> > +----------+  |
> > | mysql002 +--+
> > +----------+  |
> > +----------+  |   +-------------+    +------------+    +----------+
> > | mysql003 +--+-->+ floating ip +--->+ filesystem +--->+ blockdev |
> > +----------+  |   +-------------+    +------------+    +----------+
> > +----------+  |
> > | mysql004 +--+
> > +----------+  |
> > +----------+  |
> > | mysql005 +--+
> > +----------+
> >
> > In the layout above, the MySQL instances are dependent on the same
> > underlying service stack, but they are not dependent on each other.
> > Therefore, as I understand it, the failure of one MySQL instance should
> > not cause the failure of other MySQL instances if on-fail=ignore or
> > on-fail=stop is set. At least, that's the way it seems to me, but based
> > on the thread, I guess it does not behave that way.
>
> This works this way for monitor operation if you set on-fail=block.
> Failed resource is left "as is". The only case when it does not work seems to
> be stop operation; even with explicit on-fail=block it still attempts to initiate
> follow up actions. I still consider this a bug.
> If this is not a bug, this needs clear explanation in documentation.
>
> But please understand that assuming on-fail=block works you effectively
> reduce your cluster to controlled start of resources during boot.

Or failover, correct?

> As we have seen, stopping of resource IP is blocked, meaning pacemaker also
> cannot perform resource level recovery at all. And for mysql resources you
> explicitly ignore any result of monitoring or failure to stop it.
> And not having stonith also prevents pacemaker from handling node failure.
> What leaves is at most restart of resources on another node during graceful
> shutdown.
>
> It begs a question - what do you need such "cluster" for at all?

Mainly to manage the other relevant resources: drbd, filesystem, and floating IP. I'm content to forego resource level recovery for MySQL services and monitor their health from outside the cluster and remediate them manually if necessary. I don't see an option if I want to avoid the sort of deadlock situation we talked about earlier.

> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
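[Editorial note: the per-operation failure policy being discussed can be expressed in pcs. A hedged, untested sketch — resource and operation names follow this thread, and the exact syntax should be verified against your pcs version:]

```shell
# Record MySQL monitor failures without triggering recovery, and block
# (leave the resource as-is, requiring manual cleanup) rather than
# escalate if a stop fails:
pcs resource update p_mysql_001 \
    op monitor interval=15 timeout=15 on-fail=ignore \
    op stop interval=0s timeout=45 on-fail=block
```

As Andrei notes above, on-fail=block on a stop operation has limits, so this trades automatic recovery for containment of failures.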
Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?
> -Original Message-
> From: Users On Behalf Of Ulrich Windl
> Sent: Tuesday, February 19, 2019 11:35 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
>
> >>> Eric Robinson <eric.robin...@psmnv.com> wrote on 19.02.2019 at 21:06 in message
>
> >> -Original Message-
> >> From: Users <users-boun...@clusterlabs.org> On Behalf Of Ken Gaillot
> >> Sent: Tuesday, February 19, 2019 10:31 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed <users@clusterlabs.org>
> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> >> One Fails?
> >>
> >> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> >> > > -Original Message-
> >> > > From: Users <users-boun...@clusterlabs.org> On Behalf Of Andrei
> >> > > Borzenkov
> >> > > Sent: Sunday, February 17, 2019 11:56 AM
> >> > > To: users@clusterlabs.org
> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> >> > > Just One Fails?
> >> > >
> >> > > 17.02.2019 0:44, Eric Robinson wrote:
> >> > > > Thanks for the feedback, Andrei.
> >> > > >
> >> > > > I only want cluster failover to occur if the filesystem or drbd
> >> > > > resources fail, or if the cluster messaging layer detects a
> >> > > > complete node failure. Is there a way to tell PaceMaker not to
> >> > > > trigger a cluster failover if any of the p_mysql resources fail?
> >> > >
> >> > > Let's look at this differently. If all these applications depend
> >> > > on each other, you should not be able to stop individual resource
> >> > > in the first place - you need to group them or define dependency
> >> > > so that stopping any resource would stop everything.
> >> > >
> >> > > If these applications are independent, they should not share
> >> > > resources.
> >> > > Each MySQL application should have own IP and own FS and own
> >> > > block device for this FS so that they can be moved between
> >> > > cluster nodes independently.
> >> > >
> >> > > Anything else will lead to troubles as you already observed.
> >> >
> >> > FYI, the MySQL services do not depend on each other. All of them
> >> > depend on the floating IP, which depends on the filesystem, which
> >> > depends on DRBD, but they do not depend on each other. Ideally, the
> >> > failure of p_mysql_002 should not cause failure of other mysql
> >> > resources, but now I understand why it happened. Pacemaker wanted
> >> > to start it on the other node, so it needed to move the floating
> >> > IP, filesystem, and DRBD primary, which had the cascade effect of
> >> > stopping the other MySQL resources.
> >> >
> >> > I think I also understand why the p_vip_clust01 resource blocked.
> >> >
> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,
> >> > but then Corosync+Pacemaker. The past 12 years have been relatively
> >> > problem free. This symptom is new for us, only within the past year.
> >> > Our cluster nodes have many separate instances of MySQL running, so
> >> > it is not practical to have that many filesystems, IPs, etc. We are
> >> > content with the way things are, except for this new troubling
> >> > behavior.
> >> >
> >> > If I understand the thread correctly, on-fail=stop will not work
> >> > because the cluster will still try to stop the resources that are
> >> > implied dependencies.
> >> >
> >> > Bottom line is, how do we configure the cluster in such a way that
> >> > there are no cascading circumstances when a single MySQL resource fails?
Re: [ClusterLabs] Simulate Failure Behavior
> -Original Message-
> From: Users On Behalf Of Ken Gaillot
> Sent: Friday, February 22, 2019 5:06 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
>
> Subject: Re: [ClusterLabs] Simulate Failure Behavior
>
> On Sat, 2019-02-23 at 00:28 +0000, Eric Robinson wrote:
> > I want to mess around with different on-fail options and see how the
> > cluster responds. I'm looking through the documentation, but I don't
> > see a way to simulate resource failure and observe behavior without
> > actually failing over the node. Isn't there a way to have the cluster
> > MODEL failure and simply report what it WOULD do?
> >
> > --Eric
>
> Yes, appropriately enough it is called crm_simulate :)

Thanks. I knew about crm_simulate, but I thought that was really old stuff and might not apply in the pcs world.

> The documentation is not exactly great, but see:
>
> https://wiki.clusterlabs.org/wiki/Using_crm_simulate
>
> along with the man page and:
>
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/index.html#s-config-testing-changes
>
> --
> Ken Gaillot
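[Editorial note: a hedged sketch of the kind of dry run being discussed. The resource and node names follow this thread's cluster; verify the flag spellings and the op-inject interval format (milliseconds) against your version's crm_simulate man page before relying on it.]

```shell
# Against a copy of the live CIB, inject a failed monitor (rc=7, "not
# running") for p_mysql_002 on 001db01a and show the transition the
# cluster WOULD compute, without touching any real resources:
crm_simulate --live-check --simulate \
    --op-inject p_mysql_002_monitor_15000@001db01a=7
```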
[ClusterLabs] Simulate Failure Behavior
I want to mess around with different on-fail options and see how the cluster responds. I'm looking through the documentation, but I don't see a way to simulate resource failure and observe behavior without actually failing over the node. Isn't there a way to have the cluster MODEL failure and simply report what it WOULD do?

--Eric
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
: p_mysql_004 (class=lsb type=mysql_004)
  Operations: force-reload interval=0s timeout=15 (p_mysql_004-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_004-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_004-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_004-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_004-stop-interval-0s)
 Resource: p_mysql_005 (class=lsb type=mysql_005)
  Operations: force-reload interval=0s timeout=15 (p_mysql_005-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_005-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_005-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_005-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_005-stop-interval-0s)
 Resource: p_mysql_006 (class=lsb type=mysql_006)
  Operations: force-reload interval=0s timeout=15 (p_mysql_006-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_006-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_006-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_006-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_006-stop-interval-0s)
 Resource: p_mysql_007 (class=lsb type=mysql_007)
  Operations: force-reload interval=0s timeout=15 (p_mysql_007-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_007-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_007-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_007-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_007-stop-interval-0s)
 Resource: p_mysql_008 (class=lsb type=mysql_008)
  Operations: force-reload interval=0s timeout=15 (p_mysql_008-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_008-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_008-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_008-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_008-stop-interval-0s)
 Resource: p_mysql_622 (class=lsb type=mysql_622)
  Operations: force-reload interval=0s timeout=15 (p_mysql_622-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_622-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_622-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_622-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_622-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: p_vip_clust02
    Enabled on: 001db01b (score:INFINITY) (role: Started) (id:cli-prefer-p_vip_clust02)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: 001db01ab
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1550347798
 maintenance-mode: false
 no-quorum-policy: ignore
 stonith-enabled: false

--Eric

From: Users On Behalf Of Eric Robinson
Sent: Saturday, February 16, 2019 12:34 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [ClusterLabs] Why Do All The Services Go
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
Here are the relevant corosync logs. It appears that the stop action for resource p_mysql_002 failed, and that caused a cascading series of service changes. However, I don't understand why, since no other resources are dependent on p_mysql_002.

[root@001db01a cluster]# cat corosync_filtered.log
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_perform_op: Diff: --- 0.345.30 2
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_perform_op: Diff: +++ 0.346.0 cc0da1b030418ec8b7c72db1115e2af1
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_perform_op: + /cib: @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='p_mysql_002']:
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a     cib:     info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=001db01a/cibadmin/2, version=0.346.0)
Feb 16 14:06:24 [3913] 001db01a    crmd:     info: abort_transition_graph: Transition aborted by meta_attributes.p_mysql_002-meta_attributes 'create': Configuration change | cib=0.346.0 source=te_update_diff:456 path=/cib/configuration/resources/primitive[@id='p_mysql_002'] complete=true
Feb 16 14:06:24 [3913] 001db01a    crmd:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Feb 16 14:06:24 [3912] 001db01a pengine:   notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_online_status: Node 001db01b is online
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_online_status: Node 001db01a is online
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_drbd0:0 active in master mode on 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_drbd1:0 active on 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_mysql_004 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_mysql_005 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_drbd0:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_drbd1:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_mysql_001 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: determine_op_status: Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: unpack_node_loop: Node 2 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine:     info: unpack_node_loop: Node 1 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine:     info: unpack_node_loop: Node 2 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine:     info: unpack_node_loop: Node 1 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine:     info: common_print: p_vip_clust01 (ocf::heartbeat:IPaddr2): Started 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: clone_print: Master/Slave Set: ms_drbd0 [p_drbd0]
Feb 16 14:06:24 [3912] 001db01a pengine:     info: short_print:     Masters: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01a pengine:     info: short_print:     Slaves: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01a pengine:     info: clone_print: Master/Slave Set: ms_drbd1 [p_drbd1]
Feb 16 14:06:24 [3912] 001db01a pengine:     info: short_print:     Masters: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01a pengine:     info: short_print:     Slaves: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01a pengine:     info: common_print: p_fs_clust01 (ocf::heartbeat:Filesystem): Started 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine:     info: common_print: p_fs_clust02 (ocf::heartbeat:Filesystem): Started 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine:
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
Thanks for the feedback, Andrei.

I only want cluster failover to occur if the filesystem or drbd resources fail, or if the cluster messaging layer detects a complete node failure. Is there a way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql resources fail?

> -Original Message-
> From: Users On Behalf Of Andrei Borzenkov
> Sent: Saturday, February 16, 2019 1:34 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> 17.02.2019 0:03, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and that
> > caused a cascading series of service changes. However, I don't understand
> > why, since no other resources are dependent on p_mysql_002.
>
> You have mandatory colocation constraints for each SQL resource with VIP. It
> means that to move SQL resource to another node pacemaker also must
> move VIP to another node which in turn means it needs to move all other
> dependent resources as well.
> ...
> Feb 16 14:06:39 [3912] 001db01a pengine:  warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
> ...
> Feb 16 14:06:39 [3912] 001db01a pengine:   notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
> ...
> Feb 16 14:06:39 [3912] 001db01a pengine:   notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
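[Editorial note: the quoted log shows p_mysql_002 being forced away after hitting its migration-threshold, which is what turns one instance's trouble into a cluster-wide move. A hedged, untested sketch of one way to keep a flapping instance from being banished — verify the syntax against your pcs version:]

```shell
# Clear the accumulated failcount so the constraint solver stops avoiding
# this node, then stop counting failures toward forced migration:
pcs resource cleanup p_mysql_002
pcs resource meta p_mysql_002 migration-threshold=INFINITY
```

Note this only prevents the "forcing away" behavior; a failed stop or a mandatory colocation can still drag the VIP stack, as discussed elsewhere in this thread.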
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> On Sat, Feb 16, 2019 at 09:33:42PM +0000, Eric Robinson wrote: > > I just noticed that. I also noticed that the lsb init script has a > > hard-coded stop timeout of 30 seconds. So if the init script waits > > longer than the cluster resource timeout of 15s, that would cause the > > Yes, you should use higher timeouts in pacemaker (45s for example). > > > resource to fail. However, I don't want cluster failover to be > > triggered by the failure of one of the MySQL resources. I only want > > cluster failover to occur if the filesystem or drbd resources fail, or > > if the cluster messaging layer detects a complete node failure. Is > > there a way to tell PaceMaker not to trigger cluster failover if any > > of the p_mysql resources fail? > > You can try playing with the on-fail option but I'm not sure how reliably this > whole setup will work without some form of fencing/stonith. > > https://clusterlabs.org/pacemaker/doc/en- > US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what I'm looking for, at least for the MySQL resources. > > -- > Valentin > ___ > Users mailing list: Users@clusterlabs.org > https://lists.clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org Getting started: > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org ___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Why Do All The Services Go Down When Just One Fails?
These are the resources on our cluster.

[root@001db01a ~]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Feb 16 15:24:55 2019
Last change: Sat Feb 16 15:10:21 2019 by root via cibadmin on 001db01b

2 nodes configured
18 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

 p_vip_clust01  (ocf::heartbeat:IPaddr2):       Started 001db01a
 Master/Slave Set: ms_drbd0 [p_drbd0]
     Masters: [ 001db01a ]
     Slaves: [ 001db01b ]
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ 001db01b ]
     Slaves: [ 001db01a ]
 p_fs_clust01   (ocf::heartbeat:Filesystem):    Started 001db01a
 p_fs_clust02   (ocf::heartbeat:Filesystem):    Started 001db01b
 p_vip_clust02  (ocf::heartbeat:IPaddr2):       Started 001db01b
 p_mysql_001    (lsb:mysql_001):        Started 001db01a
 p_mysql_000    (lsb:mysql_000):        Started 001db01a
 p_mysql_002    (lsb:mysql_002):        Started 001db01a
 p_mysql_003    (lsb:mysql_003):        Started 001db01a
 p_mysql_004    (lsb:mysql_004):        Started 001db01a
 p_mysql_005    (lsb:mysql_005):        Started 001db01a
 p_mysql_006    (lsb:mysql_006):        Started 001db01b
 p_mysql_007    (lsb:mysql_007):        Started 001db01b
 p_mysql_008    (lsb:mysql_008):        Started 001db01b
 p_mysql_622    (lsb:mysql_622):        Started 001db01a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Why is it that when one of the resources that start with p_mysql_* goes into a FAILED state, all the other MySQL services also stop?
[root@001db01a ~]# pcs constraint
Location Constraints:
  Resource: p_vip_clust02
    Enabled on: 001db01b (score:INFINITY) (role: Started)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

--Eric
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> -Original Message- > From: Users On Behalf Of Valentin Vidic > Sent: Saturday, February 16, 2019 1:28 PM > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One > Fails? > > On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote: > > Here are the relevant corosync logs. > > > > It appears that the stop action for resource p_mysql_002 failed, and > > that caused a cascading series of service changes. However, I don't > > understand why, since no other resources are dependent on p_mysql_002. > > The stop failed because of a timeout (15s), so you can try to update that > value: > I just noticed that. I also noticed that the lsb init script has a hard-coded stop timeout of 30 seconds. So if the init script waits longer than the cluster resource timeout of 15s, that would cause the resource to fail. However, I don't want cluster failover to be triggered by the failure of one of the MySQL resources. I only want cluster failover to occur if the filesystem or drbd resources fail, or if the cluster messaging layer detects a complete node failure. Is there a way to tell PaceMaker not to trigger cluster failover if any of the p_mysql resources fail? > Result of stop operation for p_mysql_002 on 001db01a: Timed Out | > call=1094 key=p_mysql_002_stop_0 timeout=15000ms > > After the stop failed it should have fenced that node, but you don't have > fencing configured so it tries to move mysql_002 and all the other resources > related to it (vip, fs, drbd) to the other node. > Since other mysql resources depend on the same (vip, fs, drbd) they need to > be stopped first. 
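[Editor's illustration] Per Valentin's point above, the init script's hard-coded 30-second stop can outlast the cluster's 15s operation timeout, which is exactly what produced the "Timed Out" stop result. The fix is to raise the pacemaker-side stop timeout above anything the script can legitimately take. A hedged pcs sketch (resource name taken from the thread; the 45s value is illustrative — verify syntax and pick a value for your environment):

```shell
# Raise the stop timeout above the init script's hard-coded 30s,
# so a slow-but-successful stop is not reported as a failure.
pcs resource update p_mysql_002 op stop timeout=45s
```

A failed *stop* is the most dangerous failure in Pacemaker: without fencing, the cluster cannot safely assume the service is down, which is what triggered the cascade described here.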
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
I'm looking through the docs but I don't see how to set the on-fail value for a resource. > -Original Message- > From: Users On Behalf Of Eric Robinson > Sent: Saturday, February 16, 2019 1:47 PM > To: Cluster Labs - All topics related to open-source clustering welcomed > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One > Fails? > > > On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote: > > > I just noticed that. I also noticed that the lsb init script has a > > > hard-coded stop timeout of 30 seconds. So if the init script waits > > > longer than the cluster resource timeout of 15s, that would cause > > > the > > > > Yes, you should use higher timeouts in pacemaker (45s for example). > > > > > resource to fail. However, I don't want cluster failover to be > > > triggered by the failure of one of the MySQL resources. I only want > > > cluster failover to occur if the filesystem or drbd resources fail, > > > or if the cluster messaging layer detects a complete node failure. > > > Is there a way to tell PaceMaker not to trigger cluster failover if > > > any of the p_mysql resources fail? > > > > You can try playing with the on-fail option but I'm not sure how > > reliably this whole setup will work without some form of fencing/stonith. > > > > https://clusterlabs.org/pacemaker/doc/en- > > > US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html > > Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what > I'm > looking for, at least for the MySQL resources. 
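[Editor's illustration] On the "how to set on-fail" question: on-fail is a property of an *operation* (usually the recurring monitor), not of the resource itself, which may be why it is hard to find under resource options in the docs. A hedged pcs sketch, with names and values from this thread — confirm against your pcs version:

```shell
# Ignore monitor failures for this resource instead of recovering it;
# the failure is still logged and visible in cluster status.
pcs resource update p_mysql_001 op monitor interval=60s on-fail=ignore
```

The interval shown is illustrative; `pcs resource show p_mysql_001` (or `pcs resource config` on newer pcs) displays the currently configured operations so the update matches the existing monitor.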
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> -Original Message- > From: Users On Behalf Of Ken Gaillot > Sent: Tuesday, February 19, 2019 10:31 AM > To: Cluster Labs - All topics related to open-source clustering welcomed > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One > Fails? > > On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote: > > > -Original Message- > > > From: Users On Behalf Of Andrei > > > Borzenkov > > > Sent: Sunday, February 17, 2019 11:56 AM > > > To: users@clusterlabs.org > > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just > > > One Fails? > > > > > > 17.02.2019 0:44, Eric Robinson пишет: > > > > Thanks for the feedback, Andrei. > > > > > > > > I only want cluster failover to occur if the filesystem or drbd > > > > resources fail, > > > > > > or if the cluster messaging layer detects a complete node failure. > > > Is there a > > > way to tell PaceMaker not to trigger a cluster failover if any of > > > the p_mysql resources fail? > > > > > > > > > > Let's look at this differently. If all these applications depend on > > > each other, you should not be able to stop individual resource in > > > the first place - you need to group them or define dependency so > > > that stopping any resource would stop everything. > > > > > > If these applications are independent, they should not share > > > resources. > > > Each MySQL application should have own IP and own FS and own block > > > device for this FS so that they can be moved between cluster nodes > > > independently. > > > > > > Anything else will lead to troubles as you already observed. > > > > FYI, the MySQL services do not depend on each other. All of them > > depend on the floating IP, which depends on the filesystem, which > > depends on DRBD, but they do not depend on each other. Ideally, the > > failure of p_mysql_002 should not cause failure of other mysql > > resources, but now I understand why it happened. 
Pacemaker wanted to > > start it on the other node, so it needed to move the floating IP, > > filesystem, and DRBD primary, which had the cascade effect of stopping > > the other MySQL resources. > > > > I think I also understand why the p_vip_clust01 resource blocked. > > > > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but > > then Corosync+Pacemaker. The past 12 years have been relatively > > problem free. This symptom is new for us, only within the past year. > > Our cluster nodes have many separate instances of MySQL running, so it > > is not practical to have that many filesystems, IPs, etc. We are > > content with the way things are, except for this new troubling > > behavior. > > > > If I understand the thread correctly, op-fail=stop will not work > > because the cluster will still try to stop the resources that are > > implied dependencies. > > > > Bottom line is, how do we configure the cluster in such a way that > > there are no cascading circumstances when a MySQL resource fails? > > Basically, if a MySQL resource fails, it fails. We'll deal with that > > on an ad-hoc basis. I don't want the whole cluster to barf. What about > > op-fail=ignore? Earlier, you suggested symmetrical=false might also do > > the trick, but you said it comes with its own can or worms. > > What are the downsides with op-fail=ignore or asymmetrical=false? > > > > --Eric > > Even adding on-fail=ignore to the recurring monitors may not do what you > want, because I suspect that even an ignored failure will make the node less > preferable for all the other resources. But it's worth testing. > > Otherwise, your best option is to remove all the recurring monitors from the > mysql resources, and rely on external monitoring (e.g. nagios, icinga, monit, > ...) to detect problems. This is probably a dumb question, but can we remove just the monitor operation but leave the resource configured in the cluster? 
If a node fails over, we do want the resources to start automatically on the new primary node. > -- > Ken Gaillot > > ___ > Users mailing list: Users@clusterlabs.org > https://lists.clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org Getting started: > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org ___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
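[Editor's note] To answer the question above: yes — the recurring monitor operation can be removed while the resource itself stays configured. Pacemaker will still start and stop the resource during failover; it simply stops polling it for failures, which is what Ken's external-monitoring suggestion relies on. A hedged sketch (the 15s interval is an assumption — check the actual configured interval first, since `pcs resource op remove` matches on the listed properties):

```shell
# Inspect the currently configured operations for the resource
pcs resource show p_mysql_001

# Remove only the recurring monitor; start/stop operations remain
pcs resource op remove p_mysql_001 monitor interval=15s
```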
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> -Original Message- > From: Users On Behalf Of Andrei > Borzenkov > Sent: Sunday, February 17, 2019 11:56 AM > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One > Fails? > > 17.02.2019 0:44, Eric Robinson пишет: > > Thanks for the feedback, Andrei. > > > > I only want cluster failover to occur if the filesystem or drbd resources > > fail, > or if the cluster messaging layer detects a complete node failure. Is there a > way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql > resources fail? > > > > Let's look at this differently. If all these applications depend on each > other, > you should not be able to stop individual resource in the first place - you > need to group them or define dependency so that stopping any resource > would stop everything. > > If these applications are independent, they should not share resources. > Each MySQL application should have own IP and own FS and own block > device for this FS so that they can be moved between cluster nodes > independently. > > Anything else will lead to troubles as you already observed. FYI, the MySQL services do not depend on each other. All of them depend on the floating IP, which depends on the filesystem, which depends on DRBD, but they do not depend on each other. Ideally, the failure of p_mysql_002 should not cause failure of other mysql resources, but now I understand why it happened. Pacemaker wanted to start it on the other node, so it needed to move the floating IP, filesystem, and DRBD primary, which had the cascade effect of stopping the other MySQL resources. I think I also understand why the p_vip_clust01 resource blocked. FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then Corosync+Pacemaker. The past 12 years have been relatively problem free. This symptom is new for us, only within the past year. 
Our cluster nodes have many separate instances of MySQL running, so it is not practical to have that many filesystems, IPs, etc. We are content with the way things are, except for this new troubling behavior. If I understand the thread correctly, on-fail=stop will not work because the cluster will still try to stop the resources that are implied dependencies. Bottom line is, how do we configure the cluster in such a way that there are no cascading consequences when a MySQL resource fails? Basically, if a MySQL resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want the whole cluster to barf. What about on-fail=ignore? Earlier, you suggested symmetrical=false might also do the trick, but you said it comes with its own can of worms. What are the downsides of on-fail=ignore or symmetrical=false? --Eric
Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question
Roger -- Thank you, sir. That does help. -Original Message- From: Roger Zhou Sent: Wednesday, October 30, 2019 2:56 AM To: Cluster Labs - All topics related to open-source clustering welcomed ; Eric Robinson Subject: Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question On 10/30/19 6:17 AM, Eric Robinson wrote: > If I have an LV as a backing device for a DRBD disk, can someone > explain why I need an LVM filter? It seems to me that we would want > the LV to be always active under both the primary and secondary DRBD > devices, and there should be no need or desire to have the LV > activated or deactivated by Pacemaker. What am I missing? Your understanding is correct. No need to use LVM resource agent from Pacemaker in your case. --Roger > > --Eric > > Disclaimer : This email and any files transmitted with it are > confidential and intended solely for intended recipients. If you are > not the named addressee you should not disseminate, distribute, copy > or alter this email. Any views or opinions presented in this email are > solely those of the author and might not represent those of Physician > Select Management. Warning: Although Physician Select Management has > taken reasonable precautions to ensure no viruses are present in this > email, the company cannot accept responsibility for any loss or damage > arising from the use of this email or attachments. > > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. 
Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Stupid DRBD/LVM Global Filter Question
If I have an LV as a backing device for a DRBD disk, can someone explain why I need an LVM filter? It seems to me that we would want the LV to be always active under both the primary and secondary DRBD devices, and there should be no need or desire to have the LV activated or deactivated by Pacemaker. What am I missing? --Eric
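[Editor's note] As Roger confirms in the reply, with LVM *below* DRBD no filter or Pacemaker LVM agent is needed: the backing LV can stay active on both nodes. Filters matter in the opposite stacking — a PV placed *on top* of the DRBD device — because LVM would then see the same PV signature twice, once through /dev/drbdX and once through the raw backing device. A hedged lvm.conf sketch for that case (the device path is hypothetical; adapt the pattern to your backing device):

```
# /etc/lvm/lvm.conf (fragment) -- only relevant when a PV sits ON TOP
# of DRBD. Reject the backing device so the PV is scanned only via
# the DRBD device, then accept everything else.
devices {
    global_filter = [ "r|^/dev/sdb.*|", "a|.*|" ]
}
```

In the setup described in this thread (filesystem directly on the DRBD device, LV underneath), no such filter is required.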
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
> -Original Message- > From: Users On Behalf Of Strahil Nikolov > Sent: Wednesday, February 5, 2020 1:59 PM > To: Andrei Borzenkov ; users@clusterlabs.org > Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster? > > On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov > wrote: > >05.02.2020 20:55, Eric Robinson пишет: > >> The two servers 001db01a and 001db01b were up and responsive. Neither > >had been rebooted and neither were under heavy load. There's no > >indication in the logs of loss of network connectivity. Any ideas on > >why both nodes seem to think the other one is at fault? > > > >The very fact that nodes lost connection to each other *is* indication > >of network problems. Your logs start too late, after any problem > >already happened. > > > >> > >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not > >an option at this time.) > >> > >> Log from 001db01a: > >> > >> Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, > >forming new configuration. > >> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership > >(10.51.14.33:960) was formed. Members left: 2 > >> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive > >the leave message. 
failed: 2 > >> Feb 5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is > >now lost > >> Feb 5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b > >attributes for peer loss > >> Feb 5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is > >now lost > >> Feb 5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 > >and/or uname=001db01b from the membership cache > >> Feb 5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with > >id=2 and/or uname=001db01b from the membership cache > >> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect > >node 2 to be down > >> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b > >state is now lost > >> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of > >001db01b not matched > >> Feb 5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb > >> 5 08:01:03 001db01a corosync[1306]: [MAIN ] Completed service > >synchronization, ready to provide service. > >> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer > >with id=2 and/or uname=001db01b from the membership cache > >> Feb 5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b > >state is now lost > >> Feb 5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE > >-> S_POLICY_ENGINE > >> Feb 5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is > >now lost > >> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect > >node 2 to be down > >> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of > >001db01b not matched > >> Feb 5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM > >Quorum: Ignore > >> > >> From 001db01b: > >> > >> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership > >(10.51.14.34:960) was formed. 
Members left: 1 > >> Feb 5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC > >(001db01a) is dead > >> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a > >state is now lost > >> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive > >the leave message. failed: 1 > >> Feb 5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2 Feb > >> 5 08:01:03 001db01b corosync[1455]: [MAIN ] Completed service > >synchronization, ready to provide service. > >> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer > >with id=1 and/or uname=001db01a from the membership cache > >> Feb 5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a > >state is now lost > >> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition > >S_NOT_DC -> S_ELECTION > >> Feb 5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is > >now lost > >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is > >now lost > >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a > >attributes for peer loss > >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer > >001db01a > >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with > >id=1 and/or uname=00
[ClusterLabs] Why Do Nodes Leave the Cluster?
The two servers 001db01a and 001db01b were up and responsive. Neither had been rebooted and neither was under heavy load. There's no indication in the logs of loss of network connectivity. Any ideas on why both nodes seem to think the other one is at fault?

(Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option at this time.)

Log from 001db01a:

Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
Feb 5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is now lost
Feb 5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b attributes for peer loss
Feb 5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is now lost
Feb 5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
Feb 5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b state is now lost
Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
Feb 5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
Feb 5 08:01:03 001db01a corosync[1306]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
Feb 5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b state is now lost
Feb 5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is now lost
Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
Feb 5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM Quorum: Ignore

From 001db01b:

Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
Feb 5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC (001db01a) is dead
Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a state is now lost
Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
Feb 5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
Feb 5 08:01:03 001db01b corosync[1455]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
Feb 5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a state is now lost
Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_NOT_DC -> S_ELECTION
Feb 5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is now lost
Feb 5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is now lost
Feb 5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a attributes for peer loss
Feb 5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer 001db01a
Feb 5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_ELECTION -> S_INTEGRATION
Feb 5 08:01:03 001db01b cib[1688]: notice: Node 001db01a state is now lost
Feb 5 08:01:03 001db01b cib[1688]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Feb 5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Feb 5 08:01:03 001db01b pengine[1692]: notice: On loss of CCM Quorum: Ignore

-Eric
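[Editor's note] Since both nodes run in Azure, a short hypervisor pause or SDN hiccup can exceed corosync's default token timeout and produce exactly this "A processor failed, forming new configuration" pattern with no NIC-level error in either guest. A common mitigation — values here are illustrative assumptions, not a tested recommendation — is to give the totem token more headroom in /etc/corosync/corosync.conf:

```
totem {
    version: 2
    # Default token timeout is 1000 ms; cloud guests often need extra
    # headroom to ride out brief hypervisor or network stalls.
    token: 10000
    token_retransmits_before_loss_const: 10
}
```

The file must be identical on both nodes, and corosync must be restarted (or the values reloaded, where supported) for the change to take effect. This does not replace fencing: without stonith, a genuine partition in a 2-node cluster still risks split-brain.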
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
> -Original Message-
> From: Users On Behalf Of Andrei Borzenkov
> Sent: Wednesday, February 5, 2020 12:14 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> 05.02.2020 20:55, Eric Robinson wrote:
> > The two servers 001db01a and 001db01b were up and responsive. Neither
> > had been rebooted and neither was under heavy load. There's no
> > indication in the logs of loss of network connectivity. Any ideas on
> > why both nodes seem to think the other one is at fault?
>
> The very fact that nodes lost connection to each other *is* an indication
> of network problems. Your logs start too late, after the problem had
> already happened.

All the log messages before those are just normal repetitive stuff that always gets logged, even during normal production. The snippet I provided shows the first indication of anything unusual. Also, there is no other indication of network connectivity loss, and both servers are in Azure.

> > (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an
> > option at this time.)
> >
> > Log from 001db01a:
> >
> > Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
> > Feb  5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b attributes for peer loss
> > Feb  5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
> > Feb  5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
> > Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
> > Feb  5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
> > Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
> > Feb  5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > Feb  5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
> > Feb  5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM Quorum: Ignore
> >
> > From 001db01b:
> >
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
> > Feb  5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC (001db01a) is dead
> > Feb  5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
> > Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
> > Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
> > Feb  5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b crmd[1693]: notice: State transition S_NOT_DC -> S_ELECTION
> > Feb  5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a attributes for peer loss
> > Feb  5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer 001db01a
> > Feb  5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
> > Feb  5 08:01:03 001db01b crmd[1693]: notice: State transition S_ELECTION -> S_INTEGRATION
> > Feb  5 08:01:03 001db01b cib[1688
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
Hi Strahil – I can’t prove there was no network loss, but:

1. There were no dmesg indications of ethernet link loss.
2. Other than corosync, there are no other log messages about connectivity issues.
3. Wouldn’t pcsd say something about connectivity loss?
4. Both servers are in Azure.
5. There are many other servers in the same Azure subscription, including other corosync clusters, none of which had issues.

So I guess it’s possible, but it seems unlikely.

--Eric

From: Users On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed; Andrei Borzenkov
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Erik,

what has led you to think that there was no network loss ?

Best Regards,
Strahil Nikolov

On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson <eric.robin...@psmnv.com> wrote:

> -Original Message-
> From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov <arvidj...@gmail.com>; users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> <arvidj...@gmail.com> wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither was under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
Hi Strahil – I think you may be right about the token timeouts being too short. I’ve also noticed that periods of high load can cause drbd to disconnect. What would you recommend for changes to the timeouts?

I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The config is relatively simple. The corosync config looks like this…

totem {
    version: 2
    cluster_name: 001db01ab
    secauth: off
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 001db01a
        nodeid: 1
    }
    node {
        ring0_addr: 001db01b
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

From: Users On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 6:39 PM
To: Cluster Labs - All topics related to open-source clustering welcomed; Andrei Borzenkov
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Andrei,

don't trust Azure so much :D . I've seen stuff that was way more unbelievable.

Can you check whether other systems in the same subnet reported any issues? Yet, pcs most probably won't report any short-term issues.

I have noticed that the RHEL7 defaults for token and consensus are quite small, and any short-term disruption could cause an issue. Actually, when I tested live migration on oVirt, the other hosts fenced the node that was being migrated.

What is your corosync config and OS version ?

Best Regards,
Strahil Nikolov

On Thursday, February 6, 2020 at 01:44:55 GMT+2, Eric Robinson <eric.robin...@psmnv.com> wrote:

Hi Strahil – I can’t prove there was no network loss, but:

1. There were no dmesg indications of ethernet link loss.
2. Other than corosync, there are no other log messages about connectivity issues.
3. Wouldn’t pcsd say something about connectivity loss?
4. Both servers are in Azure.
5. There are many other servers in the same Azure subscription, including other corosync clusters, none of which had issues.

So I guess it’s possible, but it seems unlikely.
--Eric

From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Andrei Borzenkov <arvidj...@gmail.com>
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Erik,

what has led you to think that there was no network loss ?

Best Regards,
Strahil Nikolov
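Incidentally, the token and consensus values corosync is actually running with (defaults included, even when corosync.conf does not set them) can be read from the cmap database. A sketch, assuming corosync 2.x, where the effective values are published under the runtime.config prefix:

```
# Effective totem timers, in milliseconds:
corosync-cmapctl | grep -E 'runtime.config.totem.(token|consensus)'
```

This is a quick way to confirm whether a cluster is really running with the small RHEL7 defaults discussed in this thread.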
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
Hi Nikolov --

> Defaults are 1s token, 1.2s consensus which is too small.
> In Suse, token is 10s, while consensus is 1.2 * token -> 12s.
> With these settings, the cluster will not react for 22s.
>
> I think it's a good start for your cluster .
> Don't forget to put the cluster in maintenance (pcs property set
> maintenance-mode=true) before restarting the stack , or even better - get
> some downtime.
>
> You can use the following article to run a simulation before removing the
> maintenance:
> https://www.suse.com/support/kb/doc/?id=7022764

Thanks for the suggestions. Any thoughts on timeouts for DRBD?

--Eric

Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
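For reference, Strahil's suggested values would land in the totem section like this (a sketch only — the units are milliseconds, and the 10s/12s numbers are taken from his recommendation above, not tested settings for this cluster):

```
totem {
    version: 2
    cluster_name: 001db01ab
    secauth: off
    transport: udpu
    token: 10000        # 10s before a token loss is declared
    consensus: 12000    # 1.2 * token, matching the SUSE-style defaults
}
```

The file would need to be identical on both nodes, and the corosync stack restarted (with the cluster in maintenance mode, as suggested) for the new timers to take effect.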
Re: [ClusterLabs] Antw: [EXT] Re: Why Do Nodes Leave the Cluster?
> > I've done that with all my other clusters, but these two servers are
> > in Azure, so the network is out of our control.
>
> Is a normal cluster supported to use corosync over the Internet? I'm not sure
> (because of the delays and possible packet losses).

As with most things, the main concern is latency and loss. The latency between these two nodes is < 1ms, and loss is always 0%.

--Eric
[ClusterLabs] Verifying DRBD Run-Time Configuration
If I want to know the current DRBD runtime settings such as timeout, ping-int, or connect-int, how do I check that? I'm assuming they may not be the same as what shows in the config file.

--Eric
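The running (in-kernel) settings can be dumped with drbdsetup and compared against the parsed config file. A sketch — `r0` is a placeholder resource name, and the exact flags vary somewhat between DRBD 8.4 and DRBD 9:

```
# Running settings for resource r0, including net options
# such as timeout, ping-int, and connect-int:
drbdsetup show r0

# Also print options still at their compiled-in defaults:
drbdsetup show r0 --show-defaults

# For comparison, the on-disk configuration as drbdadm parses it:
drbdadm dump r0
```

Any difference between the two outputs usually means the config file was changed after the resource was brought up, without a `drbdadm adjust`.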
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
Hi Strahil --

I hope you won't mind if I revive this old question. In your comments below, you suggested using a 1s token with a 1.2s consensus. I currently have 2-node clusters (and will soon install a qdevice). I was reading in the corosync.conf man page where it says...

"For two node clusters, a consensus larger than the join timeout but less than token is safe. For three node or larger clusters, consensus should be larger than token."

Do you still think the consensus should be 1.2 * token in a 2-node cluster? Why is a smaller consensus considered safe for 2-node clusters? Should I use a larger consensus anyway?

--Eric

> -Original Message-
> From: Strahil Nikolov
> Sent: Thursday, February 6, 2020 1:07 PM
> To: Eric Robinson; Cluster Labs - All topics related to open-source
> clustering welcomed; Andrei Borzenkov
> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson wrote:
> >Hi Nikolov --
> >
> >> Defaults are 1s token, 1.2s consensus which is too small.
> >> In Suse, token is 10s, while consensus is 1.2 * token -> 12s.
> >> With these settings, cluster will not react for 22s.
> >>
> >> I think it's a good start for your cluster .
> >> Don't forget to put the cluster in maintenance (pcs property set
> >> maintenance-mode=true) before restarting the stack , or even better
> >- get some downtime.
> >>
> >> You can use the following article to run a simulation before removing
> >the maintenance:
> >> https://www.suse.com/support/kb/doc/?id=7022764
> >
> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
> >
> >--Eric
>
> Hi Eric,
>
> The timeouts can be treated as 'how much time to wait before taking any
> action'. The workload is not very important (HANA is something different).
>
> You can try with 10s (token), 12s (consensus) and if needed you can adjust.
>
> Warning: Use a 3-node cluster or at least 2 drbd nodes + qdisk. The 2-node
> cluster is vulnerable to split brain, especially when one of the nodes is
> syncing (for example after patching) and the source is
> fenced/lost/disconnected. It's very hard to extract data from a semi-synced
> drbd.
>
> Also, if you need guidance for the SELINUX, I can point you to my guide in
> the centos forum.
>
> Best Regards,
> Strahil Nikolov

Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.
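To make the man-page rule concrete, the two sizing choices it describes could be written out like this (a sketch — all values are in milliseconds and purely illustrative, not a recommendation for any particular cluster):

```
totem {
    version: 2
    token: 10000        # 10s token
    join: 50            # the usual default join timeout

    # 2-node rule:  join < consensus < token is considered safe,
    #               e.g. consensus: 9000
    # 3+ node rule: consensus > token,
    #               e.g. consensus: 12000 (1.2 * token)
    consensus: 12000
}
```

Using the larger 1.2 * token value in a 2-node cluster is the conservative choice, since it also remains valid if a third node (or qdevice-backed member) is added later.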
[ClusterLabs] qdevice up and running -- but questions
1. What command can I execute on the qdevice node which tells me which client nodes are connected and alive?

2. In the output of the pcs qdevice status command, what is the meaning of...

   Vote: ACK (ACK)

3. In the output of the pcs quorum status command, what is the meaning of...

   Membership information
   ----------------------
       Nodeid      Votes    Qdevice      Name
            1          1    A,V,NMW      001db03a
            2          1    A,V,NMW      001db03b (local)

--Eric
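For the first question, the corosync-qnetd-tool utility on the qnetd host itself can list connected clients. A sketch, assuming the corosync-qnetd package is installed there (output format varies by version):

```
# List the clusters/nodes currently connected to this qnetd daemon:
corosync-qnetd-tool -l

# Same, with per-node detail (vote info, membership):
corosync-qnetd-tool -l -v

# Overall daemon status summary:
corosync-qnetd-tool -s
```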