Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?
Thanks for the suggestion, everyone. I'll give that a try.

> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Monday, February 12, 2018 8:49 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed; Eric Robinson
> Subject: Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?
>
> Eric,
>
> > General question. I tried to set up a cman + corosync + pacemaker
> > cluster using two corosync rings. When I start the cluster, everything
> > works fine, except when I do a 'corosync-cfgtool -s' it only shows one
> > ring. I tried manually editing the /etc/cluster/cluster.conf file
> > adding two
>
> AFAIK cluster.conf should be edited so that altname is used, as in
> this example:
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/cluster_administration/s1-config-rrp-cli-ca
>
> I don't think you have to add altmulticast.
>
> Honza
>
> > sections, but then cman complained that I didn't have a multicast address
> > specified, even though I did. I tried editing the /etc/corosync/corosync.conf
> > file, and then I could get two rings, but the nodes would not both join the
> > cluster. Bah! I did some reading and saw that cman didn't support multiple
> > rings years ago. Did it never get updated?

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
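Honza's pointer is to the RHEL 6 procedure for configuring redundant rings, where each `<clusternode>` in cluster.conf gets an `<altname>` child naming that node's address on the second ring. As a minimal sketch, assuming hypothetical host names (`node1-ring2` and so on resolve to the second-ring addresses), the following uses Python's standard-library `xml.etree.ElementTree` to emit such `<clusternodes>` entries. It is not a complete cluster.conf (no fencing, no `config_version` bump), so check the linked documentation before relying on anything like it.

```python
import xml.etree.ElementTree as ET

# Hypothetical node list: (ring 0 host name, ring 1 host name, nodeid).
NODES = [
    ("node1", "node1-ring2", 1),
    ("node2", "node2-ring2", 2),
]


def build_clusternodes(nodes):
    """Return a <clusternodes> element with one <altname> child per node."""
    clusternodes = ET.Element("clusternodes")
    for name, altname, nodeid in nodes:
        node = ET.SubElement(clusternodes, "clusternode",
                             name=name, nodeid=str(nodeid))
        # <altname> tells cman which name/address this node uses on ring 1.
        ET.SubElement(node, "altname", name=altname)
    return clusternodes


if __name__ == "__main__":
    print(ET.tostring(build_clusternodes(NODES), encoding="unicode"))
```

Generating the fragment rather than hand-editing it keeps the ring 0 and ring 1 names paired per node, which is the part that is easy to get wrong when editing by hand.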
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
13.02.2018 16:41, Klaus Wenninger wrote:
> Let's put that differently. With fencing you can make the
> loss-detection more aggressive and thus more prone to false-positives
> without risking a split-brain situation. (Actually, without fencing
> you can never be really sure that the other side is really gone!) But
> to be honest, if you are really after sub-second
> detection/switchover, I'm not sure fencing - at least with the
> current implementation in pacemaker and the current selection of
> fencing-devices - will give you satisfactory results.
>
>> [Unfortunately, I don't have any hardware with fencing
>> capabilities nearby, so I can't try it myself]
>
> If you don't have any of the usual fencing-devices available, you
> might have some kind of shared disk that could be usable with SBD.
> For a 2-node cluster with a single shared disk (as in your case, if I
> got it correctly), make sure to pick an SBD version that includes
> https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377.
>
> But again, I doubt that this will work reliably with sub-second requirements.
> That's not to say I'm not interested in experiences/requirements with
> pacemaker doing failovers in a sub-second or more relaxed
> low-single-digit-second timeframe. Seeing this work reliably would
> open up pacemaker to a completely new class of applications.
>
> Regards, Klaus

I was sceptical too, especially when I saw the monitor intervals of the pacemaker resource agents :D So sub-second timings for failure detection and resource moves seem to be unachievable with the facilities of the current cluster software, for architectural reasons as well.

Thank you for the support and suggestions.
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 2018-02-13 05:46 AM, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains
> It seems I know [in theory] about fencing and its importance
> (although I've never configured it so far).
> But I don't understand how it would help in the situations of a hard
> reboot/shutdown.

An availability cluster's job is to keep things running. To do this, there must be coordination between the nodes (otherwise, just run things everywhere and be done with it). Thus, when a node stops responding, it is critical that the lost node be put into a known state. If you allow assumptions to be made, you will eventually assume wrong. That could have consequences as "minor" as confusing switches/routers or as devastating as corrupted data.

Fencing is not meant to speed up recovery; it is critical to ensuring that recovery works at all. This is a common confusion (and people often mistakenly think that quorum is how you avoid this, which is incorrect). There is no replacement for fencing; you need it in any availability system. Running without it is like driving without a seat-belt.

https://www.alteeve.com/w/The_2-Node_Myth

>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for RHEL 6-like OSes and has to be installed on the cluster node.
> The EL 7 stack is no longer so recent, but I suppose it's one of the most
> stable and least vulnerable. And I understand the risks.
> I will update pcs to the latest version when I find a bit of free time.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
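Digimer's point, that recovery must wait until the lost node is in a known state, can be illustrated with a toy model. The sketch below is not Pacemaker code; the `Peer` class, the `fence` callback and the outcome strings are hypothetical. It only encodes the decision rule: a silent peer is "unknown", and its resources are started elsewhere only after fencing confirms it is off. Without fencing, the cluster can only act on an assumption; if fencing fails, it waits rather than guessing (real Pacemaker retries fencing in that case).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Peer:
    name: str
    responding: bool


def decide_recovery(peer: Peer, fence: Callable[[str], bool],
                    fencing_enabled: bool = True) -> str:
    """Toy model of recovery after a peer goes silent (not Pacemaker's code)."""
    if peer.responding:
        return "no-action"
    if not fencing_enabled:
        # The cluster can only assume the silent peer is gone; if it is merely
        # hung or cut off, both sides may now run the same resources.
        return "recover-on-assumption"
    if fence(peer.name):
        # Fencing confirmed the peer is off, so starting its resources is safe.
        return "recover-resources"
    # Peer state is still unknown: block (and retry fencing) instead of guessing.
    return "block"


if __name__ == "__main__":
    print(decide_recovery(Peer("node2", responding=False), fence=lambda name: True))
```

Note that fencing never makes the "silent" branch fire sooner; it only replaces the assumption with a confirmed state, which matches the point that fencing is about recovery working at all rather than recovery speed.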
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On Tue, 2018-02-13 at 13:46 +0300, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
> > should be using the cman plugin with corosync 1. May I ask why you
> > don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is
> built for RHEL 6-like OSes and has to be installed on the cluster node.

Compiling a newer corosync/pacemaker is a perfectly good solution in this situation, but just to give you more options:

You could instead put the app inside a RHEL 6 container and run it on RHEL 7 cluster hosts. The advantage of that approach is that the rest of your usual system services would be on more modern versions. With bundles (available in the newer pacemaker on RHEL 7), you can use your existing resource agent to launch the service inside the bundle, so the cluster can monitor it (as well as monitoring the container itself).

Similarly, you could create a RHEL 6 VM and run it on RHEL 7 cluster hosts. You can add the remote-node option to the VM resource to be able to launch and monitor the app inside it via its resource agent.
-- 
Ken Gaillot
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/13/2018 01:28 PM, Maxim wrote:
> 13.02.2018 14:03, Klaus Wenninger wrote:
>> - fencing helps you turn the 'maybe the node is down - it doesn't
>> respond within x milli-seconds' into certainty that your node is dead
>> and won't interfere with the rest of the cluster
>>
>> Regards, Klaus
> It is clear. But will it force pacemaker to perceive that the node is
> down faster?

Let's put that differently. With fencing you can make the loss-detection more aggressive and thus more prone to false-positives without risking a split-brain situation. (Actually, without fencing you can never be really sure that the other side is really gone!) But to be honest, if you are really after sub-second detection/switchover, I'm not sure fencing - at least with the current implementation in pacemaker and the current selection of fencing-devices - will give you satisfactory results.

> [Unfortunately, I don't have any hardware with fencing capabilities
> nearby, so I can't try it myself]

If you don't have any of the usual fencing-devices available, you might have some kind of shared disk that could be usable with SBD. For a 2-node cluster with a single shared disk (as in your case, if I got it correctly), make sure to pick an SBD version that includes https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377.

But again, I doubt that this will work reliably with sub-second requirements.

> [This seems to be my last question on this topic]
>
> Thank you and Ken for your participation!
>
> Regards,
> Maxim

That's not to say I'm not interested in experiences/requirements with pacemaker doing failovers in a sub-second or more relaxed low-single-digit-second timeframe. Seeing this work reliably would open up pacemaker to a completely new class of applications.

Regards,
Klaus
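Klaus's doubt about sub-second failover with SBD can be made concrete with rough arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it assumes detection takes about corosync's `token` plus `consensus` timeouts (corosync.conf(5) derives `consensus` as 1.2 × `token` when it is unset), that SBD poison-pill fencing is considered complete only after `msgwait` (conventionally twice the SBD watchdog timeout), and that the resource then needs some start time. The `failover_budget` helper and all input values are hypothetical, and Pacemaker's own scheduling overhead is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverBudget:
    detection_ms: int  # corosync token loss plus membership consensus (rough)
    fencing_ms: int    # SBD msgwait before the victim is considered fenced
    start_ms: int      # resource start time on the surviving node

    @property
    def total_ms(self) -> int:
        return self.detection_ms + self.fencing_ms + self.start_ms


def failover_budget(token_ms: int, watchdog_s: int, start_ms: int,
                    consensus_ms: Optional[int] = None,
                    msgwait_s: Optional[int] = None) -> FailoverBudget:
    """Hypothetical estimate of node-loss-to-resource-running time."""
    if consensus_ms is None:
        # Assumption: consensus defaults to 1.2 * token (integer arithmetic).
        consensus_ms = token_ms * 12 // 10
    if msgwait_s is None:
        # Assumption: msgwait is twice the SBD watchdog timeout.
        msgwait_s = 2 * watchdog_s
    return FailoverBudget(detection_ms=token_ms + consensus_ms,
                          fencing_ms=msgwait_s * 1000,
                          start_ms=start_ms)


if __name__ == "__main__":
    # Hypothetical inputs: 1 s token, 5 s watchdog, 500 ms application start.
    print(failover_budget(token_ms=1000, watchdog_s=5, start_ms=500).total_ms)
    # An aggressive (false-positive-prone) tuning still stays above one second.
    print(failover_budget(token_ms=200, watchdog_s=1, start_ms=100).total_ms)
```

Under these assumptions, the fencing wait dominates the total, which is consistent with Klaus's expectation that low-single-digit seconds is a more realistic target than sub-second.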
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
13.02.2018 14:03, Klaus Wenninger wrote:
> - fencing helps you turn the 'maybe the node is down - it doesn't
> respond within x milli-seconds' into certainty that your node is dead
> and won't interfere with the rest of the cluster
>
> Regards, Klaus

It is clear. But will it force pacemaker to perceive that the node is down faster?

[Unfortunately, I don't have any hardware with fencing capabilities nearby, so I can't try it myself]

[This seems to be my last question on this topic]

Thank you and Ken for your participation!

Regards,
Maxim
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/13/2018 11:46 AM, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains
> It seems I know [in theory] about fencing and its importance
> (although I've never configured it so far).
> But I don't understand how it would help in the situations of a hard
> reboot/shutdown.

Actually, in 2 ways:
- you are strongly advised to use fencing, so the base of users running with fencing is much larger, and strange/unexpected behavior is much more likely in the less tested setups without fencing
- fencing helps you turn the 'maybe the node is down - it doesn't respond within x milli-seconds' into certainty that your node is dead and won't interfere with the rest of the cluster

Regards, Klaus

>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for RHEL 6-like OSes and has to be installed on the cluster node.
> The EL 7 stack is no longer so recent, but I suppose it's one of the most
> stable and least vulnerable. And I understand the risks.
> I will update pcs to the latest version when I find a bit of free time.
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
12.02.2018 19:31, Digimer wrote:
> Without fencing, all bets are off. Please enable it and see if the
> issue remains

It seems I know [in theory] about fencing and its importance (although I've never configured it so far). But I don't understand how it would help in the situations of a hard reboot/shutdown.

> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
> should be using the cman plugin with corosync 1. May I ask why you
> don't use EL7 if you want such a recent stack?

For historical reasons, let's say. I have other software that is built for RHEL 6-like OSes and has to be installed on the cluster node. The EL 7 stack is no longer so recent, but I suppose it's one of the most stable and least vulnerable. And I understand the risks.

I will update pcs to the latest version when I find a bit of free time.
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
12.02.2018 18:46, Klaus Wenninger wrote:
> Maybe a few notes on the other way ;-) In general it is not easy to
> have a reliable answer to the question of whether the other node is down
> within, let's say, just 100ms. Think of network hiccups, scheduling
> issues and the like ... But if you are willing to accept
> false-positives, you can reduce the token timeout of corosync instead
> of having another script that tries to do the job corosync is (amongst
> other things) made for. (At least that is how I understood what you
> are aiming to do.)
>
> Regards, Klaus

Thank you again, Klaus. Your description helps me understand the situation better (I've been overworking a bit and couldn't think this not-so-trivial thing through by myself =)).

[I have a scenario in mind where the ability to mark a corosync ring as failed would be useful, but it doesn't relate to this topic. Implementing it on the corosync side would require some additional functionality for "checking" rings (let's call them that), which would be used only for network checking, not for cluster data synchronization. The breakage of all "checking" rings (or some more elaborate logic) would then indicate that the node is down or that there is a split brain. Just an idea.]

No such ability? OK, I'll try to deal with it )
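Klaus's trade-off (a shorter corosync token timeout notices a dead peer sooner, but also turns more transient stalls into false "node lost" events) can be shown with a toy count. The stall durations below are made-up illustration values, not measurements, and the model deliberately ignores how real corosync retransmits the token before declaring loss; it simply counts stalls at least as long as the timeout.

```python
def false_positives(stalls_ms, token_ms):
    """Count stalls at least as long as the token timeout (toy model)."""
    return sum(1 for stall in stalls_ms if stall >= token_ms)


# Made-up transient stall durations (network hiccups, scheduling delays), in ms.
HYPOTHETICAL_STALLS_MS = [15, 40, 120, 90, 300, 60, 1500, 25]

if __name__ == "__main__":
    for token_ms in (100, 500, 1000, 3000):
        print(token_ms, false_positives(HYPOTHETICAL_STALLS_MS, token_ms))
```

With a 100 ms timeout, three of these hypothetical stalls would be mistaken for a dead peer; at 3000 ms, none would. This is why a more aggressive timeout is only safe when fencing resolves the resulting false positives.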