Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 2018-02-13 05:46 AM, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains.
> It seems I know [in theory] about fencing and its importance
> (although I've never configured it so far).
> But I don't understand how it would help in situations of a hard
> reboot/shutdown.

An availability cluster's job is to keep things running. To do this, there must be coordination between the nodes (otherwise, just run things everywhere and be done with it). Thus, when a node stops responding, it is critical that the lost node be put into a known state. If you allow assumptions to be made, you will eventually assume wrong. That could have consequences as "minor" as confusing switches/routers, or as devastating as corrupted data.

Fencing is not meant to speed up recovery; it is critical to ensuring recovery works at all. This is a common confusion (and people often mistakenly think that quorum is how you avoid this, which is incorrect). There is no replacement for fencing; you need it in any availability system. Without it, it is like driving without a seat-belt.

https://www.alteeve.com/w/The_2-Node_Myth

>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.
> The EL 7 stack is already not so recent, but it's one of the most stable and
> least vulnerable, I suppose. And I understand the risks.
> I will update pcs to the latest version when I find a bit of free time.
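[Editorial note: as an illustration only, enabling fencing with pcs on an IPMI-capable pair of nodes might look like the sketch below. The agent choice, addresses and credentials are hypothetical and must match your actual hardware; only the node names come from the posted corosync.conf.]

```
# Hypothetical example: IPMI-based fence devices for both nodes.
pcs stonith create fence-main fence_ipmilan \
    pcmk_host_list="main-node" ipaddr="192.0.2.10" \
    login="admin" passwd="secret" lanplus=1
pcs stonith create fence-reserve fence_ipmilan \
    pcmk_host_list="reserve-node" ipaddr="192.0.2.11" \
    login="admin" passwd="secret" lanplus=1
# Re-enable fencing cluster-wide.
pcs property set stonith-enabled=true
```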
-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On Tue, 2018-02-13 at 13:46 +0300, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.

Compiling a newer corosync/pacemaker is a perfectly good solution in this situation, but just to give you more options: you could instead put the app inside a RHEL 6 container, and run it on RHEL 7 cluster hosts. The advantage of that approach is that the rest of your usual system services would be on more modern versions. With bundles (available in the newer pacemaker on RHEL 7), you can use your existing resource agent to launch the service inside the bundle, so the cluster can monitor it (as well as monitoring the container itself).

Similarly, you could create a RHEL 6 VM and run it on RHEL 7 cluster hosts. You can add the remote-node option to the VM resource to be able to launch and monitor the app inside it via its resource agent.

-- 
Ken Gaillot
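[Editorial note: a rough sketch of the VM variant Ken describes, assuming libvirt/qemu. The resource name, VM name, domain XML path and the Dummy placeholder agent are all hypothetical; the `remote-node` meta attribute is what lets Pacemaker manage resources inside the guest.]

```
# Hypothetical example: run a RHEL 6 guest as a cluster resource and
# make it a remote node so the app inside can be managed directly.
pcs resource create rhel6-vm VirtualDomain \
    hypervisor="qemu:///system" \
    config="/etc/libvirt/qemu/rhel6-vm.xml" \
    meta remote-node=rhel6-vm-remote
# The app's own resource agent (Dummy here as a stand-in) can then be
# constrained to run inside the guest:
pcs resource create legacy-app ocf:heartbeat:Dummy
pcs constraint location legacy-app prefers rhel6-vm-remote
```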
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/13/2018 01:28 PM, Maxim wrote:
> 13.02.2018 14:03, Klaus Wenninger wrote:
>> - fencing helps you turn the 'maybe the node is down - it doesn't
>> respond within x milliseconds' into certainty that your node is dead
>> and won't interfere with the rest of the cluster
>>
>> Regards, Klaus
>
> That is clear. But will it force pacemaker to perceive that the node is
> down faster?

Let's put that differently. With fencing you can make the loss-detection more aggressive, and thus more prone to false-positives, without risking a split-brain situation. (Actually, without fencing you can never be really sure the other side is really gone!) But to be honest, if you are really after sub-second detection/switchover, I'm not sure if fencing - at least with the current implementation in pacemaker and the current selection of fencing-devices - will give you satisfactory results.

> [Unfortunately, I've no hardware that implements fencing abilities
> nearby and can't try it myself]

If you don't have any of the usual fencing-devices available, you might have some kind of shared disk that could be usable with SBD. For a 2-node cluster with a single shared disk (as in your case, if I got it correctly), make sure to pick an SBD version that has
https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377.
But again, I doubt that this will work reliably with sub-second requirements.

> [It seems this is the last question from my side on this topic]
>
> Thank you and Ken for the participation!
>
> Regards,
> Maxim

That is not to say I'm not interested in experiences/requirements with pacemaker doing failovers in a sub-second or more relaxed low-single-digit-second timeframe. Seeing this working reliably would open up pacemaker to a completely new class of applications.
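[Editorial note: a minimal sketch of the SBD route Klaus mentions, assuming a watchdog device is present. The disk path is hypothetical; the disk must be small, shared, and visible to both nodes.]

```
# Hypothetical example: initialize a shared disk for SBD on one node ...
sbd -d /dev/disk/by-id/shared-disk-for-sbd create
# ... then on every node point the daemon at it in /etc/sysconfig/sbd:
#   SBD_DEVICE=/dev/disk/by-id/shared-disk-for-sbd
#   SBD_WATCHDOG_DEV=/dev/watchdog
# Finally register the disk-based fence agent with the cluster:
pcs stonith create sbd-fence fence_sbd \
    devices=/dev/disk/by-id/shared-disk-for-sbd
pcs property set stonith-enabled=true
```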
Regards,
Klaus
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/13/2018 11:46 AM, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains.
> It seems I know [in theory] about fencing and its importance
> (although I've never configured it so far).
> But I don't understand how it would help in situations of a hard
> reboot/shutdown.

Actually, in 2 ways:
- you are strongly advised to use fencing - and thus the base of users using fencing is much higher, and strange/unexpected behavior is much more likely with the less-tested setups without fencing
- fencing helps you turn the 'maybe the node is down - it doesn't respond within x milliseconds' into certainty that your node is dead and won't interfere with the rest of the cluster

Regards,
Klaus

>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.
> The EL 7 stack is already not so recent, but it's one of the most stable and
> least vulnerable, I suppose. And I understand the risks.
> I will update pcs to the latest version when I find a bit of free time.
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 2018-02-12 08:15 AM, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
>> Hello,
>>
>> [Sorry for a message duplication. The web mail client ruined the
>> formatting of the previous e-mail =( ]
>>
>> There is a simple configuration of two cluster nodes (built via the RHEL 6
>> pcs interface) with multiple master/slave resources, disabled fencing
>> and a single sync interface.
>
> fencing-disabled is probably due to it being a test-setup ...
> RHEL 6 pcs being made for configuring a cman-pacemaker setup,
> I'm not sure if it is advisable to do a corosync-2 pacemaker
> setup with that. You've obviously edited corosync.conf to
> reflect that ...

Without fencing, all bets are off. Please enable it and see if the issue remains.

Changing EL6 to corosync 2 pushes further into uncharted waters. EL6 should be using the cman plugin with corosync 1. May I ask why you don't use EL7 if you want such a recent stack?

>> All is ok mainly. But there is some problem with the cluster's
>> performance when the master node is powered off (hard): the slave node
>> detects that the master one is down after about 100-3500 ms. And the
>> main question is how to avoid this 3 sec delay that occurs sometimes.
>
> Kind of interesting that you ever get a detection below 2000 ms with the
> token-timeout set to that value. (Given you are doing a hard shutdown
> that doesn't give corosync time to sign off.)
> You've derived these times from the corosync logs!?
>
> Regards,
> Klaus
>
>> On the slave node I have a little script that checks the connection to
>> the master node. It detects a problem of a sync breakage within about
>> 100 ms. But corosync sometimes requires much more time to figure out
>> the situation and mark the master node as offline. It shows 'ok'
>> ring status.
>> If I understand correctly, then
>> 1. the pacemaker actions (crm_resource --move) will not be performed until
>> corosync has refreshed its ring state
>> 2. the detection of a problem (from the corosync side) can be sped up
>> via timeout tuning in corosync.conf
>> 3. there is no way to ask corosync to recheck its ring status or mark a
>> ring as failed manually
>>
>> But maybe I'm missing something.
>>
>> All I want is to move resources faster.
>> In my little script I tried to force the cluster software to move
>> resources to the slave node. But I've had no success so far.
>>
>> Could you please share your thoughts about the situation.
>> Thank you in advance.
>>
>> Cluster software:
>> corosync - 2.4.3
>> pacemaker - 1.1.18
>> libqb - 1.0.2
>>
>> corosync.conf:
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: cluster
>>     transport: udpu
>>     token: 2000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: main-node
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: reserve-node
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>> Regards,
>> Maxim.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/12/2018 04:34 PM, Maxim wrote:
> 12.02.2018 16:15, Klaus Wenninger wrote:
>> On 02/12/2018 01:02 PM, Maxim wrote:
>> fencing-disabled is probably due to it being a test-setup ... RHEL 6
>> pcs being made for configuring a cman-pacemaker setup, I'm not sure if
>> it is advisable to do a corosync-2 pacemaker setup with
>> that. You've obviously edited corosync.conf to reflect that ...
> It is ok. Fencing is not required at this time.
> It works well with the latest stable corosync and pacemaker that were
> built manually (not from RHEL 6 repos).
> And the attached config was generated by this pcs (I've removed the
> 'logging' section from there to decrease the message size).
>
>>> All is ok mainly. But there is some problem with the cluster's
>>> performance when the master node is powered off (hard):
>>> the slave node detects that the master one is down after about
>>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>>> that occurs sometimes.
>>
>> Kind of interesting that you ever get a detection below 2000 ms with
>> the token-timeout set to that value. (Given you are doing a
>> hard shutdown that doesn't give corosync time to sign off.) You've
>> derived these times from the corosync logs!?
>>
>> Regards, Klaus
>
> Not actually. After your message I've conducted some more investigations
> with quite active logging on the master node to get the real time when
> the node is going down. And... you are right. The delay is close to 4
> seconds. So there is a [floating] bug in my script.
> Thank you for your insight, Klaus =)
>
> But nevertheless, is there any mechanism to force the slave corosync "to
> think" that the master corosync is down?
> [I have seen the abilities of corosync-cfgtool but, it seems, it doesn't
> contain similar functionality]
> Or maybe are there some other ways?
Maybe a few notes on the other way ;-)
In general it is not easy to get a reliable answer to the question of whether the other node is down within just, let's say, 100 ms. Think of network hiccups, scheduling issues and the like ... But if you are willing to accept false-positives, you can reduce the token timeout of corosync instead of having another script that tries to do the job corosync is (amongst other things) made for (at least that is how I understood what you are aiming to do).

Regards,
Klaus
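[Editorial note: following Klaus's suggestion, a more aggressive totem section of the posted corosync.conf might look like the fragment below. The 500 ms value is an illustrative choice, not a recommendation; too small a token invites false-positives under load, which is only tolerable when fencing is configured.]

```
totem {
    version: 2
    secauth: off
    cluster_name: cluster
    transport: udpu
    # Smaller token => faster loss detection, but more false-positives.
    token: 500
}
```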
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/12/2018 01:02 PM, Maxim wrote:
> Hello,
>
> [Sorry for a message duplication. The web mail client ruined the
> formatting of the previous e-mail =( ]
>
> There is a simple configuration of two cluster nodes (built via the RHEL 6
> pcs interface) with multiple master/slave resources, disabled fencing
> and a single sync interface.

fencing-disabled is probably due to it being a test-setup ... RHEL 6 pcs being made for configuring a cman-pacemaker setup, I'm not sure if it is advisable to do a corosync-2 pacemaker setup with that. You've obviously edited corosync.conf to reflect that ...

> All is ok mainly. But there is some problem with the cluster's
> performance when the master node is powered off (hard): the slave node
> detects that the master one is down after about 100-3500 ms. And the
> main question is how to avoid this 3 sec delay that occurs sometimes.

Kind of interesting that you ever get a detection below 2000 ms with the token-timeout set to that value. (Given you are doing a hard shutdown that doesn't give corosync time to sign off.) You've derived these times from the corosync logs!?

Regards,
Klaus

> On the slave node I have a little script that checks the connection to
> the master node. It detects a problem of a sync breakage within about
> 100 ms. But corosync sometimes requires much more time to figure out
> the situation and mark the master node as offline. It shows 'ok'
> ring status.
>
> If I understand correctly, then
> 1. the pacemaker actions (crm_resource --move) will not be performed until
> corosync has refreshed its ring state
> 2. the detection of a problem (from the corosync side) can be sped up
> via timeout tuning in corosync.conf
> 3. there is no way to ask corosync to recheck its ring status or mark a
> ring as failed manually
>
> But maybe I'm missing something.
>
> All I want is to move resources faster.
> In my little script I tried to force the cluster software to move
> resources to the slave node.
> But I've had no success so far.
>
> Could you please share your thoughts about the situation.
> Thank you in advance.
>
> Cluster software:
> corosync - 2.4.3
> pacemaker - 1.1.18
> libqb - 1.0.2
>
> corosync.conf:
> totem {
>     version: 2
>     secauth: off
>     cluster_name: cluster
>     transport: udpu
>     token: 2000
> }
>
> nodelist {
>     node {
>         ring0_addr: main-node
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: reserve-node
>         nodeid: 2
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> Regards,
> Maxim.
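[Editorial note: the ~4 s worst case measured elsewhere in the thread is consistent with corosync's defaults: when `consensus` is not set in corosync.conf it defaults to 1.2 × `token`, and a membership change is only declared after both expire. A rough back-of-the-envelope estimate, not an exact corosync formula:]

```shell
# Rough worst-case detection estimate with the posted corosync.conf.
# consensus defaults to 1.2 * token when not set explicitly.
token_ms=2000
consensus_ms=$(( token_ms * 12 / 10 ))
detect_ms=$(( token_ms + consensus_ms ))
echo "worst-case detection: ${detect_ms} ms"   # prints 4400 ms, close to the observed ~4 s
```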