Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Digimer
On 2018-02-13 05:46 AM, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains.
> It seems i know [in theory] about the fencing ability and its importance
> (although I've never configured it so far).
> But i don't understand how it would help in situations of a hard
> reboot/shutdown.

An availability cluster's job is to keep things running. To do this,
there must be coordination between the nodes (otherwise, just run things
everywhere and be done with it). Thus, when a node stops responding, it
is critical that the lost node be put into a known state.

If you allow assumptions to be made, you will eventually assume wrong.
That could have consequences ranging from as "minor" as confused
switches/routers to as devastating as corrupted data.

Fencing is not meant to speed up recovery; it is critical to ensuring
recovery works at all.

This is a common confusion (and people often mistakenly think that
quorum is how you avoid this, which is incorrect). There is no
replacement for fencing; you need it in any availability system.
Running without it is like driving without a seat-belt.

https://www.alteeve.com/w/The_2-Node_Myth

>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.
> The EL 7 stack is already not so recent, but it's one of the most stable
> and least vulnerable, i suppose. And i understand the risks.
> I will update pcs to the latest version when i find a bit of free time.
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Ken Gaillot
On Tue, 2018-02-13 at 13:46 +0300, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.

Compiling a newer corosync/pacemaker is a perfectly good solution in
this situation, but just to give you more options:

You could instead put the app inside a RHEL 6 container, and run it on
RHEL 7 cluster hosts. The advantage of that approach is that the rest
of your usual system services would be on more modern versions. With
bundles (available in the newer pacemaker on RHEL 7), you can use your
existing resource agent to launch the service inside the bundle, so the
cluster can monitor it (as well as monitoring the container itself).

Similarly, you could create a RHEL 6 VM and run it on RHEL 7 cluster
hosts. You can add the remote-node option to the VM resource, to be
able to launch and monitor the app inside it via its resource agent.
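As a rough illustration of the VM option (the resource names and config path here are hypothetical, and the exact syntax depends on your pcs version), a guest node could be set up along these lines on the RHEL 7 hosts:

```
# Define the RHEL 6 VM as a cluster resource and mark it as a guest
# node via the remote-node meta attribute (requires pacemaker_remote
# running inside the VM).
pcs resource create rhel6-vm ocf:heartbeat:VirtualDomain \
    config=/etc/libvirt/qemu/rhel6-vm.xml \
    meta remote-node=rhel6-guest

# The app can then be managed with its existing resource agent and
# pinned to the guest node, where the cluster will monitor it.
pcs resource create my-app ocf:heartbeat:my-app-agent
pcs constraint location my-app prefers rhel6-guest
```

Treat this as a sketch rather than a recipe; the Pacemaker Remote documentation covers the details.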
-- 
Ken Gaillot 


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Klaus Wenninger
On 02/13/2018 01:28 PM, Maxim wrote:
> 13.02.2018 14:03, Klaus Wenninger wrote:
>> - fencing helps you turn the 'maybe the node is down - it doesn't
>> respond within x milliseconds' into certainty that your node is dead
>> and won't interfere with the rest of the cluster
>>
>> Regards, Klaus
>
> That is clear. But will it make pacemaker detect that the node is
> down any faster?

Let's put that differently. With fencing you can make the loss-detection
more aggressive, and thus more prone to false positives, without risking
a split-brain situation. (Actually, without fencing you can never be
really sure that the other side is really gone!)
But to be honest, if you are really after sub-second detection/switchover,
I'm not sure that fencing - at least with the current implementation in
pacemaker and the current selection of fencing devices - will
give you satisfactory results.

> [Unfortunately, I have no hardware nearby that implements fencing
> abilities and can't try it myself]

If you don't have any of the usual fencing devices available, you might
have some kind of shared disk that is usable with SBD.
For a 2-node cluster with a single shared disk (as in your case, if I got
it correctly), make sure to pick an SBD version that includes
https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377.
But again, I doubt that this will work reliably with sub-second requirements.
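For reference, disk-based SBD is typically wired up along these lines (the device path is a placeholder and the pcs syntax varies between versions - a sketch, not a recipe):

```
# Initialize the shared disk for SBD (run once, from one node).
sbd -d /dev/disk/by-id/my-shared-disk create

# Enable SBD cluster-wide; newer pcs versions can do this directly.
pcs stonith sbd enable device=/dev/disk/by-id/my-shared-disk

# Fencing must be enabled for SBD to be of any use.
pcs property set stonith-enabled=true
```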

>
> [It seems this is the last question from my side devoted to this
> topic]
>
> Thank you and Ken for the participation!
>
> Regards,
> Maxim

That's not to say I'm not interested in experiences/requirements with
pacemaker doing failovers in a sub-second or a more relaxed
low-single-digit-second timeframe.
Seeing this work reliably would open up pacemaker to a
completely new class of applications.

Regards,
Klaus



Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Klaus Wenninger
On 02/13/2018 11:46 AM, Maxim wrote:
> 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains.
> It seems i know [in theory] about the fencing ability and its importance
> (although I've never configured it so far).
> But i don't understand how it would help in situations of a hard
> reboot/shutdown.

Actually, in two ways:

- you are strongly advised to use fencing - thus the base of users using
  fencing is much larger, and strange/unexpected behavior is much more
  likely with the less-tested setups without fencing
- fencing helps you turn the 'maybe the node is down - it doesn't respond
  within x milliseconds' into certainty that your node is dead and won't
  interfere with the rest of the cluster

Regards,
Klaus
>
>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.
> The EL 7 stack is already not so recent, but it's one of the most stable
> and least vulnerable, i suppose. And i understand the risks.
> I will update pcs to the latest version when i find a bit of free time.


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Digimer
On 2018-02-12 08:15 AM, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
>> Hello,
>>
>> [Sorry for a message duplication. Web mail client ruined the
>> formatting of the previous e-mail =( ]
>>
>> There is a simple configuration of two cluster nodes (built via RHEL 6
>> pcs interface) with multiple master/slave resources, disabled fencing
>> and the single sync interface.
> 
> fencing-disabled is probably due to it being a test-setup ...
> RHEL 6 pcs is made for configuring a cman-pacemaker setup, so
> I'm not sure it is advisable to do a setup for a corosync-2 pacemaker
> setup with that. You've obviously edited corosync.conf to
> reflect that ...

Without fencing, all bets are off. Please enable it and see if the issue
remains.

Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
should be using the cman plugin with corosync 1. May I ask why you
don't use EL7 if you want such a recent stack?

>> All is ok mainly. But there is some problem with the cluster activity
>> performance when the master node is powered off (hard): the slave node
>> detects that the master one is down after about 100-3500 ms. And the
>> main question is how to avoid this 3 sec delay that occurs sometimes.
> 
> Kind of interesting that you ever get a detection below 2000ms with the
> token-timeout set to that value. (Given you are doing a hard-shutdown
> that doesn't give corosync time to sign off.)
> You've derived these times from the corosync-logs!?
> 
> Regards,
> Klaus
> 
>>
>> On the slave node i have a little script that checks the connection to
>> the master node. It detects a problem of a sync breakage within about
>> 100 ms. But corosync sometimes requires much more time to figure out
>> the situation and mark the master node as offline. It shows an 'ok'
>> ring status.
>>
>> If i understand correctly, then
>> 1. pacemaker actions (crm_resource --move) will not be performed until
>> corosync has refreshed its ring state
>> 2. the detection of a problem (from the corosync side) can be sped up
>> via timeout tuning in corosync.conf
>> 3. there is no way to ask corosync to recheck its ring status or mark a
>> ring as failed manually
>>
>> But maybe i'm missing something.
>>
>> All i want is to move resources faster.
>> In my little script i tried to force the cluster software to move
>> resources to the slave node. But i've no success so far.
>>
>> Could you please share your thoughts about the situation.
>> Thank you in advance.
>>
>>
>> Cluster software:
>> corosync - 2.4.3
>> pacemaker - 1.1.18
>> libqb - 1.0.2
>>
>>
>> corosync.conf:
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: cluster
>>     transport: udpu
>>     token: 2000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: main-node
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: reserve-node
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>>
>> Regards,
>> Maxim.
>>


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Klaus Wenninger
On 02/12/2018 04:34 PM, Maxim wrote:
> 12.02.2018 16:15, Klaus Wenninger wrote:
>> On 02/12/2018 01:02 PM, Maxim wrote:
>> fencing-disabled is probably due to it being a test-setup ... RHEL 6
>> pcs is made for configuring a cman-pacemaker setup, so I'm not sure
>> it is advisable to do a setup for a corosync-2 pacemaker setup with
>> that. You've obviously edited corosync.conf to reflect that ...
> It is ok. Fencing is not required at the time.
> It works well with the latest stable corosync and pacemaker that were
> built manually (not from the RHEL 6 repos).
> And the attached config was generated by this pcs (i've removed the
> 'logging' section from it to decrease the message size).
>
>>> All is ok mainly. But there is some problem with the cluster
>>> activity performance when the master node is powered off (hard):
>>> the slave node detects that the master one is down after about
>>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>>> that occurs sometimes.
>>
>> Kind of interesting that you ever get a detection below 2000ms with
>> the token-timeout set to that value. (Given you are doing a
>> hard-shutdown that doesn't give corosync time to sign off.) You've
>> derived these times from the corosync-logs!?
>>
>> Regards, Klaus
>
> Not actually. After your message i've conducted some more investigations
> with quite active logging on the master node to get the real time when
> the node goes down. And... you are right. The delay is close to 4
> seconds. So there is a [floating] bug in my script.
> Thank you for your insight, Klaus =)
>
> But nevertheless, is there any mechanism to force the slave corosync "to
> think" that the master corosync is down?
> [I have seen the abilities of corosync-cfgtool but, it seems, it doesn't
> contain similar functionality]
> Or maybe are there some other ways?

Maybe a few notes on the other way ;-)
In general it is not easy to get a reliable answer
to the question whether the other node is down within,
let's say, just 100ms.
Think of network hiccups, scheduling issues and
the like ...
But if you are willing to accept false positives,
you can reduce the token timeout of corosync
instead of having another script that tries to do
the job corosync is (amongst other things) made
for (at least that is how I understood what you
are aiming to do).
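Applied to the corosync.conf posted earlier, that trade-off might look like this (the values are illustrative, not recommendations - see the corosync.conf(5) man page for the exact semantics in your version):

```
totem {
  version: 2
  secauth: off
  cluster_name: cluster
  transport: udpu
  # A lower token timeout means faster failure detection, but a higher
  # risk of declaring a busy or briefly unreachable node dead.
  token: 500
  # Retransmit attempts before the token is declared lost; tune
  # together with the token timeout.
  token_retransmits_before_loss_const: 4
}
```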

Regards,
Klaus

>
> Regards, Maxim
>
>


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Klaus Wenninger
On 02/12/2018 01:02 PM, Maxim wrote:
> Hello,
>
> [Sorry for a message duplication. Web mail client ruined the
> formatting of the previous e-mail =( ]
>
> There is a simple configuration of two cluster nodes (built via RHEL 6
> pcs interface) with multiple master/slave resources, disabled fencing
> and the single sync interface.

fencing-disabled is probably due to it being a test-setup ...
RHEL 6 pcs is made for configuring a cman-pacemaker setup, so
I'm not sure it is advisable to do a setup for a corosync-2 pacemaker
setup with that. You've obviously edited corosync.conf to
reflect that ...
>
> All is ok mainly. But there is some problem with the cluster activity
> performance when the master node is powered off (hard): the slave node
> detects that the master one is down after about 100-3500 ms. And the
> main question is how to avoid this 3 sec delay that occurs sometimes.

Kind of interesting that you ever get a detection below 2000ms with the
token-timeout set to that value. (Given you are doing a hard-shutdown
that doesn't give corosync time to sign off.)
You've derived these times from the corosync-logs!?

Regards,
Klaus

>
> On the slave node i have a little script that checks the connection to
> the master node. It detects a problem of a sync breakage within about
> 100 ms. But corosync sometimes requires much more time to figure out
> the situation and mark the master node as offline. It shows an 'ok'
> ring status.
>
> If i understand correctly, then
> 1. pacemaker actions (crm_resource --move) will not be performed until
> corosync has refreshed its ring state
> 2. the detection of a problem (from the corosync side) can be sped up
> via timeout tuning in corosync.conf
> 3. there is no way to ask corosync to recheck its ring status or mark a
> ring as failed manually
>
> But maybe i'm missing something.
>
> All i want is to move resources faster.
> In my little script i tried to force the cluster software to move
> resources to the slave node. But i've had no success so far.
>
> Could you please share your thoughts about the situation.
> Thank you in advance.
>
>
> Cluster software:
> corosync - 2.4.3
> pacemaker - 1.1.18
> libqb - 1.0.2
>
>
> corosync.conf:
> totem {
>     version: 2
>     secauth: off
>     cluster_name: cluster
>     transport: udpu
>     token: 2000
> }
>
> nodelist {
>     node {
>         ring0_addr: main-node
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: reserve-node
>         nodeid: 2
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
>
> Regards,
> Maxim.
>