[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Maxim

On 13.02.2018 16:41, Klaus Wenninger wrote:

> Let's put that differently. With fencing you can make the
> loss-detection more aggressive and thus more prone to false-positives
> without risking a split-brain situation. (Actually without fencing
> you can never be really sure if the other side is really gone!) But
> to be honest, if you are really after sub-second
> detection/switchover, I'm not sure if fencing - at least with the
> current implementation in pacemaker and the current selection of
> fencing-devices - will give you satisfactory results.
>
>> [Unfortunately, I have no hardware nearby that implements fencing
>> abilities and can't try it myself]
>
> If you don't have any of the usual fencing-devices available you
> might have some kind of a shared-disk that might be usable with SBD.
> For a 2-node-cluster with a single shared-disk (as in your case, if I
> got it correctly) make sure to pick an SBD-version that has
> https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377.
>
> But again I doubt that this will work reliably with sub-second
> requirements.

> Not saying I'm not interested in experiences/requirements with
> pacemaker doing failovers in a sub-second or more relaxed
> low-single-digit-second timeframe. Seeing this working reliably would
> open up pacemaker for a completely new class of applications.
>
> Regards, Klaus

I was sceptical too... especially when I saw the monitor
intervals of pacemaker resource agents :D
So sub-second timings for issue detection and resource moves seem
unachievable with the facilities of the current cluster software - and
for architectural reasons as well.

Thank you for the support and proposals.


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Digimer
On 2018-02-13 05:46 AM, Maxim wrote:
> On 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains
> Seems, I know [in theory] about the fencing ability and its importance
> (although I've never configured it so far).
> But I don't understand how it would help in the situations of a hard
> reboot/shutdown.

An availability cluster's job is to keep things running. To do this,
there must be coordination between the nodes (otherwise, just run things
everywhere and be done with it). Thus, when a node stops responding, it
is critical that the lost node be put into a known state.

If you allow assumptions to be made, you will eventually assume wrong.
That could have consequences ranging from something as "minor" as
confused switches/routers to something as devastating as corrupted data.

Fencing is not meant to speed up recovery; it is critical to ensuring
recovery works at all.

This is a common confusion (and people often mistakenly think that
quorum is how you avoid this, which is incorrect). There is no
replacement for fencing; you need it in any availability system.
Running without it is like driving without a seat-belt.

https://www.alteeve.com/w/The_2-Node_Myth
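
For illustration, a minimal IPMI-based fence setup via pcs could look
roughly like this (a sketch only - addresses and credentials are
placeholders, not from this thread):

pcs stonith create fence-main fence_ipmilan \
    pcmk_host_list="main-node" ipaddr="192.0.2.10" \
    login="admin" passwd="secret" lanplus=1
pcs stonith create fence-reserve fence_ipmilan \
    pcmk_host_list="reserve-node" ipaddr="192.0.2.11" \
    login="admin" passwd="secret" lanplus=1
pcs property set stonith-enabled=true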

>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.
> The EL 7 stack is already not so recent, but it's one of the most stable
> and least vulnerable, I suppose. And I understand the risks.
> I will update pcs to the latest version when I find a bit of free time.


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Ken Gaillot
On Tue, 2018-02-13 at 13:46 +0300, Maxim wrote:
> On 12.02.2018 19:31, Digimer wrote:
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.

Compiling a newer corosync/pacemaker is a perfectly good solution in
this situation, but just to give you more options:

You could instead put the app inside a RHEL 6 container, and run it on
RHEL 7 cluster hosts. The advantage of that approach is that the rest
of your usual system services would be on more modern versions. With
bundles (available in the newer pacemaker on RHEL 7), you can use your
existing resource agent to launch the service inside the bundle, so the
cluster can monitor it (as well as monitoring the container itself).

Similarly, you could create a RHEL 6 VM and run it on RHEL 7 cluster
hosts. You can add the remote-node option to the VM resource, to be
able to launch and monitor the app inside it via its resource agent.
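
As a rough sketch of the VM variant (resource names and paths here are
made up for illustration), assuming the libvirt-based VirtualDomain
agent and pacemaker-remote running inside the guest:

pcs resource create vm-rhel6 VirtualDomain \
    hypervisor="qemu:///system" \
    config="/etc/libvirt/qemu/rhel6-guest.xml" \
    meta remote-node=rhel6-guest
# the guest then appears as node "rhel6-guest", and the app's existing
# agent (here a hypothetical resource "my-app") can run there:
pcs constraint location my-app prefers rhel6-guest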
-- 
Ken Gaillot 


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Klaus Wenninger
On 02/13/2018 01:28 PM, Maxim wrote:
> On 13.02.2018 14:03, Klaus Wenninger wrote:
>> - fencing helps you turn the 'maybe the node is down - it doesn't
>> respond within x milli-seconds' into certainty that your node is dead
>> and won't interfere with the rest of the cluster
>>
>> Regards, Klaus
>
> It is clear. But will it force pacemaker to perceive that the node is
> down faster?

Let's put that differently. With fencing you can make the loss-detection
more aggressive and thus more prone to false-positives without risking a
split-brain situation. (Actually without fencing you can never be really
sure if the other side is really gone!)
But to be honest, if you are really after sub-second detection/switchover,
I'm not sure if fencing - at least with the current implementation in
pacemaker and the current selection of fencing-devices - will
give you satisfactory results.

> [Unfortunately, I have no hardware nearby that implements fencing
> abilities and can't try it myself]

If you don't have any of the usual fencing-devices available you might
have some kind of a shared-disk that might be usable with SBD.
For a 2-node-cluster with a single shared-disk (as in your case, if I
got it correctly) make sure to pick an SBD-version that has
https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377.
But again I doubt that this will work reliably with sub-second requirements.
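
A rough sketch of such a disk-based SBD setup (the device path and the
fence-agent choice are assumptions, not a tested recipe):

sbd -d /dev/disk/by-id/my-shared-disk create   # initialize the disk once
# /etc/sysconfig/sbd on both nodes:
SBD_DEVICE="/dev/disk/by-id/my-shared-disk"
SBD_WATCHDOG_DEV=/dev/watchdog
# then point a stonith resource at it:
pcs stonith create fence-sbd fence_sbd \
    devices="/dev/disk/by-id/my-shared-disk"
pcs property set stonith-enabled=true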

>
> [It seems this is the last question from my side on this topic]
>
> Thank you and Ken for the participation!
>
> Regards,
> Maxim

Not saying I'm not interested in experiences/requirements with
pacemaker doing failovers in a sub-second or more relaxed
low-single-digit-second timeframe.
Seeing this working reliably would open up pacemaker for a
completely new class of applications.

Regards,
Klaus



[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Maxim

On 13.02.2018 14:03, Klaus Wenninger wrote:

> - fencing helps you turn the 'maybe the node is down - it doesn't
> respond within x milli-seconds' into certainty that your node is dead
> and won't interfere with the rest of the cluster
>
> Regards, Klaus

It is clear. But will it force pacemaker to perceive that the node is
down faster?
[Unfortunately, I have no hardware nearby that implements fencing
abilities and can't try it myself]


[It seems this is the last question from my side on this topic]

Thank you and Ken for the participation!

Regards,
Maxim


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Klaus Wenninger
On 02/13/2018 11:46 AM, Maxim wrote:
> On 12.02.2018 19:31, Digimer wrote:
>> Without fencing, all bets are off. Please enable it and see if the
>> issue remains
> Seems, I know [in theory] about the fencing ability and its importance
> (although I've never configured it so far).
> But I don't understand how it would help in the situations of a hard
> reboot/shutdown.

Actually in 2 ways:

- you are strongly advised to use fencing - thus the base of users running
  fencing is much larger, and strange/unexpected behavior is much more
  likely with the less-tested setups without fencing
- fencing helps you turn the 'maybe the node is down - it doesn't respond
  within x milli-seconds' into certainty that your node is dead and won't
  interfere with the rest of the cluster

Regards,
Klaus
>
>> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
>> should be using the cman plugin with corosync 1. May I ask why you
>> don't use EL7 if you want such a recent stack?
> For historical reasons, let's say. I have other software that is built
> for a RHEL 6-like OS and has to be installed on the cluster node.
> The EL 7 stack is already not so recent, but it's one of the most stable
> and least vulnerable, I suppose. And I understand the risks.
> I will update pcs to the latest version when I find a bit of free time.


[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Maxim

On 12.02.2018 19:31, Digimer wrote:

> Without fencing, all bets are off. Please enable it and see if the
> issue remains
Seems, I know [in theory] about the fencing ability and its importance
(although I've never configured it so far).
But I don't understand how it would help in the situations of a hard
reboot/shutdown.

> Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
> should be using the cman plugin with corosync 1. May I ask why you
> don't use EL7 if you want such a recent stack?
For historical reasons, let's say. I have other software that is built
for a RHEL 6-like OS and has to be installed on the cluster node.
The EL 7 stack is already not so recent, but it's one of the most stable
and least vulnerable, I suppose. And I understand the risks.

I will update pcs to the latest version when I find a bit of free time.


[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-13 Thread Maxim

On 12.02.2018 18:46, Klaus Wenninger wrote:
> Maybe a few notes on the other way ;-) In general it is not easy to
> have a reliable answer to the question whether the other node is down
> within just, let's say, 100ms. Think of network hiccups, scheduling
> issues and the like ... But if you are willing to accept
> false-positives you can reduce the token timeout of corosync instead
> of having another script that tries to do the job corosync is (amongst
> other things) made for (at least that is how I understood what you
> are aiming to do).
>
> Regards, Klaus

Thank you again, Klaus.
Your description helped me understand the situation better (I've been
overworking a bit and couldn't think this not-so-trivial thing through
by myself =)).


[
I have a scenario in mind where an ability to mark a corosync ring as
failed would be useful, but it doesn't relate to this topic.
Implementing it on the corosync side would require some additional
functionality for "checking" rings (let's call them so) that can be used
only for network checking (not for cluster data synchronization). And
the breakage of all "checking" rings (or some more advanced logic) would
indicate that the node is down or has a split brain. Just an idea.
]

No such ability? Ok, I'll try to deal with it )


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Digimer
On 2018-02-12 08:15 AM, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
>> Hello,
>>
>> [Sorry for the message duplication. The web mail client ruined the
>> formatting of the previous e-mail =( ]
>>
>> There is a simple configuration of two cluster nodes (built via the
>> RHEL 6 pcs interface) with multiple master/slave resources, disabled
>> fencing and a single sync interface.
> 
> fencing-disabled is probably due to it being a test-setup ...
> RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
> I'm not sure if it is advisable to do a setup for a corosync-2 pacemaker
> setup with that. You've obviously edited corosync.conf to
> reflect that ...

Without fencing, all bets are off. Please enable it and see if the issue
remains

Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
should be using the cman plugin with corosync 1. May I ask why you
don't use EL7 if you want such a recent stack?

>> All is ok mainly. But there is some problem of the cluster activity
>> performance when the master node is powered off (hard): the slave node
>> detects that the master one is down after about 100-3500 ms. And the
>> main question is how to avoid this 3 sec delay that occurs sometimes.
> 
> Kind of interesting that you ever get a detection below 2000ms with the
> token-timeout set to that value. (Given you are doing a hard-shutdown
> that doesn't give corosync time to sign off.)
> You've derived these times from the corosync-logs!?
> 
> Regards,
> Klaus
> 
>>
>> On the slave node I have a little script that checks the connection to
>> the master node. It detects a sync breakage within about
>> 100 ms. But corosync sometimes requires much more time to figure out
>> the situation and mark the master node as offline. It shows an 'ok'
>> ring status.
>>
>> If I understand correctly, then
>> 1 the pacemaker actions (crm_resource --move) will not be performed
>> until corosync has refreshed its ring state
>> 2 the detection of a problem (from the corosync side) can be sped up
>> via timeout tuning in corosync.conf
>> 3 there is no way to ask corosync to recheck its ring status or mark a
>> ring as failed manually
>>
>> But maybe I'm missing something.
>>
>> All I want is to move resources faster.
>> In my little script I tried to force the cluster software to move
>> resources to the slave node. But I've had no success so far.
>>
>> Could you please share your thoughts about the situation.
>> Thank you in advance.
>>
>>
>> Cluster software:
>> corosync - 2.4.3
>> pacemaker - 1.1.18
>> libqb - 1.0.2
>>
>>
>> corosync.conf:
>> totem {
>>   version: 2
>>   secauth: off
>>   cluster_name: cluster
>>   transport: udpu
>>   token: 2000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: main-node
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: reserve-node
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>  provider: corosync_votequorum
>>  two_node: 1
>> }
>>
>>
>> Regards,
>> Maxim.
>>


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Klaus Wenninger
On 02/12/2018 04:34 PM, Maxim wrote:
> On 12.02.2018 16:15, Klaus Wenninger wrote:
>> On 02/12/2018 01:02 PM, Maxim wrote:
>> fencing-disabled is probably due to it being a test-setup ...
>> RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
>> I'm not sure if it is advisable to do a setup for a corosync-2
>> pacemaker setup with that. You've obviously edited corosync.conf to
>> reflect that ...
> It is ok. Fencing is not required at the time.
> It works well with the latest stable corosync and pacemaker that were
> built manually (not from the RHEL 6 repos).
> And the attached config was generated by this pcs (I've removed the
> 'logging' section from there to decrease the message size).
>
>>> All is ok mainly. But there is some problem of the cluster
>>> activity performance when the master node is powered off (hard):
>>> the slave node detects that the master one is down after about
>>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>>> that occurs sometimes.
>>
>> Kind of interesting that you ever get a detection below 2000ms with
>> the token-timeout set to that value. (Given you are doing a
>> hard-shutdown that doesn't give corosync time to sign off.) You've
>> derived these times from the corosync-logs!?
>>
>> Regards, Klaus
>
> Not actually. After your message I've conducted some more investigations
> with quite active logging on the master node to get the real time when
> the node goes down. And... you are right. The delay is close to 4
> seconds. So there is a [floating] bug in my script.
> Thank you for your insight, Klaus =)
>
> But nevertheless, is there any mechanism to force the slave corosync
> "to think" that the master corosync is down?
> [I have seen the abilities of corosync-cfgtool but, seems, it doesn't
> contain similar functionality]
> Or maybe are there some other ways?

Maybe a few notes on the other way ;-)
In general it is not easy to have a reliable answer to the question
whether the other node is down within just, let's say, 100ms.
Think of network hiccups, scheduling issues and the like ...
But if you are willing to accept false-positives you can reduce the
token timeout of corosync instead of having another script that tries
to do the job corosync is (amongst other things) made for (at least
that is how I understood what you are aiming to do).
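
For illustration, a more aggressive totem section could look like this
(the values are made up to show the idea, not a recommendation):

totem {
    token: 500        # declare the token lost after 500 ms
    consensus: 600    # must be larger than token (default 1.2 * token)
}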

Regards,
Klaus

>
> Regards, Maxim


  



[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Maxim

On 12.02.2018 16:15, Klaus Wenninger wrote:

> On 02/12/2018 01:02 PM, Maxim wrote:
> fencing-disabled is probably due to it being a test-setup ...
> RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
> I'm not sure if it is advisable to do a setup for a corosync-2
> pacemaker setup with that. You've obviously edited corosync.conf to
> reflect that ...
It is ok. Fencing is not required at the time.
It works well with the latest stable corosync and pacemaker that were
built manually (not from the RHEL 6 repos).
And the attached config was generated by this pcs (I've removed the
'logging' section from there to decrease the message size).

>>
>> All is ok mainly. But there is some problem of the cluster
>> activity performance when the master node is powered off (hard):
>> the slave node detects that the master one is down after about
>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>> that occurs sometimes.
>
> Kind of interesting that you ever get a detection below 2000ms with
> the token-timeout set to that value. (Given you are doing a
> hard-shutdown that doesn't give corosync time to sign off.) You've
> derived these times from the corosync-logs!?
>
> Regards, Klaus
>
Not actually. After your message I've conducted some more investigations
with quite active logging on the master node to get the real time when
the node goes down. And... you are right. The delay is close to 4
seconds. So there is a [floating] bug in my script.

Thank you for your insight, Klaus =)

But nevertheless, is there any mechanism to force the slave corosync
"to think" that the master corosync is down?
[I have seen the abilities of corosync-cfgtool but, seems, it doesn't
contain similar functionality]

Or maybe are there some other ways?

Regards, Maxim




Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Klaus Wenninger
On 02/12/2018 01:02 PM, Maxim wrote:
> Hello,
>
> [Sorry for the message duplication. The web mail client ruined the
> formatting of the previous e-mail =( ]
>
> There is a simple configuration of two cluster nodes (built via the
> RHEL 6 pcs interface) with multiple master/slave resources, disabled
> fencing and a single sync interface.

fencing-disabled is probably due to it being a test-setup ...
RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
I'm not sure if it is advisable to do a setup for a corosync-2 pacemaker
setup with that. You've obviously edited corosync.conf to
reflect that ...
>
> All is ok mainly. But there is some problem of the cluster activity
> performance when the master node is powered off (hard): the slave node
> detects that the master one is down after about 100-3500 ms. And the
> main question is how to avoid this 3 sec delay that occurs sometimes.

Kind of interesting that you ever get a detection below 2000ms with the
token-timeout set to that value. (Given you are doing a hard-shutdown
that doesn't give corosync time to sign off.)
You've derived these times from the corosync-logs!?

Regards,
Klaus

>
> On the slave node I have a little script that checks the connection to
> the master node. It detects a sync breakage within about
> 100 ms. But corosync sometimes requires much more time to figure out
> the situation and mark the master node as offline. It shows an 'ok'
> ring status.
>
> If I understand correctly, then
> 1 the pacemaker actions (crm_resource --move) will not be performed
> until corosync has refreshed its ring state
> 2 the detection of a problem (from the corosync side) can be sped up
> via timeout tuning in corosync.conf
> 3 there is no way to ask corosync to recheck its ring status or mark a
> ring as failed manually
>
> But maybe I'm missing something.
>
> All I want is to move resources faster.
> In my little script I tried to force the cluster software to move
> resources to the slave node. But I've had no success so far.
>
> Could you please share your thoughts about the situation.
> Thank you in advance.
>
>
> Cluster software:
> corosync - 2.4.3
> pacemaker - 1.1.18
> libqb - 1.0.2
>
>
> corosync.conf:
> totem {
>   version: 2
>   secauth: off
>   cluster_name: cluster
>   transport: udpu
>   token: 2000
> }
>
> nodelist {
>     node {
>         ring0_addr: main-node
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: reserve-node
>         nodeid: 2
>     }
> }
>
> quorum {
>  provider: corosync_votequorum
>  two_node: 1
> }
>
>
> Regards,
> Maxim.
>



[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Maxim

Hello,

[Sorry for the message duplication. The web mail client ruined the
formatting of the previous e-mail =( ]


There is a simple configuration of two cluster nodes (built via the
RHEL 6 pcs interface) with multiple master/slave resources, disabled
fencing and a single sync interface.


All is ok mainly. But there is some problem with the cluster reaction
speed when the master node is powered off (hard): the slave node
detects that the master one is down after about 100-3500 ms. And the
main question is how to avoid this 3 sec delay that occurs sometimes.


On the slave node I have a little script that checks the connection to
the master node. It detects a sync breakage within about 100 ms. But
corosync sometimes requires much more time to figure out the situation
and mark the master node as offline. It shows an 'ok' ring status.
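
[The ring status here is what these report, assuming the stock
corosync 2.x tools:
corosync-cfgtool -s      # ring status per interface
corosync-quorumtool -s   # quorum state and current membership
]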


If I understand correctly, then
1 the pacemaker actions (crm_resource --move, see the example below)
will not be performed until corosync has refreshed its ring state
2 the detection of a problem (from the corosync side) can be sped up
via timeout tuning in corosync.conf
3 there is no way to ask corosync to recheck its ring status or mark a
ring as failed manually
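
[For reference, the move call from point 1 looks like this; the
resource name is hypothetical:
crm_resource --move --resource ms-app --node reserve-node
crm_resource --clear --resource ms-app   # later: drop the move constraint
]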


But maybe I'm missing something.

All I want is to move resources faster.
In my little script I tried to force the cluster software to move
resources to the slave node. But I've had no success so far.


Could you please share your thoughts about the situation.
Thank you in advance.


Cluster software:
corosync - 2.4.3
pacemaker - 1.1.18
libqb - 1.0.2


corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: cluster
    transport: udpu
    token: 2000
}

nodelist {
    node {
        ring0_addr: main-node
        nodeid: 1
    }

    node {
        ring0_addr: reserve-node
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}


Regards,
Maxim.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

