Re: [ClusterLabs] When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-07 Thread Shermal Fernando
The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
starvation or hangs while trying to perform an I/O operation.
Please share some thoughts on this issue.

Regards,
Shermal Fernando

-----Original Message-----
From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
Sent: Monday, September 05, 2016 6:42 PM
To: users@clusterlabs.org; develop...@clusterlabs.org
Subject: Re: [ClusterLabs] When the DC crmd is frozen, cluster decisions are 
delayed infinitely

On 09/03/2016 08:42 PM, Shermal Fernando wrote:
> Hi,
>
> Currently our system has 99.96% uptime, but our goal is to increase
> it beyond 99.999%. We are now studying the
> reliability/performance/features of Pacemaker to replace the existing
> clustering solution.
>
> While testing Pacemaker, I encountered a problem. If the DC (crm
> daemon) is frozen by sending it the SIGSTOP signal, the crmds on the
> other machines never start an election to choose a new DC. Therefore
> failovers, resource restarts and other cluster decisions are delayed
> until the DC is unfrozen.
>
> Is this the default behavior of Pacemaker, or is it due to a
> misconfiguration? Is there any way to avoid this single point of failure?
>
> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the
> SLES 12 SP1 operating system.
>

I think I can reproduce that with Pacemaker 1.1.15 and Corosync 2.3.6.
I also have sbd with pacemaker-watcher running on the nodes.
Since the node-health is not updated and the CIB can still be read, sbd is
happy, as is to be expected.
Maybe we could at least add something to sbd-pacemaker-watcher to detect the
issue ... thinking ...

Regards,
Klaus

> Regards,
>
> Shermal Fernando
>
> This e-mail transmission (inclusive of any attachments) is strictly 
> confidential and intended solely for the ordinary user of the e-mail 
> address to which it was addressed. It may contain legally privileged 
> and/or CONFIDENTIAL information. The unauthorized use, disclosure, 
> distribution printing and/or copying of this e-mail or any information 
> it contains is prohibited and could, in certain circumstances, 
> constitute an offence. If you have received this e-mail in error or 
> are not an intended recipient please inform the sender of the email 
> and MillenniumIT immediately by return e-mail or telephone (+94-11) 
> 2416000. We advise that in keeping with good computing practice, the 
> recipient of this e-mail should ensure that it is virus free. We do 
> not accept responsibility for any virus that may be transferred by way 
> of this e-mail. E-mail may be susceptible to data corruption, 
> interception and unauthorized amendment, and we do not accept 
> liability for any such corruption, interception or amendment or any 
> consequences thereof.
>
> www.millenniumit.com 
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-05 Thread Klaus Wenninger
On 09/03/2016 08:42 PM, Shermal Fernando wrote:
> Hi,
>
> Currently our system has 99.96% uptime, but our goal is to increase
> it beyond 99.999%. We are now studying the
> reliability/performance/features of Pacemaker to replace the existing
> clustering solution.
>
> While testing Pacemaker, I encountered a problem. If the DC (crm
> daemon) is frozen by sending it the SIGSTOP signal, the crmds on the
> other machines never start an election to choose a new DC. Therefore
> failovers, resource restarts and other cluster decisions are delayed
> until the DC is unfrozen.
>
> Is this the default behavior of Pacemaker, or is it due to a
> misconfiguration? Is there any way to avoid this single point of failure?
>
> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the
> SLES 12 SP1 operating system.
>

I think I can reproduce that with Pacemaker 1.1.15 and Corosync 2.3.6.
I also have sbd with pacemaker-watcher running on the nodes.
Since the node-health is not updated and the CIB can still be read, sbd is
happy, as is to be expected.
Maybe we could at least add something to sbd-pacemaker-watcher
to detect the issue ... thinking ...
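
To make the detection idea above a little more concrete, here is a minimal
sketch of one way a watcher could notice that a daemon has been stopped with
SIGSTOP. It is not how sbd or its pacemaker-watcher work; it only reads the
Linux /proc/<pid>/stat file, and the function names (parse_state,
process_state, is_stopped) are hypothetical. It would not catch CPU
starvation, and a hang in I/O would show up as state 'D' rather than 'T'.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Extract the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split after the last ')' before taking the state.
    return stat_line.rsplit(")", 1)[1].split()[0]


def process_state(pid: int) -> str:
    """Read the current state letter of a running process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


def is_stopped(pid: int) -> bool:
    """Return True if the process is in the stopped state ('T')."""
    return process_state(pid) == "T"
```

A watcher built on this would still have to decide what to do when the check
fires, for example refusing to report the node as healthy; that policy is
outside the scope of the sketch.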

Regards,
Klaus

> Regards,
>
> Shermal Fernando
>

[ClusterLabs] When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-03 Thread Shermal Fernando
Hi,

Currently our system has 99.96% uptime, but our goal is to increase it beyond 
99.999%. We are now studying the reliability/performance/features of Pacemaker 
to replace the existing clustering solution.
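
For a sense of scale, the two uptime figures above can be converted into
yearly downtime budgets. This is only a worked sketch: the helper name
downtime_minutes_per_year is made up for illustration, and a 365-day year is
assumed.

```python
# Convert an availability percentage into the downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.96, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.96% corresponds to roughly 210 minutes of downtime
per year and 99.999% to roughly 5 minutes, so even one long stall in cluster
decision-making could use up the whole budget.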

While testing Pacemaker, I encountered a problem. If the DC (crm daemon) 
is frozen by sending it the SIGSTOP signal, the crmds on the other machines 
never start an election to choose a new DC. Therefore failovers, resource 
restarts and other cluster decisions are delayed until the DC is unfrozen.
Is this the default behavior of Pacemaker, or is it due to a misconfiguration? 
Is there any way to avoid this single point of failure?
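
The freeze test described above can be sketched as a small script that
suspends a process with SIGSTOP and resumes it with SIGCONT. This is an
illustration rather than the exact procedure used in the test: the helper
name freeze_process is hypothetical, and the demo targets a harmless child
process. On a real cluster node one would pass the PID of the crmd on the DC
instead (found, for example, with `pidof crmd`, which is an assumption about
the process name) and run the script as root.

```python
import os
import signal
import subprocess
import sys
import time


def freeze_process(pid: int, seconds: float) -> None:
    """Suspend a process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the target, even if the wait is interrupted.
        os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in target: a child process that just sleeps for a few seconds.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    freeze_process(child.pid, 1.0)
    child.terminate()
    child.wait()
```

While the target is frozen, the rest of the cluster can be watched (for
example with crm_mon on another node) to see whether a new DC is elected.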

For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the SLES 12 SP1 
operating system.


Regards,
Shermal Fernando
