Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-06 Thread Klaus Wenninger
On 10/05/2016 04:22 PM, renayama19661...@ybb.ne.jp wrote:
> Hi All,
>
>>> If a user uses sbd, can the cluster avoid the problem of crmd being
>>> stopped by SIGSTOP?
>>
>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping
>> crmd will reboot the node (unless the watchdog fails).
>
> Thank you for the comment.
>
> We are examining a watchdog for crmd, too.
> I will comment again once the examination has advanced.

I was thinking of doing a small test implementation, going
a little in the direction Lars Ellenberg had been pointing out.

A couple of thoughts I have had so far:

- add an API to sbd (via DBus or libqb - favoring libqb atm)
  that an application can use to create a watchdog within sbd

- initial parameters would be a name and a timeout

- first use-case would be crmd observation

- later on we could think of removing pacemaker dependencies
  from sbd by moving the actual implementation of
  pacemaker-watcher and probably cluster-watcher as well
  into pacemaker - using the new API

- this of course creates an sbd dependency within pacemaker, so
  it would make sense to offer a simpler, self-contained
  implementation within pacemaker as an alternative

  thus it would be favorable to have the dependency
  within a non-compulsory pacemaker rpm, so that
  we can offer an alternative that doesn't use sbd,
  at the cost of maybe being less reliable, or one
  that owns a hardware watchdog by itself on systems
  where it is still unused.

  - e.g. via some kind of plugin (Andrew forgive me -
    no pils ;-) )
  - or via an additional daemon
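
A minimal sketch of what such a named watchdog with a name and a timeout
could look like, written in Python purely for illustration. The class
WatchdogRegistry and its methods are hypothetical names; this is not an
existing sbd, libqb or DBus API, and the clock is injectable only to keep
the example testable.

```python
import time


class WatchdogRegistry:
    """Hypothetical illustration only; not an sbd or libqb API."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._dogs = {}  # name -> (timeout in seconds, time of last feed)

    def create(self, name, timeout):
        """Register a watchdog under `name`; it starts out freshly fed."""
        self._dogs[name] = (timeout, self._clock())

    def feed(self, name):
        """Client heartbeat: reset the timer of an existing watchdog."""
        timeout, _ = self._dogs[name]
        self._dogs[name] = (timeout, self._clock())

    def expired(self):
        """Return the names of all watchdogs not fed within their timeout."""
        now = self._clock()
        return sorted(
            name
            for name, (timeout, last_fed) in self._dogs.items()
            if now - last_fed > timeout
        )
```

In this sketch a client such as crmd would call feed("crmd") periodically.
A process frozen by SIGSTOP stops feeding, so its name eventually shows up
in expired(), and the supervising daemon could then stop feeding the
hardware watchdog.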

What did you have in mind?
Maybe it makes sense to synchronize...

Regards,
Klaus
 
>
>
> Best Regards,
> Hideo Yamauchi.
>
>
>
> - Original Message -
>> From: Ulrich Windl 
>> To: users@clusterlabs.org; renayama19661...@ybb.ne.jp
>> Cc: 
>> Date: 2016/10/5, Wed 23:08
>> Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
>> cluster decisions are delayed infinitely
>>
>> wrote on 21.09.2016 at 11:52 in message
>> <876439.61305...@web200311.mail.ssk.yahoo.co.jp>:
>>>  Hi All,
>>>
>>>  Was the final conclusion given about this problem?
>>>
>>>  If a user uses sbd, can the cluster avoid the problem of crmd being
>>>  stopped by SIGSTOP?
>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping
>> crmd will reboot the node (unless the watchdog fails).
>>
>>>  We are interested in this problem, too.
>>>
>>>  Best Regards,
>>>
>>>  Hideo Yamauchi.
>>>
>>>



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-05 Thread renayama19661014
Hi All,

>> If a user uses sbd, can the cluster avoid the problem of crmd being
>> stopped by SIGSTOP?
> 
> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd 
> will reboot the node (unless the watchdog fails).


Thank you for the comment.

We are examining a watchdog for crmd, too.
I will comment again once the examination has advanced.


Best Regards,
Hideo Yamauchi.



- Original Message -
> From: Ulrich Windl 
> To: users@clusterlabs.org; renayama19661...@ybb.ne.jp
> Cc: 
> Date: 2016/10/5, Wed 23:08
> Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
> cluster decisions are delayed infinitely
> 
> wrote on 21.09.2016 at 11:52 in message
> <876439.61305...@web200311.mail.ssk.yahoo.co.jp>:
>>  Hi All,
>> 
>>  Was the final conclusion given about this problem?
>> 
>>  If a user uses sbd, can the cluster avoid the problem of crmd being
>>  stopped by SIGSTOP?
> 
> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd 
> will reboot the node (unless the watchdog fails).
> 
>> 
>>  We are interested in this problem, too.
>> 
>>  Best Regards,
>> 
>>  Hideo Yamauchi.
>> 
>> 


Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Digimer
On 08/09/16 09:32 PM, Klaus Wenninger wrote:
> On 09/08/2016 02:28 PM, Ulrich Windl wrote:
>> Klaus Wenninger wrote on 08.09.2016 at 09:13 in
>> message <4c828344-44da-1d93-b43f-a305cfaa5...@redhat.com>:
>>> On 09/08/2016 08:55 AM, Digimer wrote:
>>>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>>> Shermal Fernando wrote on 08.09.2016 at 06:41 in message
>>>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>>>>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU
>>>>>> starvation or hanging while trying to perform an I/O operation.
>>>>>> Please share some thoughts on this issue.
>>>>> What is "the whole cluster will fail"? If the DC times out, some recovery
>>>>> will take place.
>>>> Yup. The starved node should be declared lost by corosync, the remaining
>>>> nodes reform and if they're still quorate, the hung node should be
>>>> fenced. Recovery occurs and life goes on.
>>> Didn't happen in my test (SIGSTOP to crmd).
>>> Might be a configuration mistake though...
>>> Even had sbd with a watchdog active (amongst
>>> other - real - fencing devices).
>>> Thinking if it might make sense to tickle the
>>> crmd-API from sbd-pacemaker-watcher ...
>> OK, so we mix "DC" and crmd. crmd is just a part of the DC. I guess if
>> corosync is up and happy, but crmd is silent, the cluster just thinks that
>> the DC has nothing to say.
>> But I still wonder what will happen if crmd is going to send some reply to a
>> command.
>
> Just lost accuracy during discussion. We did stop crmd on the DC.

Corosync (via totem protocol's token timeouts) declares node death.
Pacemaker reacts to the change in membership by checking if the
remaining nodes/new cluster is quorate and, if so, initiates fencing. If
corosync doesn't lose the peer, the cluster won't reform and fencing (at
the membership layer) won't be triggered.
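
The decision flow described above can be sketched roughly as follows. This
is a simplified illustration, not Pacemaker code: the function name and the
return strings are hypothetical, and quorum is assumed to be a strict
majority of votes, which ignores special cases such as two-node clusters.

```python
def membership_layer_action(peer_lost_by_corosync, remaining_votes, total_votes):
    """Hypothetical model of the membership-driven fencing decision."""
    if not peer_lost_by_corosync:
        # Corosync still sees the peer (e.g. only crmd is frozen):
        # membership does not change, so no fencing is triggered here.
        return "no-change"
    if remaining_votes * 2 > total_votes:
        # Assumption: quorate means a strict majority of all votes.
        return "fence-lost-node"
    return "no-quorum"
```

With three nodes, a crmd frozen by SIGSTOP while corosync keeps running
corresponds to membership_layer_action(False, 3, 3), which yields no
change. That matches the behaviour seen in the test earlier in the thread.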

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Klaus Wenninger
On 09/08/2016 02:28 PM, Ulrich Windl wrote:
> Klaus Wenninger wrote on 08.09.2016 at 09:13 in
> message <4c828344-44da-1d93-b43f-a305cfaa5...@redhat.com>:
>> On 09/08/2016 08:55 AM, Digimer wrote:
>>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>> Shermal Fernando wrote on 08.09.2016 at 06:41 in message
>>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>>>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU
>>>>> starvation or hanging while trying to perform an I/O operation.
>>>>> Please share some thoughts on this issue.
>>>> What is "the whole cluster will fail"? If the DC times out, some recovery
>>>> will take place.
>>> Yup. The starved node should be declared lost by corosync, the remaining
>>> nodes reform and if they're still quorate, the hung node should be
>>> fenced. Recovery occurs and life goes on.
>> Didn't happen in my test (SIGSTOP to crmd).
>> Might be a configuration mistake though...
>> Even had sbd with a watchdog active (amongst
>> other - real - fencing devices).
>> Thinking if it might make sense to tickle the
>> crmd-API from sbd-pacemaker-watcher ...
> OK, so we mix "DC" and crmd. crmd is just a part of the DC. I guess if
> corosync is up and happy, but crmd is silent, the cluster just thinks that
> the DC has nothing to say.
> But I still wonder what will happen if crmd is going to send some reply to a
> command.

Just lost accuracy during discussion. We did stop crmd on the DC.

>
>>> Unless you don't have fencing, then may $deity of mercy. ;)
>>>