Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
On 10/05/2016 04:22 PM, renayama19661...@ybb.ne.jp wrote:
> Hi All,
>
>>> If a user uses sbd, can the cluster avoid the problem of a SIGSTOP sent to crmd?
>>
>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>> will reboot the node (unless the watchdog fails).
>
> Thank you for the comment.
>
> We are examining a watchdog for crmd, too.
> I will comment again once the examination has advanced.

I was thinking of doing a small test implementation, going a little in the
direction Lars Ellenberg had been pointing out.

A couple of thoughts I had so far:

- add an API (via DBus or libqb - favoring libqb atm) to sbd that an
  application can use to create a watchdog within sbd
- parameters for the first are a name and a timeout
- first use-case would be crmd observation
- later on we could think of removing pacemaker dependencies from sbd by
  moving the actual implementation of pacemaker-watcher, and probably
  cluster-watcher as well, into pacemaker - using the new API
- this of course creates an sbd dependency within pacemaker, so it would
  make sense to offer a simpler, self-contained implementation within
  pacemaker as an alternative. Thus it would be favorable to have the
  dependency within a non-compulsory pacemaker rpm, so that we can offer
  an alternative that doesn't use sbd (at maybe the cost of being less
  reliable), or one that owns a hardware watchdog by itself on systems
  where this is still unused.
  - e.g. via some kind of plugin (Andrew forgive me - no pils ;-) )
  - or via an additional daemon

What did you have in mind?
Maybe it makes sense to synchronize...

Regards,
Klaus

> Best Regards,
> Hideo Yamauchi.
>
> - Original Message -
>> From: Ulrich Windl
>> To: users@clusterlabs.org; renayama19661...@ybb.ne.jp
>> Cc:
>> Date: 2016/10/5, Wed 23:08
>> Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen,
>> cluster decisions are delayed infinitely
>>
>> wrote on 21.09.2016 at 11:52 in message
>> <876439.61305...@web200311.mail.ssk.yahoo.co.jp>:
>>> Hi All,
>>>
>>> Was a final conclusion reached on this problem?
>>>
>>> If a user uses sbd, can the cluster avoid the problem of a SIGSTOP sent to crmd?
>>
>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>> will reboot the node (unless the watchdog fails).
>>
>>> We are interested in this problem, too.
>>>
>>> Best Regards,
>>>
>>> Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
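A minimal sketch of the idea in Klaus's list: an application registers a named watchdog with a timeout and must keep feeding it. This is not sbd code and not a real sbd, libqb or DBus interface. The class and method names (`WatchdogRegistry`, `create`, `feed`, `expired`, `may_feed_hardware_watchdog`) are hypothetical, and the rule that the hardware watchdog is fed only while no registered watchdog has expired is an assumption made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SoftWatchdog:
    """One named watchdog: expired if not fed within `timeout` seconds."""
    name: str
    timeout: float
    last_fed: float


class WatchdogRegistry:
    """Hypothetical stand-in for the proposed API (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so the logic can be exercised without sleeping.
        self._clock = clock
        self._watchdogs: Dict[str, SoftWatchdog] = {}

    def create(self, name: str, timeout: float) -> None:
        """Register a watchdog; the two parameters are a name and a timeout."""
        if timeout <= 0:
            raise ValueError("timeout must be positive")
        self._watchdogs[name] = SoftWatchdog(name, timeout, self._clock())

    def feed(self, name: str) -> None:
        """Called periodically by the observed application (e.g. crmd)."""
        self._watchdogs[name].last_fed = self._clock()

    def expired(self) -> List[str]:
        """Names of all watchdogs that were not fed within their timeout."""
        now = self._clock()
        return sorted(
            w.name for w in self._watchdogs.values()
            if now - w.last_fed > w.timeout
        )

    def may_feed_hardware_watchdog(self) -> bool:
        # Assumption: the owner of the hardware watchdog stops feeding it
        # as soon as any registered software watchdog has expired, so a
        # frozen client eventually leads to a node reset.
        return not self.expired()
```

In this sketch a crmd that has received SIGSTOP simply stops calling `feed("crmd")`. Once the timeout passes, `may_feed_hardware_watchdog()` returns `False`, the hardware watchdog is no longer fed, and the node would be reset.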
Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
Hi All,

>> If a user uses sbd, can the cluster avoid the problem of a SIGSTOP sent to crmd?
>
> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
> will reboot the node (unless the watchdog fails).

Thank you for the comment.

We are examining a watchdog for crmd, too.
I will comment again once the examination has advanced.

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: Ulrich Windl
> To: users@clusterlabs.org; renayama19661...@ybb.ne.jp
> Cc:
> Date: 2016/10/5, Wed 23:08
> Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen,
> cluster decisions are delayed infinitely
>
> wrote on 21.09.2016 at 11:52 in message
> <876439.61305...@web200311.mail.ssk.yahoo.co.jp>:
>> Hi All,
>>
>> Was a final conclusion reached on this problem?
>>
>> If a user uses sbd, can the cluster avoid the problem of a SIGSTOP sent to crmd?
>
> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
> will reboot the node (unless the watchdog fails).
>
>> We are interested in this problem, too.
>>
>> Best Regards,
>>
>> Hideo Yamauchi.
Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
On 08/09/16 09:32 PM, Klaus Wenninger wrote:
> On 09/08/2016 02:28 PM, Ulrich Windl wrote:
>> Klaus Wenninger wrote on 08.09.2016 at 09:13 in
>> message <4c828344-44da-1d93-b43f-a305cfaa5...@redhat.com>:
>>> On 09/08/2016 08:55 AM, Digimer wrote:
>>>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>>> Shermal Fernando wrote on 08.09.2016 at 06:41 in message
>>>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>>>>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU
>>>>>> starvation or hanging while trying to perform an IO operation.
>>>>>> Please share some thoughts on this issue.
>>>>> What is "the whole cluster will fail"? If the DC times out, some recovery
>>>>> will take place.
>>>> Yup. The starved node should be declared lost by corosync, the remaining
>>>> nodes reform and, if they're still quorate, the hung node should be
>>>> fenced. Recovery occurs and life goes on.
>>> Didn't happen in my test (SIGSTOP to crmd).
>>> Might be a configuration mistake though...
>>> Even had sbd with a watchdog active (amongst
>>> other - real - fencing devices).
>>> Thinking if it might make sense to tickle the
>>> crmd-API from sbd-pacemaker-watcher ...
>> OK, so we mix "DC" and crmd. crmd is just a part of the DC. I guess if
>> corosync is up and happy, but crmd is silent, the cluster just thinks that
>> the DC has nothing to say.
>> But I still wonder what will happen if crmd is going to send some reply to a
>> command.
>
> Just lost accuracy during discussion. We did stop crmd on the DC.

Corosync (via the totem protocol's token timeouts) declares node death.
Pacemaker reacts to the change in membership by checking if the remaining
nodes/new cluster is quorate and, if so, initiates fencing.

If corosync doesn't lose the peer, the cluster won't reform and fencing
(at the membership layer) won't be triggered.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
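The sequence Digimer describes (membership change, quorum check, then fencing) can be sketched as a small decision function. This is an illustration only, not corosync or pacemaker code. The function names are hypothetical, and quorum is reduced to a plain majority of the configured nodes, which is an assumption that ignores options such as two-node or tie-breaker setups.

```python
from typing import AbstractSet, Set


def is_quorate(members: AbstractSet[str], total_nodes: int) -> bool:
    """Assumed rule: a partition is quorate with a strict majority of nodes."""
    return len(members) > total_nodes // 2


def nodes_to_fence(all_nodes: AbstractSet[str],
                   members: AbstractSet[str]) -> Set[str]:
    """Nodes the surviving partition would fence after a membership change.

    `members` is the membership as seen at the corosync layer. A node whose
    corosync still answers stays in `members`, even if its crmd is frozen.
    """
    if not is_quorate(members, len(all_nodes)):
        return set()  # a non-quorate partition must not fence anyone
    return set(all_nodes) - set(members)


cluster = {"node1", "node2", "node3"}

# node1 stops passing the token: it drops out of membership and is fenced.
print(nodes_to_fence(cluster, {"node2", "node3"}))

# Only crmd on node1 is stopped: membership is unchanged, nothing is fenced.
print(nodes_to_fence(cluster, cluster))
```

The second call is the case discussed in this thread: because membership never changes, the membership layer never triggers fencing, regardless of how long crmd stays frozen.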
Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
On 09/08/2016 02:28 PM, Ulrich Windl wrote:
> Klaus Wenninger wrote on 08.09.2016 at 09:13 in
> message <4c828344-44da-1d93-b43f-a305cfaa5...@redhat.com>:
>> On 09/08/2016 08:55 AM, Digimer wrote:
>>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>> Shermal Fernando wrote on 08.09.2016 at 06:41 in message
>>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>>>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU
>>>>> starvation or hanging while trying to perform an IO operation.
>>>>> Please share some thoughts on this issue.
>>>> What is "the whole cluster will fail"? If the DC times out, some recovery
>>>> will take place.
>>> Yup. The starved node should be declared lost by corosync, the remaining
>>> nodes reform and, if they're still quorate, the hung node should be
>>> fenced. Recovery occurs and life goes on.
>> Didn't happen in my test (SIGSTOP to crmd).
>> Might be a configuration mistake though...
>> Even had sbd with a watchdog active (amongst
>> other - real - fencing devices).
>> Thinking if it might make sense to tickle the
>> crmd-API from sbd-pacemaker-watcher ...
> OK, so we mix "DC" and crmd. crmd is just a part of the DC. I guess if
> corosync is up and happy, but crmd is silent, the cluster just thinks that
> the DC has nothing to say.
> But I still wonder what will happen if crmd is going to send some reply to a
> command.

Just lost accuracy during discussion. We did stop crmd on the DC.

>>> Unless you don't have fencing, then may $deity of mercy. ;)