Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/10/2017 01:54 PM, Ken Gaillot wrote:
> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
>> - fencing in 2-node clusters does not work reliably without fixed delay
>
> Not quite. Fixed delay allows a particular method for avoiding a death
> match in a two-node cluster. Pacemaker's built-in random delay
> capability is another method.

Deathmatch is one problem, killing the wrong node (2 nodes, no quorum) is another. Fixed delay is digimer's attempt to alleviate the latter, so... apples and fruits not entirely unlike apples.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
>
> i remember that digimer often campaigns for a fence delay in a 2-node
> cluster.
> ...
> But ... a random delay does not seem to
> be a reliable solution.
>
>> Some fence agents implement a delay parameter of their own, to set a
>> fixed delay. I believe that's what digimer uses.
>
> Is it just me or does this sound like catch-22:
> - pacemaker does not work reliably without fencing

Correct -- more specifically, some failure scenarios can't be safely handled without fencing.

> - fencing in 2-node clusters does not work reliably without fixed delay

Not quite. Fixed delay allows a particular method for avoiding a death match in a two-node cluster. Pacemaker's built-in random delay capability is another method.

> - code that ships with pacemaker does not implement fixed delay.

Fence agents are used with pacemaker but not shipped as part of it. They have their own packages distributed separately. Anyone can write a fence agent and make it available to the community.

It would be nice if every fence agent supported a delay parameter, but there's no requirement to do so, and even if there were, it would just be a guideline -- it's up to the developer.

There's certainly an argument to be made for supporting a fixed delay at the pacemaker level. There's an idea floating around to do this based on node health, which could allow a lot of flexibility.
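To make the difference between the two methods concrete, here is a minimal toy model of a two-node fence race. It is only a sketch under stated assumptions, not how Pacemaker schedules fencing: the function names, the 2-second fence_latency and the 15-second delays are all made up for illustration, and the model assumes that a fence request, once started, always completes.

```python
import random


def fence_race(delay_a, delay_b, fence_latency=2.0):
    """Return the set of survivors of a toy two-node fence race.

    delay_a: seconds node A waits before it starts fencing node B.
    delay_b: seconds node B waits before it starts fencing node A.
    fence_latency: seconds until a started fence action powers off its target
                   (an assumed value, not a measured one).
    """
    # If the second request starts before the first one has taken effect,
    # both requests complete and both nodes go down (the death match).
    if abs(delay_a - delay_b) < fence_latency:
        return set()
    # Otherwise the node that waited less fences its peer and survives.
    return {"A"} if delay_a < delay_b else {"B"}


def double_fence_rate(max_delay, fence_latency=2.0, trials=100_000, seed=1):
    """Estimate how often two independent random delays still collide."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(trials):
        survivors = fence_race(rng.uniform(0, max_delay),
                               rng.uniform(0, max_delay),
                               fence_latency)
        if not survivors:
            collisions += 1
    return collisions / trials


if __name__ == "__main__":
    print("no delay:    ", fence_race(0, 0))    # both nodes are fenced
    print("fixed delay: ", fence_race(0, 15))   # A always wins
    print("random delay:", double_fence_rate(max_delay=15))
```

In this model a fixed delay on one side makes the outcome deterministic, while two independent random delays only make a collision less likely: the collision probability is 1 - (1 - fence_latency/max_delay)^2, which shrinks as the maximum delay grows relative to the fence latency.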
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
i remember that digimer often campaigns for a fence delay in a 2-node cluster.
...
But ... a random delay does not seem to be a reliable solution.

> Some fence agents implement a delay parameter of their own, to set a
> fixed delay. I believe that's what digimer uses.

Is it just me or does this sound like catch-22:
- pacemaker does not work reliably without fencing
- fencing in 2-node clusters does not work reliably without fixed delay
- code that ships with pacemaker does not implement fixed delay.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/10/2017 12:20 AM, Kristoffer Grönlund wrote:
> "Lentes, Bernd" writes:
>
>> - On May 8, 2017, at 9:20 PM, Bernd Lentes
>> bernd.len...@helmholtz-muenchen.de wrote:
>>
>>> Hi,
>>>
>>> i remember that digimer often campaigns for a fence delay in a 2-node
>>> cluster. E.g. here:
>>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>>> In my eyes it makes sense, so i try to establish that. I have two HP
>>> servers, each with an ILO card.
>>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>>> refused to work.
>>>
>>> But i don't have a delay parameter there.
>>> crm ra info stonith:external/ipmi:
>>>
>>> ...
>>> pcmk_delay_max (time, [0s]): Enable random delay for stonith actions and
>>> specify the maximum of random delay
>>>     This prevents double fencing when using slow devices such as sbd.
>>>     Use this to enable random delay for stonith actions and specify the
>>>     maximum of random delay.
>>> ...
>>>
>>> This is the only delay parameter i can use. But a random delay does not
>>> seem to be a reliable solution.
>>>
>>> The stonith:ipmilan agent also provides just a random delay. Same with the
>>> riloe agent.
>>>
>>> How did anyone solve this problem ?
>>>
>>> Or do i have to edit the RA (I will get practice in that :-))?
>>
>> crm ra info stonith:external/ipmi says there exists a parameter
>> pcmk_delay_max.
>> Having a look in /usr/lib64/stonith/plugins/external/ipmi i don't find
>> anything about delay.
>> Also "crm_resource --show-metadata=stonith:external/ipmi" does not say
>> anything about a delay.
>>
>> Is this "pcmk_delay_max" not implemented ? From where does "crm ra info
>> stonith:external/ipmi" get this info ?
>>
>> Bernd
>
> pcmk_delay_max is implemented by Pacemaker. crmsh gets the information
> about available parameters by querying stonithd directly.
>
> Cheers,
> Kristoffer

The various pcmk_* parameters are documented in the stonithd(7) man page.

Some fence agents implement a delay parameter of their own, to set a fixed delay. I believe that's what digimer uses.
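Because pcmk_delay_max is handled by Pacemaker rather than by the agent, it can be given like any other instance parameter when the stonith resource is defined. The sketch below only assembles a crmsh command line and prints it; nothing is executed. The resource id, address and credentials are hypothetical, and the agent parameter names (hostname, ipaddr, userid, passwd, interface) are my assumption of what external/ipmi expects, so check them against "crm ra info stonith:external/ipmi" before use.

```python
import shlex


def stonith_primitive_cmd(res_id, agent, params, delay_max=None):
    """Build a `crm configure primitive` command line for a stonith resource.

    pcmk_delay_max is interpreted by Pacemaker itself, so it is appended
    next to the agent's own parameters.
    """
    argv = ["crm", "configure", "primitive", res_id, agent, "params"]
    argv += [f"{key}={value}" for key, value in params.items()]
    if delay_max is not None:
        argv.append(f"pcmk_delay_max={delay_max}")
    return argv


if __name__ == "__main__":
    # Hypothetical values; substitute the real iLO address and credentials.
    cmd = stonith_primitive_cmd(
        "fence-node1",
        "stonith:external/ipmi",
        {"hostname": "node1", "ipaddr": "192.0.2.10",
         "userid": "fenceuser", "passwd": "secret", "interface": "lanplus"},
        delay_max="15s",
    )
    print(shlex.join(cmd))
```

Note that this still gives a random delay of up to 15 seconds, not a fixed one; a fixed delay needs an agent that implements its own delay parameter, as described above.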
Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> Actually I found some more details:
>
> there are two resources: A and B
>
> resource B depends on resource A (when the RA monitors B, it will fail
> if A is not running properly)
>
> If I stop resource A, the next monitor operation of "B" will fail.
> Interestingly, this check happens immediately after A is stopped.
>
> B is configured to restart if monitor fails. Start timeout is rather
> long, 180 seconds. So pacemaker tries to restart B, and waits.
>
> If I want to start "A", nothing happens until the start operation of
> "B" fails, typically several minutes.
>
> Is this the right behavior?
>
> It appears that pacemaker is blocked until resource B is being
> started, and I cannot really start its dependency...
>
> Shouldn't it be possible to start a resource while another resource is
> also starting?

As long as resources don't depend on each other, parallel starting should work. The number of actions executed in parallel is derived from the number of cores, and when load is detected some kind of throttling kicks in (that is, fewer operations are executed in parallel, with the aim of reducing the load induced by pacemaker). When throttling kicks in you should get log messages (there is in fact a parallel discussion going on ...). No idea if throttling might be a reason here, but it is maybe worth considering at least.

Another reason I've observed for certain things happening with quite some delay is that some situations are only resolved when the cluster-recheck-interval triggers a pengine run, in addition to the runs triggered by changes. You can easily verify this by changing the cluster-recheck-interval.
Regards,
Klaus

> Thanks,
> Attila
>
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Tuesday, May 9, 2017 9:53 PM
> To: users@clusterlabs.org; kgail...@redhat.com
> Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond
>
> Hi Ken, all,
>
> We ran into an issue very similar to the one described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
> Pacemaker occasionally takes minutes to respond
>
> But in our case we are not using fencing/stonith at all.
>
> Many times when I want to start/stop/cleanup a resource, it takes tens
> of seconds (or even minutes) till the command gets executed. The logs
> show nothing in that period, the redundant rings show no fault.
>
> Could this be the same issue?
>
> Any hints on how to troubleshoot this?
>
> It is pacemaker 1.1.10, corosync 2.3.3
>
> Cheers,
> Attila

--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Openstack Infrastructure
Red Hat
kwenn...@redhat.com
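One way to try the cluster-recheck-interval experiment suggested above is with crm_attribute. The sketch below is a hedged example rather than a recommended procedure: the helper names and the "2min" value are made up for illustration, and by default it only prints the commands (dry_run=True) instead of running them.

```python
import subprocess


def recheck_interval_cmd(value=None):
    """Build a crm_attribute command that queries or sets cluster-recheck-interval."""
    argv = ["crm_attribute", "--type", "crm_config",
            "--name", "cluster-recheck-interval"]
    # Without a value the property is queried; with one it is updated.
    argv += ["--query"] if value is None else ["--update", value]
    return argv


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(recheck_interval_cmd())        # show the current value
    run(recheck_interval_cmd("2min"))  # shorten it for the duration of the test
```

If the delays Attila describes shrink to roughly the new interval, that points at the recheck timer rather than at throttling; remember to restore the previous value afterwards.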