Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-10 Thread Dimitri Maziuk
On 05/10/2017 01:54 PM, Ken Gaillot wrote:
> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:

>> - fencing in 2-node clusters does not work reliably without fixed delay
> 
> Not quite. Fixed delay allows a particular method for avoiding a death
> match in a two-node cluster. Pacemaker's built-in random delay
> capability is another method.

Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
is another. Fixed delay is digimer's attempt to alleviate the latter,
so... apples and fruits not entirely unlike apples.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-10 Thread Ken Gaillot
On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
> 
> i remember that digimer often campaigns for a fence delay in a 2-node  
> cluster.
> ...
> But  ... a random delay does not seem to
> be a reliable solution.
> 
>> Some fence agents implement a delay parameter of their own, to set a
>> fixed delay. I believe that's what digimer uses.
> 
> Is it just me or does this sound like catch-22:
> - pacemaker does not work reliably without fencing

Correct -- more specifically, some failure scenarios can't be safely
handled without fencing.

> - fencing in 2-node clusters does not work reliably without fixed delay

Not quite. Fixed delay allows a particular method for avoiding a death
match in a two-node cluster. Pacemaker's built-in random delay
capability is another method.
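
To make the difference between the two methods concrete, here is a minimal simulation sketch. It is not Pacemaker code: the function names, the uniform distribution for the random delay, and the single `fence_latency` figure (time between a fence request being issued and the target actually losing power) are all assumptions made for illustration.

```python
import random


def survivors(delay_a, delay_b, fence_latency):
    """Return the set of nodes left alive after both try to fence each other.

    delay_a / delay_b: seconds node A / node B waits before issuing its
    fence request (hypothetical model parameters).
    fence_latency: seconds between a request being issued and the target
    being powered off (assumed identical for both devices).
    """
    first, second = sorted([(delay_a, "A"), (delay_b, "B")])
    # The earlier request always goes out. The later one only goes out if
    # its sender has not yet been powered off by the earlier request.
    if second[0] < first[0] + fence_latency:
        return set()       # both requests were issued, so both nodes die
    return {first[1]}      # only the node that shot first survives


def double_fence_rate(max_delay, fence_latency, trials=100_000, seed=1):
    """Estimate how often two independent random delays end with both nodes dead."""
    rng = random.Random(seed)
    both_dead = sum(
        not survivors(rng.uniform(0, max_delay),
                      rng.uniform(0, max_delay),
                      fence_latency)
        for _ in range(trials)
    )
    return both_dead / trials


if __name__ == "__main__":
    # Fixed delay: B waits 5 s before shooting A, A shoots immediately.
    print(survivors(0.0, 5.0, 2.0))
    # Random delay of up to 15 s on both sides, same 2 s latency.
    print(double_fence_rate(15.0, 2.0))
```

Under these assumptions a fixed delay longer than the fence latency always leaves the same node standing, while two independent random delays drawn from [0, M] still collide with probability 1 - (1 - L/M)^2 for latency L, which is roughly 0.25 for L = 2 s and M = 15 s.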

> - code that ships with pacemaker does not implement fixed delay.

Fence agents are used with pacemaker but not shipped as part of it. They
have their own packages distributed separately. Anyone can write a fence
agent and make it available to the community.

It would be nice if every fence agent supported a delay parameter, but
there's no requirement to do so, and even if there were, it would just
be a guideline -- it's up to the developer.

There's certainly an argument to be made for supporting a fixed delay at
the pacemaker level. There's an idea floating around to do this based on
node health, which could allow a lot of flexibility.



Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-10 Thread Dimitri Maziuk

>> i remember that digimer often campaigns for a fence delay in a 2-node
>> cluster.
>> ...
>> But ... a random delay does not seem to
>> be a reliable solution.

> Some fence agents implement a delay parameter of their own, to set a
> fixed delay. I believe that's what digimer uses.

Is it just me or does this sound like catch-22:
- pacemaker does not work reliably without fencing
- fencing in 2-node clusters does not work reliably without fixed delay
- code that ships with pacemaker does not implement fixed delay.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-10 Thread Ken Gaillot
On 05/10/2017 12:20 AM, Kristoffer Grönlund wrote:
> "Lentes, Bernd"  writes:
> 
>> On May 8, 2017, at 9:20 PM, Bernd Lentes
>> bernd.len...@helmholtz-muenchen.de wrote:
>>
>>> Hi,
>>>
>>> I remember that digimer often campaigns for a fence delay in a 2-node
>>> cluster, e.g. here:
>>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>>> In my eyes it makes sense, so I am trying to set that up. I have two
>>> HP servers, each with an ILO card.
>>> I have to use the stonith:external/ipmi agent; the
>>> stonith:external/riloe agent refused to work.
>>>
>>> But I don't have a delay parameter there.
>>> crm ra info stonith:external/ipmi:
>>>
>>> ...
>>> pcmk_delay_max (time, [0s]): Enable random delay for stonith actions
>>> and specify the maximum of random delay
>>>    This prevents double fencing when using slow devices such as sbd.
>>>    Use this to enable random delay for stonith actions and specify
>>>    the maximum of random delay.
>>> ...
>>>
>>> This is the only delay parameter I can use. But a random delay does
>>> not seem to be a reliable solution.
>>>
>>> The stonith:ipmilan agent also provides just a random delay. Same with
>>> the riloe agent.
>>>
>>> How did anyone solve this problem?
>>>
>>> Or do I have to edit the RA (I will get practice in that :-))?
>>>
>>
>> crm ra info stonith:external/ipmi says there exists a parameter
>> pcmk_delay_max.
>> Having a look in /usr/lib64/stonith/plugins/external/ipmi I don't find
>> anything about delay.
>> Also "crm_resource --show-metadata=stonith:external/ipmi" does not say
>> anything about a delay.
>>
>> Is this "pcmk_delay_max" not implemented? From where does "crm ra info
>> stonith:external/ipmi" get this info?
>>
>>
> 
> pcmk_delay_max is implemented by Pacemaker. crmsh gets the information
> about available parameters by querying stonithd directly.
> 
> Cheers,
> Kristoffer

The various pcmk_* parameters are documented in the stonithd(7) man page.

Some fence agents implement a delay parameter of their own, to set a
fixed delay. I believe that's what digimer uses.

>>
>> Bernd



Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-10 Thread Klaus Wenninger
On 05/09/2017 10:34 PM, Attila Megyeri wrote:
>
> Actually I found some more details:
>
> There are two resources: A and B.
>
> Resource B depends on resource A (when the RA monitors B, it will fail
> if A is not running properly).
>
> If I stop resource A, the next monitor operation of „B” will fail.
> Interestingly, this check happens immediately after A is stopped.
>
> B is configured to restart if monitor fails. Start timeout is rather
> long, 180 seconds. So pacemaker tries to restart B, and waits.
>
> If I want to start „A”, nothing happens until the start operation of
> „B” fails – typically several minutes.
>
> Is this the right behavior?
>
> It appears that pacemaker is blocked while resource B is being
> started, and I cannot really start its dependency…
>
> Shouldn’t it be possible to start a resource while another resource is
> also starting?
>

As long as resources don't depend on each other, they should be started
in parallel.

The number of actions executed in parallel is derived from the number of
cores, and when load is detected some kind of throttling kicks in (in
effect, a reduction of the operations executed in parallel, with the aim
of reducing the load induced by pacemaker). When throttling kicks in you
should get log messages (there is in fact a parallel discussion going
on ...). No idea if throttling might be a reason here, but it may be
worth considering at least.

Another reason I've observed for certain things happening with quite
some delay is that some situations are apparently only resolved when the
cluster-recheck-interval triggers a pengine run, in addition to the runs
triggered by changes. You can easily verify this by changing the
cluster-recheck-interval.

Regards,
Klaus

> Thanks,
> Attila
>
> *From:* Attila Megyeri [mailto:amegy...@minerva-soft.com]
> *Sent:* Tuesday, May 9, 2017 9:53 PM
> *To:* users@clusterlabs.org; kgail...@redhat.com
> *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
>
> Hi Ken, all,
>
> We ran into an issue very similar to the one described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
> Pacemaker occasionally takes minutes to respond
>
> But in our case we are not using fencing/stonith at all.
>
> Many times when I want to start/stop/cleanup a resource, it takes tens
> of seconds (or even minutes) till the command gets executed. The logs
> show nothing in that period, the redundant rings show no fault.
>
> Could this be the same issue?
>
> Any hints on how to troubleshoot this?
>
> It is pacemaker 1.1.10, corosync 2.3.3
>
> Cheers,
> Attila
>


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com   

