Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Andrei Borzenkov
18.12.2020 10:09, Ulrich Windl wrote:
> Andrei Borzenkov wrote on 18.12.2020 at 08:01 in
> message:
>> 17.12.2020 21:30, Ken Gaillot wrote:
>>>
>>> This reminded me that some IPMI implementations return "success" for
>>> commands before they've actually been completed. This is why
>>> fence_ipmilan has a "power_wait" parameter that defaults to 2 seconds.
>>>
>>
>> But in this case we also do not know whether the command has been completed
>> successfully or not. I'd say in this case the only safe way is to use
>> poweroff and verify in the stonith agent that the node is actually powered off
>> before returning success.
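
The "verify before returning success" idea above can be sketched roughly as follows. This is only an illustration, not how fence_ipmilan is implemented: `get_power_status` is a hypothetical callable (for example a wrapper around an ipmitool query) that is assumed to return the string "on" or "off", and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable


def wait_until_powered_off(
    get_power_status: Callable[[], str],  # hypothetical: returns "on" or "off"
    timeout: float = 30.0,                # arbitrary illustrative value
    interval: float = 1.0,                # arbitrary illustrative value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the power status; return True once it reads "off",
    or False if the timeout expires first."""
    deadline = clock() + timeout
    while True:
        if get_power_status().strip().lower() == "off":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

An agent following this pattern would report success only when the function returns True, instead of trusting the return code of the power-off command alone.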
> 
> As I wrote in my message, the other node showing that a node has left would be
> an indication that fencing was successful

You got it backwards. Fencing starts when pacemaker gets an indication
that the other node has left.

> IF there was a valid network
> connection up to the fencing event. Thus I think a redundant network is rather
> important. The user should be able to tell whether fencing actually does work;
> maybe not from syslog, but from other indicators.

Completely wrong. Fencing is needed exactly when there is no possibility
to get information about the other node and there is no way to verify
the other node's state using "normal" means.

A redundant network helps to avoid unnecessary fencing; it is not a
replacement for fencing.

> Also if the network outage were simulated by using a node-specific blackhole
> route (blocking just the other node(s)), the node could be queried (for
> example) by a ping from a third node to see whether and when it actually went
> down.
> 

And? How should two isolated pacemaker instances now communicate and
coordinate activity even if there is connectivity via some of the other
networks available on the nodes? Using multiple rings utilizing all
available networks falls under "redundant network".

> Regards,
> Ulrich
> 
>>
>>> The best thing would be to do some manual testing using ipmitool or
>>> whatnot to turn off the power, and observe how long it takes between
>>> when the command returns and the server actually is powered down. Then
>>> set power_wait to a comfortable margin above that. Or just keep raising
>>> power_wait until the problem goes away :)
>>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 


Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Gabriele Bulfon
Would a change of network class on one node be ok?
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ulrich Windl 
To: users@clusterlabs.org 
Date: 17 December 2020 12:26:29 CET
Subject: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon wrote on 17.12.2020 at 09:14 in
message <2080536991.1106.1608192888030@www>:
> I see, but then I have two issues:
> 
> 1. it is a dual node server, the HA interface is internal, I have no way to 
> unplug it, that's why I tried turning it down

You could block traffic using iptables or a "blackhole" route for example.
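
As a rough illustration of the two options just mentioned, the sketch below only builds the command lines and does not run them. It assumes a Linux node with iproute2 and iptables available (which may not match the platform discussed in this thread), and the peer address used in the usage note is a placeholder from the documentation range.

```python
import ipaddress
from typing import List


def _ipv4(peer: str) -> ipaddress.IPv4Address:
    """Validate the peer address; this sketch handles IPv4 only."""
    addr = ipaddress.ip_address(peer)  # raises ValueError if malformed
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return addr


def blackhole_route_cmd(peer: str) -> List[str]:
    """Command that makes this node discard packets it sends to the peer."""
    addr = _ipv4(peer)
    return ["ip", "route", "add", "blackhole", f"{addr}/32"]


def iptables_drop_cmds(peer: str) -> List[List[str]]:
    """Commands that drop traffic from and to the peer on this node."""
    addr = _ipv4(peer)
    return [
        ["iptables", "-A", "INPUT", "-s", str(addr), "-j", "DROP"],
        ["iptables", "-A", "OUTPUT", "-d", str(addr), "-j", "DROP"],
    ]
```

For example, `blackhole_route_cmd("192.0.2.10")` yields the argument list for `ip route add blackhole 192.0.2.10/32`. Either variant blocks only the named peer, so the node stays reachable from other hosts while the cluster link is cut.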

> 
> 2. even in case I could test it by unplugging it, there is still the 
> possibility that someone turns the interface down, causing a bad situation 
> for the zpool...so I would like to understand why xstha2 decided to turn on 
> IP and zpool when stonith of xstha1 was not yet done...

What should HA software do when an admin turns down the interface?
I'm afraid there is no HA software that protects against administrator errors.
It's important to understand that HA software helps against errors from
hardware or software, but not against configuration errors (which an ifdown is).

> 
> 
> --
> 
> From: Ulrich Windl 
> To: users@clusterlabs.org 
> Date: 17 December 2020 7:48:46 CET
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource
> 
> 
> Gabriele Bulfon wrote on 16.12.2020 at 15:56 in
> message <386755316.773.1608130588146@www>:
>> Thanks, here are the logs; they contain information about how it tried to start
>> resources on the nodes.
>> Keep in mind that node1 was already running the resources, and I simulated a
>> problem by turning down the HA interface.
> 
> Please note that "turning down" an interface is NOT a realistic test; 
> realistic would be to unplug the cable.
> 
>> 
>> Gabriele
>> 
>> 
>> --
>> 
>> From: Ulrich Windl 
>> To: users@clusterlabs.org 
>> Date: 16 December 2020 15:45:36 CET
>> Subject: [ClusterLabs] Antw: [EXT] delaying start of a resource
>> 
>> 
>> Gabriele Bulfon wrote on 16.12.2020 at 15:32 in
>> message <1523391015.734.1608129155836@www>:
>>> Hi, I have now a two node cluster using stonith with different 
>>> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
>>> problems.
>>> 
>>> Though, there is still one problem: once node 2 delays its stonith action
>>> for 10 seconds, and node 1 just 1, node 2 does not delay start of
>>> resources, so it happens that while it's not yet powered off by node 1
>>> (and waiting its delay to power off node 1) it actually starts resources,
>>> causing a window of a few seconds where both the NFS IP and the ZFS pool (!)
>>> are mounted by both!
>> 
>> AFAIK pacemaker will not start resources on a node that is scheduled for
>> stonith. Even more: Pacemaker will try to stop resources on a node scheduled
>> for stonith to start them elsewhere.
>> 
>>> How can I delay node 2 resource start until the delayed stonith action is
>>> done? Or how can I just delay the resource start so I can make it larger
>>> than its pcmk_delay_base?
>> 
>> We probably need to see logs and configs to understand.
>> 
>>> 
>>> Also, it was suggested that I set "stonith-enabled=true", but I don't know where
>>> to set this flag (cib-bootstrap-options is not happy with it...).
>> 
>> I think it's on by default, so you must have set it to false.
>> In crm shell it is "configure# property stonith-enabled=...".
>> 
>> Regards,
>> Ulrich
>> 
>> 