Re: [ClusterLabs] Antw: Re: Gracefully stop nodes one by one with disk-less sbd

2019-08-12 Thread Roger Zhou

On 8/12/19 2:48 PM, Ulrich Windl wrote:
> >>> Andrei Borzenkov wrote on 09.08.2019 at 18:40 in
> message <217d10d8-022c-eaf6-28ae-a4f58b2f9...@gmail.com>:
>> On 09.08.2019 16:34, Yan Gao wrote:

[...]

>>
>> Lack of a cluster-wide shutdown mode was mentioned more than once on this
>> list. I guess the only workaround is to use higher-level tools which
>> basically just try to stop the cluster on all nodes at once.

I am thinking of using ssh/pssh to the involved nodes to stop the diskless
SBD daemons. However, SBD cannot be torn down on its own: it is deeply
tied to pacemaker and corosync and has to be stopped together with them,
unless one hacks around the SBD dependencies.
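
Something like the following is the crude workaround mentioned above,
stopping the whole stack on all nodes in parallel (node names are
placeholders; crmsh and pssh are assumed to be available):

  pssh -i -H "node1 node2 node3" "crm cluster stop"

  # or with plain ssh, started concurrently
  for n in node1 node2 node3; do ssh "$n" "crm cluster stop" & done; wait

Of course this is still subject to the race mentioned below.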

>> It is still
>> susceptible to race conditions.
> 
> Are there any concrete plans to implement a clean solution?
> 

I can think of Yet Another Feature to disable diskless SBD on purpose,
e.g. to let SBD understand "stonith-enabled=false" cluster-wide.
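
That is, the property an administrator would already set today when they
really want fencing off (crmsh syntax, shown only to make clear which knob
is meant; diskless SBD currently does not act on it):

  crm configure property stonith-enabled=false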


Cheers,
Roger

Re: [ClusterLabs] Antw: Re: Gracefully stop nodes one by one with disk-less sbd

2019-08-12 Thread Andrei Borzenkov


Sent from my iPhone

On 12 Aug 2019, at 9:48, Ulrich Windl wrote:

> >>> Andrei Borzenkov wrote on 09.08.2019 at 18:40 in
> message <217d10d8-022c-eaf6-28ae-a4f58b2f9...@gmail.com>:
>> On 09.08.2019 16:34, Yan Gao wrote:
>>> Hi,
>>> 
>>> With disk-less sbd, it's fine to stop the cluster services on all the
>>> cluster nodes at the same time.
>>>
>>> But if the nodes are stopped one by one, for example in a 3-node cluster,
>>> after stopping the 2nd node the only remaining node resets itself with:
>>> 
>> 
>> That is sort of documented in the SBD manual page:
>> 
>> --><--
>> However, while the cluster is in such a degraded state, it can
>> neither successfully fence nor be shutdown cleanly (as taking the
>> cluster below the quorum threshold will immediately cause all remaining
>> nodes to self-fence).
>> --><--
>> 
>> SBD in shared-nothing mode is basically always in such a degraded state
>> and cannot tolerate loss of quorum.
> 
> So with a shared device it's different?

Yes, as long as the shared device is accessible.
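
For example, one can quickly verify from each node that the device is
reachable and carries a valid sbd header (the device path is only a
placeholder):

  sbd -d /dev/disk/by-id/<your-sbd-disk> dump
  sbd -d /dev/disk/by-id/<your-sbd-disk> list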


> I was wondering whether
> "no-quorum-policy=freeze" would still work with the recent sbd...
> 

It will with a shared device.
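
A rough sketch of such a setup (SUSE-style paths and agent names; the
device path is again only a placeholder):

  # /etc/sysconfig/sbd
  SBD_DEVICE="/dev/disk/by-id/<your-sbd-disk>"
  SBD_WATCHDOG_DEV="/dev/watchdog"

  # cluster configuration (crmsh syntax)
  crm configure primitive stonith-sbd stonith:external/sbd
  crm configure property stonith-enabled=true
  crm configure property no-quorum-policy=freeze

With that, losing quorum freezes resources rather than triggering
self-fencing, as long as the device stays reachable.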

>> 
>> 
>> 
>>> Aug 09 14:30:20 opensuse150-1 sbd[1079]: pcmk:debug: notify_parent: Not notifying parent: state transient (2)
>>> Aug 09 14:30:20 opensuse150-1 sbd[1080]: cluster:debug: notify_parent: Notifying parent: healthy
>>> Aug 09 14:30:20 opensuse150-1 sbd[1078]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
>>> 
>>> I can think of manipulating quorum with last_man_standing and
>>> potentially also auto_tie_breaker, not to mention that
>>> last_man_standing_window would also be a factor... But is there a better
>>> solution?
>>> 
>> 
>> Lack of a cluster-wide shutdown mode was mentioned more than once on this
>> list. I guess the only workaround is to use higher-level tools which
>> basically just try to stop the cluster on all nodes at once. It is still
>> susceptible to race conditions.
> 
> Are there any concrete plans to implement a clean solution?
> 

[ClusterLabs] Antw: Re: Gracefully stop nodes one by one with disk-less sbd

2019-08-11 Thread Ulrich Windl
>>> Andrei Borzenkov wrote on 09.08.2019 at 18:40 in
message <217d10d8-022c-eaf6-28ae-a4f58b2f9...@gmail.com>:
> On 09.08.2019 16:34, Yan Gao wrote:
>> Hi,
>> 
>> With disk-less sbd, it's fine to stop the cluster services on all the
>> cluster nodes at the same time.
>>
>> But if the nodes are stopped one by one, for example in a 3-node cluster,
>> after stopping the 2nd node the only remaining node resets itself with:
>> 
> 
> That is sort of documented in the SBD manual page:
> 
> --><--
> However, while the cluster is in such a degraded state, it can
> neither successfully fence nor be shutdown cleanly (as taking the
> cluster below the quorum threshold will immediately cause all remaining
> nodes to self-fence).
> --><--
> 
> SBD in shared-nothing mode is basically always in such a degraded state
> and cannot tolerate loss of quorum.

So with a shared device it's different? I was wondering whether
"no-quorum-policy=freeze" would still work with the recent sbd...

> 
> 
> 
>> Aug 09 14:30:20 opensuse150-1 sbd[1079]: pcmk:debug: notify_parent: Not notifying parent: state transient (2)
>> Aug 09 14:30:20 opensuse150-1 sbd[1080]: cluster:debug: notify_parent: Notifying parent: healthy
>> Aug 09 14:30:20 opensuse150-1 sbd[1078]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
>> 
>> I can think of manipulating quorum with last_man_standing and
>> potentially also auto_tie_breaker, not to mention that
>> last_man_standing_window would also be a factor... But is there a better
>> solution?
>> 
> 
> Lack of a cluster-wide shutdown mode was mentioned more than once on this
> list. I guess the only workaround is to use higher-level tools which
> basically just try to stop the cluster on all nodes at once. It is still
> susceptible to race conditions.

Are there any concrete plans to implement a clean solution?
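
For reference, the quorum knobs Yan mentions above live in the quorum
section of corosync.conf; a sketch for a 3-node cluster (the values are
only placeholders, last_man_standing_window is in milliseconds) might look
like:

  quorum {
      provider: corosync_votequorum
      expected_votes: 3
      last_man_standing: 1
      last_man_standing_window: 20000
      auto_tie_breaker: 1
  }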

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/