>>> Reid Wahl <nw...@redhat.com> wrote on 05.03.2021 at 21:22 in message
<capiuu991o08dnavkm9bc8n9bk-+nh9e0_f25o6ddis5wzwg...@mail.gmail.com>:
> On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <kgail...@redhat.com> wrote:
> 
>> On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:
>> > Hi!
>> >
>> > I'm unsure what actually causes a problem I see (a resource was
>> > "detected running" when it actually was not), but I'm sure that a probe
>> > started on cluster node start cannot provide a useful result until
>> > some other resource has been started. AFAIK there is no way to make a
>> > probe obey ordering or colocation constraints, so the only workaround
>> > seems to be a delay. However, I'm unsure whether probes can actually
>> > be delayed.
>> >
>> > Ideas?
>>
>> Ordered probes are a thorny problem that we've never been able to come
>> up with a general solution for. We do order certain probes where we
>> have enough information to know it's safe. The problem is that it is
>> very easy to introduce ordering loops.
>>
>> I don't remember if there are any workarounds.
>>
> 
> Maybe as a workaround:
>   - Add an ocf:pacemaker:attribute resource after-and-with rsc1
>   - Then configure a location rule for rsc2 with resource-discovery=never
> and score=-INFINITY with expression (in pseudocode) "attribute is not set
> to active value"
> 
> I haven't tested this, but it might cause rsc2's probe to wait until rsc1
> is active; a rough sketch follows below.
> 
> And of course, use the usual constraints/rules to ensure rsc2's probe only
> runs on rsc1's node.
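> 
> A minimal sketch of this idea with pcs (untested; rsc1, rsc2, the flag
> resource "rsc1-flag", and the node attribute name "rsc1-active" are all
> hypothetical placeholders):
> 
>     # Flag resource: sets the node attribute "rsc1-active" to 1 while it
>     # is started, and back to 0 when it is stopped
>     pcs resource create rsc1-flag ocf:pacemaker:attribute \
>         name=rsc1-active active_value=1 inactive_value=0
> 
>     # Start the flag only after rsc1, and keep it on the same node
>     pcs constraint order start rsc1 then start rsc1-flag
>     pcs constraint colocation add rsc1-flag with rsc1 INFINITY
> 
>     # Ban rsc2 (and skip its probe, via resource-discovery=never) on
>     # every node where the flag attribute is unset or not 1
>     pcs constraint location rsc2 rule resource-discovery=never \
>         score=-INFINITY not_defined rsc1-active or rsc1-active ne 1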
> 
> 
>> > Apart from that, I wonder whether a probe/monitor return code like
>> > OCF_NOT_READY would make sense if the operation detects that it
>> > cannot return a current status (so both "running" and "stopped" would
>> > be as inadequate as "starting" and "stopping" would be, despite the
>> > fact that the latter two do not exist).
>>
> 
> This seems logically reasonable, independent of any implementation
> complexity and considerations of what we would do with that return code.
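> 
> Purely as an illustration of the proposal (OCF_NOT_READY and its numeric
> value are invented here; no such return code exists in the OCF spec or in
> Pacemaker today), a resource agent's monitor action might use it like
> this:
> 
>     # Hypothetical sketch inside an RA's monitor action (shell)
>     OCF_NOT_READY=10                # invented value, not part of any spec
>     if ! my_dependency_is_up; then  # hypothetical helper function
>         # Neither "running" nor "stopped" can be reported reliably yet
>         exit $OCF_NOT_READY
>     fi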

Thanks for the proposal!
The actual problem I was facing was that the cluster claimed a resource was 
running on two nodes at the same time, when in fact one node had been stopped 
properly (with all of its resources). The bad state in the CIB was most likely 
due to a software bug in Pacemaker, but the probes run when the node was 
restarted did not seem to prevent Pacemaker from performing a really wrong 
"recovery action". My hope was that probes would update the CIB before any 
such misguided action is taken. Maybe it's just another software bug...

Regards,
Ulrich

> 
> 
>> > Regards,
>> > Ulrich
>> --
>> Ken Gaillot <kgail...@redhat.com>
>>
> 
> -- 
> Regards,
> 
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA



