On 11/07/2019 14:16, Ken Gaillot wrote:
> On Thu, 2019-07-11 at 10:39 +0100, lejeczek wrote:
>> On 10/07/2019 15:50, Ken Gaillot wrote:
>>> On Wed, 2019-07-10 at 11:26 +0100, lejeczek wrote:
>>>> hi guys, possibly @devel if they pop in here.
>>>>
>>>> is there, or will there be, a way to make the cluster deal with
>>>> failed resources in such a way that it would try not to give up
>>>> on them?
>>>>
>>>> I understand that as of now the only way is the user's manual
>>>> intervention (under which I'd include any scripted ways outside
>>>> of the cluster) if we need to bring a failed resource back up.
>>>>
>>>> many thanks, L.
>>> Not sure what you mean ... the default behavior is to try
>>> restarting a failed resource up to 1,000,000 times on the same
>>> node, then try starting it on a different node, and not give up
>>> until all nodes have failed to start it.
>>>
>>> This is affected by on-fail, migration-threshold, failure-timeout,
>>> and start-failure-is-fatal.
>>>
>>> If you're talking about a resource that failed because the entire
>>> node failed, then fencing comes into play.
>> Apologies, I was not clear enough when wording my question, I see
>> that now. When I said "make cluster deal with failed resources" I
>> meant a resource which failed in the (whole) cluster, failed on
>> every node.
>>
>> If that happens I see that only my (manual) intervention can make
>> the cluster look at the resource again, and I wonder whether I am
>> just unaware of a way for the cluster to do something by itself,
>> without me, and not give up.
>>
>> My case is: a systemd resource whose success or failure is
>> determined by a mechanism outside of the cluster; it can only
>> start successfully on one single node. When that node reboots, the
>> cluster fails this resource; when the node has rebooted and is up
>> again, the failed resource remains in the failed state.
>>
>> Hopefully I managed to make it a bit clearer this time.
>>
>> Many thanks, L.
> Ah, yes. failure-timeout is the only way to handle that. Keep in mind
> it is not guaranteed to be checked more frequently than the
> cluster-recheck-interval.

Fantastic! Is a shorter "cluster-recheck-interval" tough on the cluster? Is it okay to take it down from the default 15 min?

thanks, L.
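
P.S. To make the interaction of the two settings concrete, here is a rough sketch of the timing Ken describes. It assumes that nothing else triggers a re-evaluation in the meantime, so an expired failure is only noticed at the next recheck. The function name and the 60 s / 2 min values are made up for illustration; only the 15 min default comes from this thread.

```python
from datetime import timedelta


def worst_case_retry_delay(failure_timeout: timedelta,
                           recheck_interval: timedelta) -> timedelta:
    """Rough upper bound on the wait between a failure and the cluster
    noticing that the failure has expired.

    Assumption: no other cluster event causes a re-evaluation sooner,
    so the expiry is only picked up at the next periodic recheck.
    """
    return failure_timeout + recheck_interval


# Hypothetical failure-timeout of 60 s with the default 15 min recheck.
default = worst_case_retry_delay(timedelta(seconds=60), timedelta(minutes=15))

# Same failure-timeout with a hypothetical 2 min recheck interval.
lowered = worst_case_retry_delay(timedelta(seconds=60), timedelta(minutes=2))

print(default)  # 0:16:00
print(lowered)  # 0:03:00
```

So with a short failure-timeout, the recheck interval dominates how long the resource can sit in the failed state, which is why I am asking about lowering it.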
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/