On 05/20/2016 10:40 AM, Adam Spiers wrote:
> Ken Gaillot <kgail...@redhat.com> wrote:
>> Just musing a bit ... on-fail + migration-threshold could have been
>> designed to be more flexible:
>>
>> hard-fail-threshold: When an operation fails this many times, the
>> cluster will consider the failure to be a "hard" failure. Until this
>> many failures, the cluster will try to recover the resource on the
>> same node.
>
> How is this different to migration-threshold, other than in name?
>
>> hard-fail-action: What to do when the operation reaches
>> hard-fail-threshold ("ban" would work like current "restart" i.e. move
>> to another node, and ignore/block/stop/standby/fence would work the
>> same as now)
>
> And I'm not sure I understand how this is different to / more flexible
> than what we can do with on-fail now?
>
>> That would allow fence etc. to be done only after a specified number
>> of retries. Ah, hindsight ...
>
> Isn't that possible now, e.g. with migration-threshold=3 and
> on-fail=fence? I feel like I'm missing something.
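
For reference, the configuration Adam describes would look roughly like
the sketch below. The Dummy resource and all IDs are invented for
illustration; only the two attributes in the comments matter here.

  <primitive id="my-rsc" class="ocf" provider="heartbeat" type="Dummy">
    <meta_attributes id="my-rsc-meta">
      <!-- hoped-for meaning: allow 3 local recovery attempts first -->
      <nvpair id="my-rsc-migration" name="migration-threshold" value="3"/>
    </meta_attributes>
    <operations>
      <!-- actual behavior (explained below): the node is fenced on the
           FIRST monitor failure; migration-threshold is ignored -->
      <op id="my-rsc-monitor" name="monitor" interval="10s"
          on-fail="fence"/>
    </operations>
  </primitive>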
migration-threshold only applies when on-fail=restart. If on-fail=fence
or something else, that action always applies after the first failure.
So hard-fail-threshold would indeed be the same as migration-threshold,
but applied to all actions (and would be renamed, since the resource
won't migrate in the other cases).

>>> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
>>>   fails to restart it, we want to trigger migration of any routers on
>>>   that l3-agent to a healthy l3-agent. Currently we wait for the
>>>   connection between the agent and the neutron server to time out,
>>>   which is unpleasantly slow. This case is more of a requirement than
>>>   an optimization, because we really don't want to migrate routers to
>>>   another node unless we have to, because a) it takes time, and b) is
>>>   disruptive enough that we don't want to have to migrate them back
>>>   soon after if we discover we can successfully recover the unhealthy
>>>   l3-agent.
>>>
>>> - Remove a failed backend from an haproxy-fronted service if
>>>   it can't be restarted.
>>>
>>> - Notify any other service (OpenStack or otherwise) where the failing
>>>   local resource is a backend worker for some central service. I
>>>   guess ceilometer, cinder, mistral etc. are all potential
>>>   examples of this.
>
> Any thoughts on the sanity of these?

Beyond my expertise. But sounds reasonable.
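
To make the hypothetical hard-fail-threshold / hard-fail-action musing
at the top of this thread concrete, the configuration might have looked
something like the sketch below. Neither attribute exists in Pacemaker,
and the resource and IDs are invented; this is purely an illustration
of the idea, not real syntax.

  <primitive id="my-rsc" class="ocf" provider="heartbeat" type="Dummy">
    <operations>
      <!-- HYPOTHETICAL, not valid Pacemaker syntax: retry recovery on
           the same node up to 3 times, then fence -->
      <op id="my-rsc-monitor" name="monitor" interval="10s"
          hard-fail-threshold="3" hard-fail-action="fence"/>
    </operations>
  </primitive>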