Clint Byrum wrote:
Excerpts from Anant Patil's message of 2015-09-30 00:10:52 -0700:
Hi,

One of the remaining items in convergence is detecting and handling engine
(the engine worker) failures; here are my thoughts.

Background: Since the work is distributed among heat engines, heat needs
some means to detect the failure, pick up the tasks from the failed
engine, and re-distribute or re-run them.

One simple way is to poll the DB to detect liveness by checking the
table populated by heat-manage. Each engine records its presence
periodically by updating the current timestamp. All the engines will
have a periodic task for checking the DB for the liveness of the other
engines. Each engine will check the timestamps updated by the other
engines, and if it finds one older than the update period, it treats
that as a failure. When this happens, the remaining engines, as and when
they detect the failure, will try to acquire the locks for the
in-progress resources that were handled by the engine that died, and
will then run those tasks to completion.
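Roughly, a self-contained sketch of that heartbeat scheme (sqlite3 and
the table/column names here are only for illustration; a real
implementation would go through Heat's DB layer):

import sqlite3
import time

HEARTBEAT_PERIOD = 30  # seconds between presence updates

def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS engine_heartbeat (
                        engine_id TEXT PRIMARY KEY,
                        updated_at REAL NOT NULL)""")

def record_presence(conn, engine_id):
    """Each engine periodically upserts its own timestamp."""
    conn.execute(
        "INSERT OR REPLACE INTO engine_heartbeat (engine_id, updated_at) "
        "VALUES (?, ?)", (engine_id, time.time()))
    conn.commit()

def find_dead_engines(conn, grace=2 * HEARTBEAT_PERIOD):
    """Periodic task: any engine whose timestamp is older than the
    heartbeat period (plus a grace margin) is presumed dead."""
    cutoff = time.time() - grace
    rows = conn.execute(
        "SELECT engine_id FROM engine_heartbeat WHERE updated_at < ?",
        (cutoff,)).fetchall()
    return [row[0] for row in rows]

The grace period is deliberately larger than the heartbeat period so
that one missed update from a busy engine is not immediately treated as
a failure; a surviving engine that finds a dead engine_id would then try
to take over that engine's in-progress resource locks.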

Another option is to use a coordination library like the community-owned
tooz (http://docs.openstack.org/developer/tooz/), which supports
distributed locking and leader election. We would use it to elect a
leader among the heat engines, and that leader would be responsible for
running the periodic tasks that check the state of each engine and for
distributing the tasks to other engines when one fails. The advantage,
IMHO, would be simplified heat code. We could also move the timeout task
to the leader, which would run the timeouts for all the stacks and send
the signal to abort an operation when a timeout happens. The downside:
an external resource like ZooKeeper/memcached etc. is needed for leader
election.
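A minimal sketch of the tooz-based election (the group name, backend URL
and callback body are only placeholders):

import time
import uuid

from tooz import coordination

coordinator = coordination.get_coordinator(
    'zookeeper://127.0.0.1:2181',   # the external dependency noted above
    uuid.uuid4().hex.encode())      # this engine's member id
coordinator.start()

group = b'heat-engines'
try:
    coordinator.create_group(group).get()
except coordination.GroupAlreadyExist:
    pass
coordinator.join_group(group).get()

def on_elected_leader(event):
    # Only the elected engine runs the periodic liveness checks, the
    # redistribution of work from failed engines and the stack timeouts.
    print('%s elected leader of %s' % (event.member_id, event.group_id))

coordinator.watch_elected_as_leader(group, on_elected_leader)

while True:
    coordinator.heartbeat()     # keep our membership alive
    coordinator.run_watchers()  # fires the callback if we get elected
    time.sleep(1)

Whichever engine the backend elects would own the liveness checks and
the stack timeouts; the others keep heartbeating, and one of them takes
over if the leader dies.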


It's becoming increasingly clear that OpenStack services in general need
to look at distributed locking primitives. There's a whole spec for that
right now:

https://review.openstack.org/#/c/209661/

As the author of said spec (Chronicles of a DLM) I fully agree that we shouldn't be reinventing this (again, and again). Also as the author of that spec, I'd like to encourage others to get involved in adding their use-cases/stories to it. I have done some initial analysis of projects and documented some of the recreations of DLM-like things in it, and I'm very much open to including others' stories as well. In the end I hope we can pick a DLM (ideally a single one) that has a wide community, is structurally sound, is easily usable & operable, is open, and will help achieve and grow (what I think are) the larger long-term goals (and health) of many openstack projects.

Nicely formatted RST (for the latest uploaded spec) also viewable at:

http://docs-draft.openstack.org/61/209661/22/check/gate-openstack-specs-docs/ced42e7//doc/build/html/specs/chronicles-of-a-dlm.html#chronicles-of-a-distributed-lock-manager


I suggest joining that conversation, and embracing a DLM as the way to
do this.

Also, the leader election should be per-stack, and the leader selection
should be weighted by a consistent hash algorithm so that you get an
even distribution of stacks to workers. You can look at how Ironic
breaks up all of its nodes that way. They're using a lock similar to the
one Heat uses now, so the two projects can collaborate nicely on a real
solution.
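To illustrate, a bare-bones consistent-hash ring in plain Python (not
Ironic's actual hash ring code) that maps stacks to engines; when an
engine joins or leaves, only the stacks that hashed near its points move:

import bisect
import hashlib

class HashRing(object):
    def __init__(self, engines, replicas=100):
        # Several virtual points per engine smooth out the distribution.
        self._ring = sorted(
            (self._hash('%s-%d' % (engine, i)), engine)
            for engine in engines for i in range(replicas))
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_engine(self, stack_id):
        """Return the engine that should lead this stack."""
        idx = bisect.bisect(self._keys, self._hash(stack_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(['engine-1', 'engine-2', 'engine-3'])
print(ring.get_engine('stack-uuid-1234'))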

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
