On Wed, 2015-09-30 at 02:29 -0700, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2015-09-30 00:10:52 -0700:
> > Hi,
> >
> > One of the remaining items in convergence is detecting and handling
> > engine (the engine worker) failures, and here are my thoughts.
> >
> > Background: Since the work is distributed among heat engines, by some
> > means heat needs to detect the failure, pick up the tasks from the
> > failed engine, and redistribute them or run them again.
> >
> > One simple way is to poll the DB to detect liveness by checking the
> > table populated by heat-manage. Each engine records its presence
> > periodically by updating the current timestamp. All the engines
> > will have a periodic task for checking the DB for liveness of other
> > engines. Each engine will check the timestamps updated by other
> > engines, and if it finds one which is older than the periodicity of
> > timestamp updates, then it detects a failure. When this happens, the
> > remaining engines, as and when they detect the failure, will try to
> > acquire the lock for in-progress resources that were handled by the
> > engine which died. They will then run the tasks to completion.
> >
> > Another option is to use a coordination library like the community-owned
> > tooz (http://docs.openstack.org/developer/tooz/), which supports
> > distributed locking and leader election. We would use it to elect a
> > leader among heat engines, which would be responsible for running
> > periodic tasks for checking the state of each engine and distributing
> > the tasks to other engines when one fails. The advantage, IMHO, will be
> > simplified heat code. Also, we can move the timeout task to the leader,
> > which will run the timeout for all the stacks and send a signal to
> > abort the operation when a timeout happens. The downside: an external
> > resource such as ZooKeeper or memcached is needed for leader election.
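To make the DB-polling option concrete, here is a minimal sketch of the staleness check each engine would run in its periodic task. It is only an illustration, not Heat code: the function name, the in-memory dict standing in for the engine table, and the `missed_beats` tolerance are all my assumptions (the proposal above only says "older than the periodicity of timestamp updates").

```python
from datetime import datetime, timedelta


def find_dead_engines(heartbeats, now, update_interval, missed_beats=2):
    """Return the ids of engines whose last heartbeat is too old.

    heartbeats: dict mapping engine id -> datetime of its last update
        (a stand-in for rows read from the engine table).
    update_interval: timedelta between expected heartbeat updates.
    missed_beats: hypothetical tolerance; an engine is declared dead
        only after this many intervals pass without an update.
    """
    deadline = now - update_interval * missed_beats
    return sorted(
        engine_id
        for engine_id, last_seen in heartbeats.items()
        if last_seen < deadline
    )


now = datetime(2015, 9, 30, 12, 0, 0)
heartbeats = {
    "engine-1": now - timedelta(seconds=10),   # recently updated
    "engine-2": now - timedelta(seconds=200),  # stale
}
dead = find_dead_engines(heartbeats, now, timedelta(seconds=60))
```

Allowing more than one missed interval is a design choice to avoid declaring an engine dead because of a single slow DB write. Every surviving engine can run this check independently, so the lock on in-progress resources is still what decides who actually takes over the work.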
> >
>
> It's becoming increasingly clear that OpenStack services in general need
> to look at distributed locking primitives. There's a whole spec for that
> right now:
>
> https://review.openstack.org/#/c/209661/
>
> I suggest joining that conversation, and embracing a DLM as the way to
> do this.
>
> Also, the leader election should be per-stack, and the leader selection
> should be heavily weighted based on a consistent hash algorithm so that
> you get even distribution of stacks to workers. You can look at how
> Ironic breaks up all of the nodes that way. They're using a similar lock
> to the one Heat uses now, so the two projects can collaborate nicely on
> a real solution.
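For readers unfamiliar with the consistent-hash idea, a rough sketch of how stacks could be mapped to engines follows. This is not Ironic's or Heat's implementation; the `HashRing` class, the `replicas` count, and the choice of SHA-256 are assumptions made purely for illustration.

```python
import bisect
import hashlib


def _hash(key):
    # Map a string to a large integer position on the ring.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent hash ring mapping stack ids to engines."""

    def __init__(self, engines, replicas=64):
        # Each engine gets several points on the ring so that stacks
        # spread more evenly across engines.
        self._ring = sorted(
            (_hash("%s-%d" % (engine, i)), engine)
            for engine in engines
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    def owner(self, stack_id):
        if not self._ring:
            raise ValueError("no engines in the ring")
        # The first point clockwise from the stack's hash owns the stack;
        # the modulo wraps around past the last point.
        idx = bisect.bisect(self._keys, _hash(stack_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["engine-1", "engine-2", "engine-3"])
preferred = ring.owner("stack-42")
```

The property that matters for failure handling is that when one engine leaves the ring, only the stacks it owned get a new preferred engine; every other stack keeps its owner. A real deployment would still need a lock or election on top of this, since the ring only expresses a preference.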

It is worth mentioning that there's also an idea of using both Tooz and
a hash ring approach [1].

There was a very long discussion on this list when Cinder faced a
similar problem [2]. It eventually became a discussion on whether we need
a common solution for DLM in OpenStack [3]. In the end, Cinder is
currently trying to achieve active/active (A/A) capabilities by using
compare-and-swap (CAS) DB operations. Detection of failed services is
still under discussion, but the most mature solution to this problem is
described in [4]. It is based on database checks.

Given that many projects are facing similar problems (well, it's not a
surprise that a distributed system is facing the general problems of
distributed systems…), we should certainly discuss how to approach that
class of issues. That's why a cross-project Design Summit session on the
topic was proposed [5] (this one is by harlowja, but I know that Mike
Perez also wanted to propose such a session).

[1] https://review.openstack.org/#/c/195366/
[2] http://lists.openstack.org/pipermail/openstack-dev/2015-July/070683.html
[3] http://lists.openstack.org/pipermail/openstack-dev/2015-August/071262.html
[4] http://gorka.eguileor.com/simpler-road-to-cinder-active-active/
[5] http://odsreg.openstack.org/cfp/details/8

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev