Excerpts from Jay Pipes's message of 2015-12-28 09:45:39 -0800:
> On 12/24/2015 02:30 PM, Clint Byrum wrote:
> > This is entirely philosophical, but we should think about when it is
> > appropriate to adopt which mode of operation.
> >
> > There are basically two ways being discussed:
> >
> > 1) Fail fast.
> > 2) Retry forever.
> >
> > Fail fast pros- Immediate feedback for problems, no zombies to worry
> > about staying dormant and resurrecting because their configs accidentally
> > become right again. Much more determinism. Debugging is much simpler. To
> > summarize, it's up and working, or down and not.
> >
> > Fail fast cons- Ripple effects. If you have a database or network blip
> > while services are starting, you must be aware of all of the downstream
> > dependencies and trigger them to start again, or have automation which
> > retries forever, giving up some of the benefits of fail-fast. Circular
> > dependencies require special workflow to unroll (Service1 aspect A relies
> > on aspect X of service2, service2 aspect X relies on aspect B of service1
> > which would start fine without service2). To summarize: this moves the
> > retry-forever problem to orchestration, and complicates some corner cases.
> >
> > Retry forever pros- Circular dependencies are cake. Blips auto-recover.
> > Bring-up orchestration is simpler (start everything, wait..). To
> > summarize: this makes orchestration simpler.
> >
> > Retry forever cons- Non-determinism. It's impossible to just look at the
> > thing from outside and know if it is ready to do useful work. May
> > actually be hiding intermittent problems, requiring more logging and
> > indicators in general to allow analysis.
> >
> > I honestly think any distributed system needs both.
>
> So do I. I was proposing only that we deal with unrecoverable
> configuration errors on startup in a fail-fast way. I was not proposing
> that we remove the existing functionality that retries requests in the
> occasion where an already-up-and-running scheduler service experiences
> (typically transient) I/O disruptions to a dependent service like the DB
> or MQ.
>
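
For concreteness, the split described above (fail fast on unrecoverable
configuration errors, keep retrying on transient I/O disruptions) could be
sketched roughly as below. This is only an illustration, not code from the
scheduler: ConfigError, TransientIOError, validate_config, call_with_retry
and the option names db_url and mq_url are all made up for the example.

```python
import time


class ConfigError(Exception):
    """Hypothetical: an unrecoverable error; retrying cannot fix a bad config."""


class TransientIOError(Exception):
    """Hypothetical: a possibly recoverable error, e.g. a DB or MQ blip."""


def validate_config(conf):
    # Fail fast: a missing option will not fix itself, so raise immediately
    # and let the service exit at startup.
    missing = [key for key in ("db_url", "mq_url") if not conf.get(key)]
    if missing:
        raise ConfigError("missing options: " + ", ".join(missing))


def call_with_retry(func, max_delay=30.0, sleep=time.sleep):
    # Retry forever on transient errors, with capped exponential backoff.
    # Any other exception (including ConfigError) propagates to the caller.
    delay = 1.0
    while True:
        try:
            return func()
        except TransientIOError:
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

A service would call validate_config(conf) once at startup and wrap only
its DB or MQ calls in call_with_retry. The sleep parameter exists only so
the backoff can be exercised without actually waiting.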

Even during startup, failing fast on remote dependencies complicates
things. There's no dependency resolver for the entire cloud, as Kevin Fox
suggested.

> <snip>
> > That said, the scheduler is, IMO, an _extremely_ complex piece of
> > OpenStack, with up and down stream dependencies on several levels (which
> > is why redesigning it gets debated so often on openstack-dev).
>
> It's actually not all that complex. Or at least, it doesn't need to be :)
>

On this we definitely agree.
