Blind retries [Was: Opaque automatic hook retries from API]
Stuart Bishop writes:

> I find destroy-service/remove-application is particularly problematic, because the doomed units don't know they are being destroyed but rather are informed about departing one relation at a time (which is inherently racy, because the units the doomed service is related to will process their relation-departed hooks almost immediately and stop talking to the doomed service, while the doomed service still thinks it can access their resources while it falls apart one piece at a time).

Yes. I noticed this issue too, and I think it's a valid Juju bug. I'm not sure what the best fix would be, but it probably involves some streamlining of the stop-unit logic (and associated hook sequencing).

[...]

> One of the reasons test suites are currently flaky is that there are race conditions we have no reasonable way of solving, such as a database restarting itself while a hook on another unit is attempting to use it.

In theory this should be rule 0 of programming: handle errors (such as your code failing to talk to a database). This is of course easier said than done, but it's been the case forever.

Blind retries are by no means a silver bullet, because (at least conceptually) there's no way around looking at the actual issue at hand when deciding how to handle it (e.g. retry).

If you are 100% confident that your code is "idempotent" (for some definition that makes sense in your case), a blind retry mechanism might simply mean that your code will take a bit longer to bubble up a failure (for instance because it's stubbornly retrying a failure condition that has no way out).

However, it's often difficult to judge whether some piece of logic is really idempotent (especially if the logic encompasses a lot of moving parts, like a hook run, as opposed to some granular API call). So there's always *some* risk that a blind retry could do something unwanted or even harmful.
If you want to be perfectly safe you should look at the failure at hand and make sure you understand it before doing anything. YMMV re real-world statistics of whether this argument is actually relevant (e.g. "blind retry is good enough for me").

This is by no means an easy topic and it's one of the hard parts of programming, as exemplified by this recent juju-dev thread:

https://lists.ubuntu.com/archives/juju-dev/2016-October/006091.html

It's also an area where some standardization of failure modes in distributed systems would probably help in developing better automation than blind retry, or even some form of AI/learning (the HTTP spec and RESTful architectures were arguably designed with that in mind).

--
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
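As a minimal sketch of the difference between a blind retry and one that looks at the failure first, the Python below retries only failures that are explicitly classified as transient and lets everything else bubble up at once. The names `TransientError`, `PermanentError` and `run_with_retry` are hypothetical and exist only for this illustration; they are not part of Juju or of any charm library.

```python
import time


class TransientError(Exception):
    """A failure that may clear on its own, e.g. a database restarting."""


class PermanentError(Exception):
    """A failure that retrying cannot fix, e.g. rejected credentials."""


def run_with_retry(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() and retry it only when it raises TransientError.

    Returns a (result, attempts) tuple. Any other exception, including
    PermanentError, propagates on the first occurrence. Once max_attempts
    transient failures have happened, the last TransientError is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The caveat from the message above still applies: retrying `operation` is only safe if it is idempotent, and deciding which real failures count as transient is exactly the hard part.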
Re: Opaque automatic hook retries from API
On 6 January 2017 at 01:39, Casey Marshall wrote:

> On Thu, Jan 5, 2017 at 3:33 AM, Adam Collard wrote:
>
>> Hi,
>>
>> The automatic hook retries[0] that landed as part of 2.0 (are documented as) run indefinitely[1] - this causes problems as an API user:
>>
>> Imagine you are driving Juju using the API, and when you perform an operation (e.g. set the configuration of a service, or reboot the unit, or add a relation..) - you want to show the status of that operation.
>>
>> Prior to the automatic retries, you simply perform your operation, and watch the delta streams for the corresponding change to the unit - the success or otherwise of the operation is reflected in the unit agent-status/workload-status pair.
>>
>> Now, with retries, if you see a unit in the error state, you can't accurately reflect the status of the operation, since the unit will undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again. How can one say after receiving the first delta of a unit error if the operation succeeded or failed?
>>
>> With no visibility up front on the retry strategy that Juju will perform (e.g. something representing the exponential backoff and a fixed number of retries before Juju admits defeat) it is impossible to say at any point in the delta stream what the result of a failed-at-least-once operation is.
>
> I think the retry strategy is great -- it leverages the immutability we expect hooks to provide, to deliver a robust result over unreliable substrates -- and all substrates are unreliable where there's internetworking involved!
>
> However I see your point about the retry strategy muddling status. I've noticed this sometimes when watching openstack or k8s bundles "shake out" the errors as they come up. I don't think this is always a charm quality issue, it's maybe because we're trying to show two different things with status?
Errors being 'shaken out' are almost always unhandled race conditions. I find destroy-service/remove-application is particularly problematic, because the doomed units don't know they are being destroyed but rather are informed about departing one relation at a time (which is inherently racy, because the units the doomed service is related to will process their relation-departed hooks almost immediately and stop talking to the doomed service, while the doomed service still thinks it can access their resources while it falls apart one piece at a time).

I'm becoming more and more a believer that we can't reasonably avoid these errors, and that instead maybe we should assume that they will happen and that this is perfectly normal. We can stick to writing nice idempotent handlers, simpler because we can ignore and bubble up failures. Simpler protocols too (e.g. removing all the handshaking the PostgreSQL interface does to try to avoid races with authorization). And going back to Adam's point, have hooks retried a few times with some sort of backoff before even being reported as a failure to the end user.

One of the reasons test suites are currently flaky is that there are race conditions we have no reasonable way of solving, such as a database restarting itself while a hook on another unit is attempting to use it. Even though I currently bootstrap test envs with the retry behaviour off, I'm thinking of changing that.

> What if Juju made a clearer distinction between result-state ("what I'm doing most recently or last attempted to do") vs. goal-state ("what I'm trying to get done") in the status? Would that help?

Isn't the goal state just the failed hook? I would certainly like to see the list of hooks queued to run on each unit though, if that is what you mean (not in the default tabular status, but in the JSON status dump).
>> Can retries be limited to a small number, with a backoff algorithm explicitly documented and stuck to by Juju, with the retry attempt number included in the delta stream?

This sounds like a good idea. The limit could even be dynamic, with a retry attempted every time a unit it is related to successfully runs a hook, until the environment is quiescent.

--
Stuart Bishop
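To make the "idempotent handlers that simply bubble up failures" idea from this message concrete, here is a small Python sketch. It assumes a handler whose whole job is to make sure a given line is present in a configuration file; the function name `ensure_line` and the file it would be used on are hypothetical, not taken from any real charm.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed and False if the line was already
    there, so running the handler again after a retry is harmless. I/O
    errors are deliberately not caught: they bubble up to whatever is
    running the handler, which can then retry or report the failure.
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line if the existing content lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{prefix}{line}\n")
    return True
```

Because the handler checks the current state before acting, a second run with the same arguments leaves the file untouched, which is the property a retry mechanism relies on.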
Re: Opaque automatic hook retries from API
^^ s/immutability/idempotency

On Thu, Jan 5, 2017 at 12:39 PM, Casey Marshall <casey.marsh...@canonical.com> wrote:

> [...]
>
> I think the retry strategy is great -- it leverages the immutability we expect hooks to provide, to deliver a robust result over unreliable substrates -- and all substrates are unreliable where there's internetworking involved!
>
> [...]
Re: Opaque automatic hook retries from API
On Thu, Jan 5, 2017 at 3:33 AM, Adam Collard wrote:

> Hi,
>
> The automatic hook retries[0] that landed as part of 2.0 (are documented as) run indefinitely[1] - this causes problems as an API user:
>
> Imagine you are driving Juju using the API, and when you perform an operation (e.g. set the configuration of a service, or reboot the unit, or add a relation..) - you want to show the status of that operation.
>
> Prior to the automatic retries, you simply perform your operation, and watch the delta streams for the corresponding change to the unit - the success or otherwise of the operation is reflected in the unit agent-status/workload-status pair.
>
> Now, with retries, if you see a unit in the error state, you can't accurately reflect the status of the operation, since the unit will undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again. How can one say after receiving the first delta of a unit error if the operation succeeded or failed?
>
> With no visibility up front on the retry strategy that Juju will perform (e.g. something representing the exponential backoff and a fixed number of retries before Juju admits defeat) it is impossible to say at any point in the delta stream what the result of a failed-at-least-once operation is.

I think the retry strategy is great -- it leverages the immutability we expect hooks to provide, to deliver a robust result over unreliable substrates -- and all substrates are unreliable where there's internetworking involved!

However I see your point about the retry strategy muddling status. I've noticed this sometimes when watching openstack or k8s bundles "shake out" the errors as they come up. I don't think this is always a charm quality issue, it's maybe because we're trying to show two different things with status?

What if Juju made a clearer distinction between result-state ("what I'm doing most recently or last attempted to do") vs. goal-state ("what I'm trying to get done") in the status? Would that help?

> Can retries be limited to a small number, with a backoff algorithm explicitly documented and stuck to by Juju, with the retry attempt number included in the delta stream?
>
> Thanks,
>
> Adam
>
> [0] https://jujucharms.com/docs/2.0/reference-release-notes
> [1] https://jujucharms.com/docs/2.0/models-config#retrying-failed-hooks
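One way to picture the result-state versus goal-state split suggested in this message is a status record that carries both values side by side. The Python below is only a sketch under that assumption: the `UnitStatus` class, its field names and the strings returned by `describe` are all hypothetical and do not reflect Juju's actual status schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnitStatus:
    """Hypothetical status record separating the goal from the last result."""

    goal: str                    # what the unit is trying to get done
    last_result: Optional[str]   # "ok", "error", or None if nothing ran yet


def describe(status: UnitStatus) -> str:
    """Render a one-line summary that keeps goal and last result distinct."""
    if status.last_result == "ok":
        return f"{status.goal}: done"
    if status.last_result == "error":
        return f"{status.goal}: last attempt failed, still trying"
    return f"{status.goal}: not attempted yet"
```

With such a record, a failed attempt no longer has to overwrite what the unit is working towards, so an observer can tell "failed once, still converging" apart from "gave up".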
Opaque automatic hook retries from API
Hi,

The automatic hook retries[0] that landed as part of 2.0 (are documented as) run indefinitely[1] - this causes problems as an API user:

Imagine you are driving Juju using the API, and when you perform an operation (e.g. set the configuration of a service, or reboot the unit, or add a relation..) - you want to show the status of that operation.

Prior to the automatic retries, you simply perform your operation, and watch the delta streams for the corresponding change to the unit - the success or otherwise of the operation is reflected in the unit agent-status/workload-status pair.

Now, with retries, if you see a unit in the error state, you can't accurately reflect the status of the operation, since the unit will undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again. How can one say after receiving the first delta of a unit error if the operation succeeded or failed?

With no visibility up front on the retry strategy that Juju will perform (e.g. something representing the exponential backoff and a fixed number of retries before Juju admits defeat) it is impossible to say at any point in the delta stream what the result of a failed-at-least-once operation is.

Can retries be limited to a small number, with a backoff algorithm explicitly documented and stuck to by Juju, with the retry attempt number included in the delta stream?

Thanks,

Adam

[0] https://jujucharms.com/docs/2.0/reference-release-notes
[1] https://jujucharms.com/docs/2.0/models-config#retrying-failed-hooks
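The proposal above (a small retry limit, a documented backoff, and the attempt number in the delta stream) can be sketched in Python as follows. Everything here is an assumption made for illustration: the default delays, the `status` and `retry-attempt` keys, and the function names are hypothetical and do not describe what Juju actually sends or does.

```python
def backoff_schedule(max_retries, base_seconds=5.0, factor=2.0, cap_seconds=300.0):
    """Return the delay before each retry: exponential growth up to a cap."""
    return [min(base_seconds * factor ** i, cap_seconds) for i in range(max_retries)]


def operation_outcome(delta, max_retries):
    """Classify an operation from one (hypothetical) unit delta.

    `delta` is assumed to be a dict with a "status" key and, for units in
    error, a "retry-attempt" key counting the retries already made.
    """
    if delta["status"] != "error":
        return "succeeded"
    if delta.get("retry-attempt", 0) >= max_retries:
        return "failed"      # the retry budget is spent, so this is final
    return "retrying"        # another attempt is still to come


# With a limit of 4 retries the full schedule is known up front.
print(backoff_schedule(4))
# → [5.0, 10.0, 20.0, 40.0]
```

Given a known limit and the attempt number in each delta, an API client can report "retrying" on the first error and only report failure once the budget is exhausted, which is the determinism the message asks for.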