Blind retries [Was: Opaque automatic hook retries from API]

2017-01-06 Thread Free Ekanayaka
Stuart Bishop  writes:

> I find destroy-service/remove-application is particularly problematic,
> because the doomed units don't know they are being destroyed but rather are
> informed about departing one relation at a time (which is inherently racy,
> because the units the doomed service is related to will process their
> relation-departed hooks almost immediately and stop talking to the doomed
> service, while the doomed service still thinks it can access their
> resources while it falls apart one piece at a time).

Yes. I noticed this issue too, and I think it's a valid Juju bug. I'm
not sure what the best fix would be, but it probably involves some
streamlining of the stop-unit logic (and associated hook sequencing).

[...]
> One of the reasons test suites are currently flaky is that there are
> race conditions we have no reasonable way of solving, such as a
> database restarting itself while a hook on another unit is attempting
> to use it.

In theory this should be rule 0 of programming: handle errors (such as
your code failing to talk to a database). This is of course easier said
than done, but it's been the case forever.

Blind retries are by no means a silver bullet, because (at least
conceptually) there's no way around looking at the actual issue at
hand when deciding how to handle it (e.g. retry).

If you are 100% confident that your code is "idempotent" (for some
definition that makes sense in your case), a blind retry mechanism might
simply mean that your code will take a bit longer to bubble up a failure
(for instance because it's stubbornly retrying a failure condition that
has no way out).

However it's often difficult to judge if some piece of logic is really
idempotent (especially if the logic encompasses a lot of moving parts,
like a hook run, as opposed to some granular API call). So there's
always *some* risk that a blind retry could do something unwanted or
even harmful.

If you want to be perfectly safe you should look at the failure at hand
and make sure you understand it before doing anything.
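
As a rough illustration of the difference between a blind retry and one
that looks at the failure first, here is a minimal Python sketch. The two
exception classes, retry_if_transient() and the flaky operation are all
hypothetical names made up for this example; none of them come from Juju
or from a charm library.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may clear by itself (e.g. a database restarting)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. bad credentials)."""


def retry_if_transient(operation, max_attempts=3, delay=0.0):
    """Call operation(), retrying only failures classified as transient.

    Any other exception propagates immediately instead of being retried blindly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: bubble the failure up
            time.sleep(delay)


calls = {"count": 0}


def flaky_query():
    # Fails twice, as if the database were restarting, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("connection refused")
    return "ok"


result = retry_if_transient(flaky_query)
```

The hard part in practice is the classification itself: deciding which
real failures belong in which class is exactly the "look at the failure at
hand" step, and this sketch simply assumes it has been done.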

YMMV re real-world statistics of whether this argument is actually
relevant (e.g. "blind retry is good enough for me").

This is by no means an easy topic and it's one of the hard parts of
programming, as exemplified by this recent juju-dev thread:

https://lists.ubuntu.com/archives/juju-dev/2016-October/006091.html

It's also an area where some standardization of failure modes in
distributed systems would probably help in developing better
automation than blind retry, or even some form of AI/learning (the HTTP
spec and RESTful architectures were arguably designed with that in
mind).
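
To make the HTTP comparison concrete, here is a small Python sketch of how
standardized semantics let a client decide mechanically whether a retry is
reasonable. The set of idempotent methods follows RFC 7231; the set of
"retryable" status codes is a policy choice assumed for this example, not
something the spec mandates, and should_retry() is a hypothetical helper.

```python
# Methods that RFC 7231 defines as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}

# Status codes that usually signal a temporary condition. Which codes to
# include is an assumption of this sketch, not a rule from the HTTP spec.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def should_retry(method, status):
    """Retry only when the request is idempotent and the failure looks temporary."""
    return method.upper() in IDEMPOTENT_METHODS and status in RETRYABLE_STATUS


decisions = {
    ("GET", 503): should_retry("GET", 503),    # idempotent, temporary failure
    ("POST", 503): should_retry("POST", 503),  # not idempotent
    ("PUT", 404): should_retry("PUT", 404),    # idempotent, but not a temporary failure
}
```

Hook failures have no comparable vocabulary today, which is why a retry
there has to be either blind or hand-written per charm.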

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Opaque automatic hook retries from API

2017-01-06 Thread Stuart Bishop
On 6 January 2017 at 01:39, Casey Marshall 
wrote:

> On Thu, Jan 5, 2017 at 3:33 AM, Adam Collard 
> wrote:
>
>> Hi,
>>
>> The automatic hook retries[0] that landed as part of 2.0 (are documented
>> as) run indefinitely[1] - this causes problems as an API user:
>>
>> Imagine you are driving Juju using the API, and when you perform an
>> operation (e.g. set the configuration of a service, or reboot the unit, or
>> add a relation..) - you want to show the status of that operation.
>>
>> Prior to the automatic retries, you simply perform your operation, and
>> watch the delta streams for the corresponding change to the unit - the
>> success or otherwise of the operation is reflected in the unit
>> agent-status/workload-status pair.
>>
>> Now, with retries, if you see a unit in the error state, you can't
>> accurately reflect the status of the operation, since the unit will
>> undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again.
>> How can one say after receiving the first delta of a unit error if the
>> operation succeeded or failed?
>>
>> With no visibility up front on the retry strategy that Juju will perform
>> (e.g. something representing the exponential backoff and a fixed number of
>> retries before Juju admits defeat) it is impossible to say at any point in
>> the delta stream what the result of a failed-at-least-once operation is.
>>
>
> I think the retry strategy is great -- it leverages the immutability we
> expect hooks to provide, to deliver a robust result over unreliable
> substrates -- and all substrates are unreliable where there's
> internetworking involved!
>
> However I see your point about the retry strategy muddling status. I've
> noticed this sometimes when watching openstack or k8s bundles "shake out"
> the errors as they come up. I don't think this is always a charm quality
> issue, it's maybe because we're trying to show two different things with
> status?
>

Errors being 'shaken out' are almost always unhandled race conditions. I
find destroy-service/remove-application is particularly problematic,
because the doomed units don't know they are being destroyed but rather are
informed about departing one relation at a time (which is inherently racy,
because the units the doomed service is related to will process their
relation-departed hooks almost immediately and stop talking to the doomed
service, while the doomed service still thinks it can access their
resources while it falls apart one piece at a time).

I'm becoming more and more a believer that we can't reasonably avoid these
errors, and instead maybe we should assume that they will happen and it is
perfectly normal. We can stick to writing nice idempotent handlers, which
are simpler because we can ignore failures and let them bubble up. We can
use simpler protocols (e.g. removing all the handshaking the PostgreSQL
interface does to try to avoid races with authorization). And going back to
Adam's point, hooks could be retried a few times with some sort of backoff
before even being reported as a failure to the end user. One of the
reasons test suites are currently
flaky is that there are race conditions we have no reasonable way of
solving, such as a database restarting itself while a hook on another unit
is attempting to use it. Even though I currently bootstrap test envs with
the retry behaviour off, I'm thinking of changing that.
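
For readers unsure what a "nice idempotent handler" looks like, here is a
minimal Python sketch. It is only an illustration: ensure_file() and the
file name are hypothetical and not taken from any charm. The point is that
the handler describes a desired end state rather than a step to perform, so
running it again after a failed or repeated hook is harmless.

```python
import os
import tempfile


def ensure_file(path, content):
    """Bring path to the desired content; return True if anything changed.

    Running this twice has the same effect as running it once.
    """
    try:
        with open(path) as f:
            if f.read() == content:
                return False  # already in the desired state, nothing to do
    except FileNotFoundError:
        pass
    # Write to a temporary file and rename it into place, so a reader never
    # sees a half-written file at path.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, path)
    return True


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "db.conf")
first = ensure_file(target, "host=10.0.0.1\n")   # creates the file
second = ensure_file(target, "host=10.0.0.1\n")  # no-op on a repeated run
```

A handler built from pieces like this can fail halfway, be retried, and
still converge, which is what makes an automatic retry safe for it.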


> What if Juju made a clearer distinction between result-state ("what I'm
> doing most recently or last attempted to do") vs. goal-state ("what I'm
> trying to get done") in the status? Would that help?
>

Isn't the goal state just the failed hook? I would certainly like to see
the list of hooks queued to run on each unit though if that is what you
mean (not in the default tabular status, but in the json status dump).



>> Can retries be limited to a small number, with a backoff algorithm
>> explicitly documented and stuck to by Juju, with the retry attempt number
>> included in the delta stream?
>>
>
This sounds like a good idea. The limit could even be dynamic, with a retry
attempted every time a unit it is related to successfully runs a hook,
until the environment is quiescent.
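
One possible reading of this dynamic limit, sketched in Python. This is not
how Juju schedules hooks; settle(), the hook callables and the unit names
are all hypothetical. A failed unit is retried only when a unit it is
related to succeeds, and the loop stops once nothing is left to run.

```python
def settle(hooks, relations):
    """Retry failed hooks whenever a related unit succeeds, until quiescent.

    hooks: unit name -> callable returning True on success (hypothetical).
    relations: unit name -> set of related unit names.
    Returns the set of units still failing once nothing more can change.
    """
    failed = set()
    to_run = list(hooks)  # every unit runs its hook once to begin with
    while to_run:
        unit = to_run.pop(0)
        if hooks[unit]():
            failed.discard(unit)
            # A success may have unblocked related units: retry the failed ones.
            for other in relations.get(unit, ()):
                if other in failed and other not in to_run:
                    to_run.append(other)
        else:
            failed.add(unit)
    return failed


state = {"db_ready": False}


def db_hook():
    state["db_ready"] = True
    return True


def app_hook():
    # Fails until the database unit has finished its own hook.
    return state["db_ready"]


def broken_hook():
    return False  # a genuine bug: no amount of retrying helps


hooks = {"app/0": app_hook, "db/0": db_hook, "broken/0": broken_hook}
relations = {
    "db/0": {"app/0", "broken/0"},
    "app/0": {"db/0"},
    "broken/0": {"db/0"},
}
still_failing = settle(hooks, relations)
```

In this run the race between app/0 and db/0 resolves without ever being
reported, while broken/0 is reported as soon as the model stops changing.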



-- 
Stuart Bishop 


Re: Opaque automatic hook retries from API

2017-01-05 Thread Casey Marshall
^^ s/immutability/idempotency

On Thu, Jan 5, 2017 at 12:39 PM, Casey Marshall <
casey.marsh...@canonical.com> wrote:

> On Thu, Jan 5, 2017 at 3:33 AM, Adam Collard 
> wrote:
>
>> Hi,
>>
>> The automatic hook retries[0] that landed as part of 2.0 (are documented
>> as) run indefinitely[1] - this causes problems as an API user:
>>
>> Imagine you are driving Juju using the API, and when you perform an
>> operation (e.g. set the configuration of a service, or reboot the unit, or
>> add a relation..) - you want to show the status of that operation.
>>
>> Prior to the automatic retries, you simply perform your operation, and
>> watch the delta streams for the corresponding change to the unit - the
>> success or otherwise of the operation is reflected in the unit
>> agent-status/workload-status pair.
>>
>> Now, with retries, if you see a unit in the error state, you can't
>> accurately reflect the status of the operation, since the unit will
>> undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again.
>> How can one say after receiving the first delta of a unit error if the
>> operation succeeded or failed?
>>
>> With no visibility up front on the retry strategy that Juju will perform
>> (e.g. something representing the exponential backoff and a fixed number of
>> retries before Juju admits defeat) it is impossible to say at any point in
>> the delta stream what the result of a failed-at-least-once operation is.
>>
>
> I think the retry strategy is great -- it leverages the immutability we
> expect hooks to provide, to deliver a robust result over unreliable
> substrates -- and all substrates are unreliable where there's
> internetworking involved!
>
> However I see your point about the retry strategy muddling status. I've
> noticed this sometimes when watching openstack or k8s bundles "shake out"
> the errors as they come up. I don't think this is always a charm quality
> issue, it's maybe because we're trying to show two different things with
> status?
>
>
> What if Juju made a clearer distinction between result-state ("what I'm
> doing most recently or last attempted to do") vs. goal-state ("what I'm
> trying to get done") in the status? Would that help?
>
>
>> Can retries be limited to a small number, with a backoff algorithm
>> explicitly documented and stuck to by Juju, with the retry attempt number
>> included in the delta stream?
>>
>> Thanks,
>>
>> Adam
>>
>> [0] https://jujucharms.com/docs/2.0/reference-release-notes
>> [1] https://jujucharms.com/docs/2.0/models-config#retrying-failed-hooks
>>
>>
>


Re: Opaque automatic hook retries from API

2017-01-05 Thread Casey Marshall
On Thu, Jan 5, 2017 at 3:33 AM, Adam Collard 
wrote:

> Hi,
>
> The automatic hook retries[0] that landed as part of 2.0 (are documented
> as) run indefinitely[1] - this causes problems as an API user:
>
> Imagine you are driving Juju using the API, and when you perform an
> operation (e.g. set the configuration of a service, or reboot the unit, or
> add a relation..) - you want to show the status of that operation.
>
> Prior to the automatic retries, you simply perform your operation, and
> watch the delta streams for the corresponding change to the unit - the
> success or otherwise of the operation is reflected in the unit
> agent-status/workload-status pair.
>
> Now, with retries, if you see a unit in the error state, you can't
> accurately reflect the status of the operation, since the unit will
> undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again.
> How can one say after receiving the first delta of a unit error if the
> operation succeeded or failed?
>
> With no visibility up front on the retry strategy that Juju will perform
> (e.g. something representing the exponential backoff and a fixed number of
> retries before Juju admits defeat) it is impossible to say at any point in
> the delta stream what the result of a failed-at-least-once operation is.
>

I think the retry strategy is great -- it leverages the immutability we
expect hooks to provide, to deliver a robust result over unreliable
substrates -- and all substrates are unreliable where there's
internetworking involved!

However I see your point about the retry strategy muddling status. I've
noticed this sometimes when watching openstack or k8s bundles "shake out"
the errors as they come up. I don't think this is always a charm quality
issue, it's maybe because we're trying to show two different things with
status?


What if Juju made a clearer distinction between result-state ("what I'm
doing most recently or last attempted to do") vs. goal-state ("what I'm
trying to get done") in the status? Would that help?


> Can retries be limited to a small number, with a backoff algorithm
> explicitly documented and stuck to by Juju, with the retry attempt number
> included in the delta stream?
>
> Thanks,
>
> Adam
>
> [0] https://jujucharms.com/docs/2.0/reference-release-notes
> [1] https://jujucharms.com/docs/2.0/models-config#retrying-failed-hooks
>
>


Opaque automatic hook retries from API

2017-01-05 Thread Adam Collard
Hi,

The automatic hook retries[0] that landed as part of 2.0 (are documented
as) run indefinitely[1] - this causes problems as an API user:

Imagine you are driving Juju using the API, and when you perform an
operation (e.g. set the configuration of a service, or reboot the unit, or
add a relation..) - you want to show the status of that operation.

Prior to the automatic retries, you simply perform your operation, and
watch the delta streams for the corresponding change to the unit - the
success or otherwise of the operation is reflected in the unit
agent-status/workload-status pair.

Now, with retries, if you see a unit in the error state, you can't
accurately reflect the status of the operation, since the unit will
undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again.
How can one say after receiving the first delta of a unit error if the
operation succeeded or failed?

With no visibility up front on the retry strategy that Juju will perform
(e.g. something representing the exponential backoff and a fixed number of
retries before Juju admits defeat) it is impossible to say at any point in
the delta stream what the result of a failed-at-least-once operation is.

Can retries be limited to a small number, with a backoff algorithm
explicitly documented and stuck to by Juju, with the retry attempt number
included in the delta stream?
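
To show what this would give an API client, here is a small Python sketch.
Everything in it is assumed for illustration: the delay values, the retry
limit, and the delta fields ("status", "retry-attempt") are hypothetical
and do not describe Juju's actual delta format or backoff behaviour.

```python
def backoff_delays(base=5.0, factor=2.0, max_retries=4):
    """Delay in seconds before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** i for i in range(max_retries)]


def operation_status(delta, max_retries=4):
    """Map a hypothetical unit delta to 'succeeded', 'retrying' or 'failed'.

    'retry-attempt' is the number of retries already made when the delta
    was emitted. For simplicity any non-error status counts as success.
    """
    if delta["status"] != "error":
        return "succeeded"
    if delta["retry-attempt"] < max_retries:
        return "retrying"  # Juju would still try again, so no verdict yet
    return "failed"        # retries exhausted: the operation really failed


schedule = backoff_delays()
deltas = [
    {"status": "error", "retry-attempt": 0},
    {"status": "error", "retry-attempt": 1},
    {"status": "active", "retry-attempt": 2},
]
statuses = [operation_status(d) for d in deltas]
```

With a documented limit and the attempt number in each delta, the first
error can be shown as "retrying" instead of forcing a guess between
success and failure.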

Thanks,

Adam

[0] https://jujucharms.com/docs/2.0/reference-release-notes
[1] https://jujucharms.com/docs/2.0/models-config#retrying-failed-hooks