Automatic retries of hooks

2015-11-26 Thread Bogdan Teleaga

Hello everybody,

This has been a WIP for a while now so maybe some of you have heard
about it.

It all started out with us needing to have hooks retried after a random
reboot, and it evolved into retrying hooks upon any kind of failure.

So as of now, failing hooks will be retried automatically after a
certain time. The wait starts at 20 seconds and doubles after every
failure, up to a maximum of 20 minutes. A small jitter is also added for
a bit of randomness. Using juju resolved will reset this timer and cause
it to restart at the beginning.
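
A minimal sketch of the schedule described above, assuming the delay is
computed from the number of consecutive failures. The function name, the
constants' names and the +/-10% jitter width are illustrative assumptions,
not taken from the Juju code:

```python
import random

MIN_WAIT = 20.0        # seconds, from the description above
MAX_WAIT = 20 * 60.0   # 20 minutes, from the description above
FACTOR = 2.0           # delay doubles after every failure


def retry_delay(failures, jitter=0.1, rng=random):
    """Return the wait in seconds before the next retry.

    `failures` is the number of consecutive failures so far (1 or more).
    `jitter` is the assumed relative spread; 0.1 means +/-10%.
    """
    # Cap the exponent so very long failure streaks cannot overflow.
    exponent = min(max(failures - 1, 0), 16)
    base = min(MIN_WAIT * FACTOR ** exponent, MAX_WAIT)
    return base * (1 + rng.uniform(-jitter, jitter))
```

Under these assumptions the un-jittered delays run 20, 40, 80, ... seconds
and stay at 20 minutes from the seventh consecutive failure onwards.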

I've tested it for a while and it has proven to be relatively robust
in my tests. Probably having a CI test soonish would be recommended.

The wait times were chosen relatively arbitrarily, so if anyone has
comments or ideas about them, I'm open to suggestions. The discussion
should go here: https://github.com/juju/juju/pull/3835, since
apparently I merged the branch with some values I used in testing and
did not change them back to the intended ones.

Regards,
Bogdan
-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Automatic retries of hooks

2016-01-19 Thread James Page
Hi Bogdan

On Thu, Nov 26, 2015 at 1:29 PM, Bogdan Teleaga <
btele...@cloudbasesolutions.com> wrote:

> This has been a WIP for a while now so maybe some of you have heard
> about it.
>
> It all started out with us needing to have hook retried after a random
> reboot and it evolved into retrying hooks upon any kind of failure.
>
> So as of now failing hooks will be retried automatically after a
> certain time. The minimum wait time will be 20 seconds, while the
> maximum will be 20 minutes and it's going to increase with a factor of
> 2 for every failure. Also a small jitter is introduced for a bit of
> randomness. Using juju resolved will overwrite this timer and cause it
> to restart at the beginning.
>
> I've tested it for a while and it has proven to be relatively robust
> in my tests. Probably having a CI test soonish would be recommended.
>
> The waiting amount has been chosen relatively arbitrarily so if anyone
> has comments or ideas for that, I'm open to suggestions. The
> discussion for that should go
> here(https://github.com/juju/juju/pull/3835), since apparently I
> merged the branch with some values I used in testing and did not
> change them back to the intended ones.
>

In the daily deluge of email I managed to miss your post to the list, and
stumbled upon this feature whilst exercising 1.26 alpha3 with some
development work this week and assumed it was a bug:

  https://bugs.launchpad.net/juju-core/+bug/1535711

I think this is a dangerous behaviour to introduce to Juju; a hook error
should be a signal to an end user that something really bad happened, and
that they need to dig in further (preferably with pointers from status
messages); if the function that a hook is performing is retryable, that
needs to be handled in the charm and not by Juju IMHO.

Specifically I was testing some changes to the odl-controller charm; this
feature covered up a race in the charm hook code accessing the API of ODL,
which I failed to notice the first few times I deployed (not paying
attention due to multi-tasking), and then had me scratching my head as to
what was going on when I started to notice the hook failure.

Cheers

James


Re: Automatic retries of hooks

2016-01-19 Thread John Meinel
There are classes of failures that a charm hook itself cannot handle. The
specific one Bogdan was working with is the fact that the machine itself is
getting restarted while the charm is in the middle of processing a hook.
There isn't any way the hook itself can handle that, unless you could raise
a very specific error that indicates you should be retried (so as it
notices it's about to die, it raises the try-me-again error).

Hooks are supposed to be idempotent regardless, aren't they? So while we
paper over transient bugs in them, doesn't it make the system more
resilient overall?

John
=:->



Re: Automatic retries of hooks

2016-01-19 Thread Stuart Bishop
On 20 January 2016 at 13:17, John Meinel wrote:

> There are classes of failures that a charm hook itself cannot handle. The
> specific one Bogdan was working with is the fact that the machine itself is
> getting restarted while the charm is in the middle of processing a hook.
> There isn't any way the hook itself can handle that, unless you could raise
> a very specific error that indicates you should be retried (so as it notices
> its about to die, it raises the try-me-again error).
>
> Hooks are supposed to be idempotent regardless, aren't they? So while we
> paper over transient bugs in them, doesn't it make the system more resilient
> overall?

The new update-status hook could be used to recover, as it is called
automatically at regular intervals. If the reboot really was random,
you would need to clear the error status first. But if it is triggered
by the charm, it is just a case of `reboot(now+30s);
status_set('waiting', 'Waiting for reboot'); sys.exit(0)` and waiting
for the update-status hook to kick in.
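
The one-liner above could be fleshed out roughly as follows. This is only a
sketch: `prepare_reboot` is a hypothetical helper, `status-set` is the Juju
hook tool, and `shutdown -r +1` stands in for "reboot in 30 seconds" because
`shutdown` takes its delay in whole minutes.

```python
import subprocess
import sys


def prepare_reboot(run=subprocess.check_call):
    """Announce the pending reboot, then schedule it.

    `run` is injectable so the sequence can be exercised without a
    real unit; by default the commands are actually executed.
    """
    # Tell the operator why the unit is about to go quiet.
    run(["status-set", "waiting", "Waiting for reboot"])
    # Schedule the reboot a minute out so the hook can finish cleanly.
    run(["shutdown", "-r", "+1"])


def main():
    prepare_reboot()
    # Exit successfully: the hook did its job, so no error is recorded.
    sys.exit(0)
```

A hook script would call `main()`; the next update-status run after the
reboot is then the charm's chance to carry on.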

It happens naturally if you structure your charm to have a single hook
that does everything that needs to be done, rather than trying to
craft individual hooks to deal with specific events.
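
As a rough illustration of the single-hook structure, assume every hook name
is a symlink to one script that calls the same idempotent routine. The
`reconcile` function and the dictionary used for unit state are hypothetical
stand-ins for real install and configuration work:

```python
import os


def reconcile(state):
    """Bring the unit to the desired state; safe to call repeatedly.

    Returns the list of steps that actually had to be performed.
    """
    steps = []
    if not state.get("installed"):
        state["installed"] = True  # stand-in for installing packages
        steps.append("install")
    if state.get("config") != state.get("applied_config"):
        state["applied_config"] = state.get("config")  # stand-in for writing config
        steps.append("configure")
    return steps


def main(argv, state):
    # The hook name is only informational; every hook does the same work.
    hook_name = os.path.basename(argv[0])
    return hook_name, reconcile(state)
```

Whichever hook fires first does all outstanding work, and later hooks
(including update-status) find nothing left to do unless something changed.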



-- 
Stuart Bishop 



Re: Automatic retries of hooks

2016-01-20 Thread William Reade
On Tue, Jan 19, 2016 at 3:14 PM, James Page  wrote:
>
> I think this is a dangerous behaviour to introduce to Juju; a hook error
> should be a signal to an end user that something really bad happened, and
> that they need to dig in further (preferably with points from status
> messages); if the function that a hook is performing is re-tryable, that
> needs to be handled in charm and not by Juju IMHO.
>

There are a few problems with this.

0) The function that a hook is performing *must* be retryable anyway. Hooks
need to be idempotent; we guarantee at-least-once execution, not
at-most-once.

1) As a user, what a hook error means in practice is "retry the hook" (good
thing all those hooks are idempotent...). Most users aren't in a position
to debug their charm if it goes wrong, so their only actual interaction is
basically a thoughtless pavlovian response, the absence of which can leave
an environment needlessly hosed until a human notices it. May as well
automate it for better UX *and* happier outcomes.

2) In any given hook, the ratio of known errors to possible errors is
approximately 0:1 [0]. Those infinitesimally few known errors should indeed
set statuses before failing out (even if you have to look in status history
to see them); but we have to be mindful of the vast majority of cases,
where we have *no idea* what could have gone wrong. And in that case, the
only functional response is to retry -- some unknown errors may be fatal,
but to *assume* they are risks locking up the system on every transient
blip.

3) Finally, now that you have the choice, I'd advise against in-hook
retries: (i) the longer you sit in one hook retrying, the longer all
colocated units are blocked [1]; and (ii) delegating the retries to the
infrastructure lets you write much much cleaner code [2].

Are there any concerns that I've missed?

> Specifically I was testing some changes to the odl-controller charm; this
> feature covered up a race in the charm hook code accessing the API of ODL,
> which I failed to notice the first few times I deployed (not paying
> attention due to multi-tasking), and then had me scratching my head as to
> what was going on when I started to notice the hook failure.
>

You say "covered up a race", I say "automatically resolved the problem for
you" :-).

Cheers
William

[0] this applies to any code really, inside or outside juju, it's not
specific to hooks at all.
[1] and while it may not be *common* I'm pretty sure it'd be *possible* for
a hook to deadlock like this; would prefer not to encourage that.
[2] this is also widely applicable: adding retry logic *within* an
idempotent operation is basically always worse than building independent
operation-retrying infrastructure and reusing that where necessary.
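
A small sketch of what footnote [2] means by independent retry
infrastructure: the operation itself stays free of retry logic, and a
reusable wrapper (here a hypothetical `retry` helper, not anything from
Juju) decides when to run it again.

```python
import time


def retry(operation, delays, sleep=time.sleep):
    """Run an idempotent `operation`, retrying after each delay on failure.

    `delays` is a sequence of waits in seconds; one extra, final attempt
    is made after the last delay and its error, if any, propagates.
    """
    for delay in delays:
        try:
            return operation()
        except Exception:
            # The wrapper does not need to know what went wrong.
            sleep(delay)
    return operation()
```

Because the operation is idempotent, the wrapper can be reused for any of
them without knowing what they do.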


Re: Automatic retries of hooks

2016-01-20 Thread William Reade
On Wed, Jan 20, 2016 at 8:46 AM, Stuart Bishop wrote:

> On 20 January 2016 at 13:17, John Meinel wrote:
>
> > There are classes of failures that a charm hook itself cannot handle. The
> > specific one Bogdan was working with is the fact that the machine itself is
> > getting restarted while the charm is in the middle of processing a hook.
> > There isn't any way the hook itself can handle that, unless you could raise
> > a very specific error that indicates you should be retried (so as it notices
> > its about to die, it raises the try-me-again error).
> >
> > Hooks are supposed to be idempotent regardless, aren't they? So while we
> > paper over transient bugs in them, doesn't it make the system more resilient
> > overall?
>
> The new update-status hook could be used to recover, as it is called
> automatically at regular intervals. If the reboot really was random,
> you would need to clear the error status first. But if it is triggered
> by the charm, it is just a case of 'reboot(now+30s);
> status_set('waiting', 'Waiting for reboot'); sys.exit(0)' and waiting
> for the update-status hook to kick in.
>

If it's triggered by the charm, it should `juju-reboot`, which will bounce
the machine after the hook is committed (or, with `--now`, do so right away
and requeue the executing hook). Regardless, from a charm's perspective,
"random" reboots will happen, as will an arbitrary number of other "random"
failures that really aren't worth a stop-the-line response.
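
For a charm written in Python, calling the tool might look like the sketch
below. `juju-reboot` and its `--now` flag are as described above; the helper
names are made up for illustration.

```python
import subprocess


def reboot_command(immediately=False):
    """Build the juju-reboot invocation.

    Without --now the machine reboots after the current hook completes;
    with --now it reboots right away and the hook is requeued.
    """
    cmd = ["juju-reboot"]
    if immediately:
        cmd.append("--now")
    return cmd


def request_reboot(immediately=False, run=subprocess.check_call):
    # `run` is injectable so the call can be checked outside a hook context.
    run(reboot_command(immediately))
```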

> It happens naturally if you structure your charm to have a single hook
> that does everything that needs to be done, rather than trying to
> craft individual hooks to deal with specific events.
>

Independent of everything else, *this* should be *excellent* advice for
speeding up your deployments. Have you already been writing charms like
this? I'd love to hear your experiences; and, in particular, if you've
noticed any improvement in deployment speed. The theoretically achievable
speedup is vast, but the hook runner wasn't written with this approach in
mind; we might need to make a couple of small tweaks [0] to get the best
out of the approach.

Cheers
William

[0] basically, check for hook existence *before* doing all the context
setup work. It's essentially a no-brainer, but it's not quite trivial to
do, and has just never hit the top of anyone's list.
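
The tweak in [0] amounts to reordering two steps. A sketch of the idea, not
the actual hook runner (which is written in Go); `build_context` and
`execute` are hypothetical callables standing in for the expensive setup and
the hook invocation:

```python
import os


def run_hook_if_present(hooks_dir, name, build_context, execute):
    """Run hook `name` only if the charm implements it.

    Returns False, without building a context, when the hook is missing
    or not executable; otherwise builds the context, runs the hook and
    returns True.
    """
    path = os.path.join(hooks_dir, name)
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        return False  # skip the costly context setup entirely
    context = build_context()
    execute(path, context)
    return True
```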


Re: Automatic retries of hooks

2016-01-20 Thread Martin Packman
Another common error we see in CI is apt mirrors being unhappy, leading
to hook failures. Just retrying later does tend to be the right option
there, though it will often be an hour or two until the archive is in a
usable state again.

On 20/01/2016, William Reade  wrote:
>
> Are there any concerns that I've missed?

Automatic retries make debugging your charm harder, as James found. I
think we want an environment setting to disable this for both testing
and for charm authors.

Martin



Re: Automatic retries of hooks

2016-01-20 Thread roger peppe
On 20 January 2016 at 12:20, Martin Packman wrote:
> Another common error we see in CI is apt mirrors being unhappy leading
> to hook failures. Just retry later does tend to be the right option
> there, though it will often be an hour or two until the archive is in a
> usable state again.
>
> On 20/01/2016, William Reade  wrote:
>>
>> Are there any concerns that I've missed?
>
> Automatic retries make debugging your charm harder, as James found. I
> think we want an environment setting to disable this for both testing
> and for charm authors.

This seems like a good idea.
Also perhaps it wouldn't be so bad if you at least were able
to find some record of the hook failures without delving into the
logs.



Re: Automatic retries of hooks

2016-01-20 Thread John Meinel
> This seems like a good idea.
> Also perhaps it wouldn't be so bad if you at least were able
> to find some record of the hook failures without delving into the
> logs.

If the charm is calling "status-set" then the information would currently
be available in "juju status-history unit/0"

John
=:->


Re: Automatic retries of hooks

2016-01-20 Thread Adam Collard
On Wed, 20 Jan 2016 at 12:48 John Meinel wrote:

>> This seems like a good idea.
>> Also perhaps it wouldn't be so bad if you at least were able
>> to find some record of the hook failures without delving into the
>> logs.
>
> If the charm is calling "status-set" then the information would currently
> be available in "juju status-history unit/0"

Modulo https://bugs.launchpad.net/juju-core/+bug/1530840 being fixed :)


Re: Automatic retries of hooks

2016-01-20 Thread Dean Henrichsmeyer
Hi,

It seems the original point James was making is getting missed. No one is
arguing over the value of being able to retry and/or idempotent hooks. Yes,
you should be able to retry them and yes nothing should break if you run
them over and over.

The point made is that Juju shouldn't be automatically retrying them. The
argument of "no one knows what went wrong so Juju automatically retrying
them is a better experience" doesn't work. The intelligence of the stack in
question, regardless of what it is, goes in the charms. If you start
conflating and mixing up where the intelligence goes then creating,
running, and debugging those distributed systems will be a nightmare.

The magic should only be in Juju's ability to effectively drive the models
and intelligence encoded in the charms. It shouldn't make assumptions about
what that intelligence is or what those models require.

Thanks.

-Dean


Re: Automatic retries of hooks

2016-01-20 Thread Rick Harding
+1. Retries are great, with backoff, when you know why you're doing them:
because experience tells you that certain API requests to clouds, or to
other known failure points, can fail transiently.

Blindly just saying "if at first you don't succeed, go go go" isn't a
better UX. It adds another layer of complexity in debugging, and doesn't
really improve the product. Only the charm author knows enough about what
it's trying to achieve to do intelligent retry.

In this case, if there's something about unexpected reboots of machines,
perhaps there's some specific case where Juju can grow some intelligence and
hint to the charm author what happened. The charm can then react to that
information as it deems necessary.



Re: Automatic retries of hooks

2016-01-20 Thread Charles Butler
I'm pretty sure that we have amenities to reboot the host without
completely skewing the hook execution

https://jujucharms.com/docs/1.25/reference-hook-tools#juju-reboot-[--now]

This should have rebooted the machine after safely closing out of any hook
context the charm was in, and upon reboot it should have resumed from the
next context in queue.  I'm not a huge fan of a charm doing auto hook
retries, for the reasons outlined by Rick, unless it is well understood and
documented behavior.  Just chiming in with my 2 cents


Charles Butler  - Juju Charmer
Come see the future of datacenter orchestration: http://jujucharms.com



Re: Automatic retries of hooks

2016-01-20 Thread Gabriel Samfira
The auto-retry thing was created to overcome situations in which the machine is
rebooted, or crashes, during a hook run (independently of juju). In this case,
the charm would not be able to recover automatically from a transient situation.

This scenario is more evident in Windows workloads, where some features like
Hyper-V require a reboot. This is fine, because you can install the feature
with a -NoReboot flag, and use the juju-reboot --now tool to safely reboot.

After the machine comes back up, Windows still needs to configure the new
feature. While it is configuring the feature, the services start (including
juju), and hook execution starts up (as it's supposed to). The problem is that
as part of the feature configuration, the system needs to reboot one final time
(Windows, right?). This is done automatically by the feature installer,
outside of juju.

This causes the hook to error, but not because of an actual problem. For this
scenario, it's enough to retry once when the unit agent comes up.

A solution might be to make this feature configurable. Something like retry 
profiles:

* periodic (current behavior)
* one-shot (once at agent startup)
* disabled
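
To make the three proposed profiles concrete, here is a hypothetical sketch
of how a retry decision could depend on them. Nothing here exists in Juju;
the enum, the function and its parameters are invented for illustration.

```python
from enum import Enum


class RetryProfile(Enum):
    PERIODIC = "periodic"   # current behavior: keep retrying with backoff
    ONE_SHOT = "one-shot"   # retry once, at agent startup
    DISABLED = "disabled"   # never retry automatically


def should_retry(profile, at_agent_startup, already_retried):
    """Decide whether a failed hook should be retried automatically."""
    if profile is RetryProfile.DISABLED:
        return False
    if profile is RetryProfile.ONE_SHOT:
        # Only the first retry after the unit agent comes up is allowed.
        return at_agent_startup and not already_retried
    return True
```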

Gabriel



Re: Automatic retries of hooks

2016-01-20 Thread Aaron Bentley

On 2016-01-20 10:30 AM, Gabriel Samfira wrote:
> The auto-retry thing was created to overcome situations in which
> the machine is rebooted, or crashes during a hook run
> (independently of juju). In this case, the charm would not be able
> to recover automatically from a transient situation.

If the intent was to handle reboots, couldn't it be written to restart
any pending hooks after a reboot, rather than when the hooks fail?

Even re-running just at agent-startup would be a lot clearer.

Aaron


Re: Automatic retries of hooks

2016-01-20 Thread Gabriel Samfira
On Mi, 2016-01-20 at 10:39 -0500, Aaron Bentley wrote:
> On 2016-01-20 10:30 AM, Gabriel Samfira wrote:
> > The auto-retry thing was created to overcome situations in which
> > the machine is rebooted, or crashes during a hook run
> > (independently of juju). In this case, the charm would not be able
> > to recover automatically from a transient situation.
>
> If the intent was to handle reboots, couldn't it be written to restart
> any pending hooks after a reboot, rather than when the hooks fail?

The original intent was to re-run a hook in case of external
intervention outside of juju. This includes but is not limited to:

* automatic reboots
* OOM situation
* power outage
* killall -9 jujud (chaos monkey/gremlins/postal sysadmin)

This has grown to automatically retry on any failure. While retrying
once at agent startup is enough for *some* needs, it may not be enough
for other charms. I would not remove the current behavior. I would
simply make it configurable in case the current behavior does not suit
everyone. The auto retry on all errors is a safe bet for charms that do
not implement retry logic, and as William stated, retrying an operation
inside a hook will block all other hooks from running.

Just my 2 cents.

> Even re-running just at agent-startup would be a lot clearer.
> 
> Aaron


Re: Automatic retries of hooks

2016-01-20 Thread William Reade
On Wed, Jan 20, 2016 at 2:42 PM, Dean Henrichsmeyer wrote:

> Hi,
>
> It seems the original point James was making is getting missed. No one is
> arguing over the value of being able to retry and/or idempotent hooks.
> Yes, you should be able to retry them and yes nothing should break if you
> run them over and over.
>
> The point made is that Juju shouldn't be automatically retrying them. The
> argument of "no one knows what went wrong so Juju automatically retrying
> them is a better experience" doesn't work. The intelligence of the stack in
> question, regardless of what it is, goes in the charms. If you start
> conflating and mixing up where the intelligence goes then creating,
> running, and debugging those distributed systems will be a nightmare.
>

Hook errors *will* happen, and often for transient reasons. In handling
this, we can choose between "users retry without understanding the details"
and "juju retries without understanding the details" [0]. I'd be happy to
make the behaviour configurable, for the rare cases when the user *does*
understand the details and wants full and detailed control, but I don't
think that's the common case.

> The magic should only be in Juju's ability to effectively drive the models
> and intelligence encoded in the charms. It shouldn't make assumptions about
> what that intelligence is or what those models require.
>

Stopping on hook error can only *prevent* those charms from applying their
intelligence. No more hooks to be run => no more opportunity to react. If a
charm wants to be smart about errors, it needs to detect the errors it
*knows* about, and react to those by setting status; and to move on
*without* failing the hook, thereby giving subsequent hooks an opportunity
to be smart.

Ultimately, it comes down to the fact that there's *always* another error
case you haven't considered. If you depend on the charmer to implement
retries for specific errors, that's essentially a whitelist, and they're
stuck playing whack-a-mole forever [1]. But if the charmer can depend on
external retries, they only have to worry about maintaining a
definitely-fatal blacklist and reporting those conditions in status.
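
As a rough illustration of that split (a sketch only: MissingConfig,
run_hook and the set_status callback are hypothetical names, not Juju or
charm-helpers APIs), the charm keeps a short list of conditions it knows
are fatal, reports those through status and exits cleanly, and lets
everything else fail the hook so that the external retry takes over:

```python
import sys


class MissingConfig(Exception):
    """Hypothetical condition the charm knows a retry cannot fix."""


# The "definitely-fatal blacklist": errors the charm recognises and reports.
KNOWN_FATAL = (MissingConfig,)


def run_hook(hook_body, set_status):
    """Run hook_body and return the exit code the hook should finish with.

    set_status is a callable(state, message). In a real charm it might wrap
    the status-set hook tool; that wiring is an assumption of this sketch.
    """
    try:
        hook_body()
    except KNOWN_FATAL as err:
        # Known problem: tell the operator, but do not fail the hook, so
        # later hooks still get a chance to run and repair things.
        set_status("blocked", str(err))
        return 0
    except Exception as err:
        # Unknown, possibly transient: fail the hook and rely on the
        # external retry instead of guessing.
        print("hook failed: %s" % err, file=sys.stderr)
        return 1
    set_status("active", "ready")
    return 0
```

The charm never enumerates the retryable errors here; it only names the
ones it is sure are not worth retrying.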

Am I making any sense here?

Cheers
William


[0] or "the system stays broken forever", I suppose :).
[1] I imagine the rational approach there is to give up, and start
whitelisting by operation rather than by error; i.e. to accept that most
errors are unknown/transient and should be dumbly retried. And given that,
why should every charmer have to roll their own retries?


Re: Automatic retries of hooks

2016-01-20 Thread Stuart Bishop
On 20 January 2016 at 17:46, William Reade  wrote:

> On Wed, Jan 20, 2016 at 8:46 AM, Stuart Bishop 
> wrote:

>> It happens naturally if you structure your charm to have a single hook
>> that does everything that needs to be done, rather than trying to
>> craft individual hooks to deal with specific events.
>
> Independent of everything else, *this* should be *excellent* advice for
> speeding up your deployments. Have you already been writing charms like
> this? I'd love to hear your experiences; and, in particular, if you've
> noticed any improvement in deployment speed. The theoretically achievable
> speedup is vast, but the hook runner wasn't written with this approach in
> mind; we might need to make a couple of small tweaks [0] to get the best out
> of the approach.

The PostgreSQL charm has now existed in three forms. Traditional,
services framework, and now reactive framework. Using the services
framework, deployment speed was slower than traditional. You ended up
with one very long string of steps, many of which were unnecessary. I
felt it was easier to maintain and understand, but the logs were noisier
and it was slower. The reactive framework is much faster deployment wise than all
other versions, as you can easily have only the necessary steps
triggered for the current state. The execution thread is harder to
follow, since there isn't really one, but it still seems very
maintainable and understandable. There is less code than the other
versions. It does drive you to create separate handlers for each hook,
but the advice is to keep hooks at the absolute bare minimum: adjust the
charm's state based on the event and put all the actual logic in the
state driven handlers.
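
To make the "minimal hooks, state driven handlers" idea concrete, here is
a toy sketch. It is deliberately not the reactive framework's API: the
Charm class, the handler decorator, on_hook and the state names are all
made up for illustration. Handlers declare which states they need, hooks
only translate an event into state, and a dispatcher runs whatever has
become applicable:

```python
class Charm:
    """Toy dispatcher: handlers declare the states they need, not the hook."""

    def __init__(self):
        self.states = set()
        self.handlers = []  # (required states, forbidden states, function)
        self.log = []

    def handler(self, when=(), when_not=()):
        def register(func):
            self.handlers.append((set(when), set(when_not), func))
            return func
        return register

    def dispatch(self):
        """Run each applicable handler once, repeating until nothing new runs."""
        ran = set()
        progress = True
        while progress:
            progress = False
            for required, forbidden, func in self.handlers:
                if func in ran:
                    continue
                if required <= self.states and not (forbidden & self.states):
                    func(self)
                    ran.add(func)
                    progress = True


charm = Charm()


@charm.handler(when_not=["installed"])
def install(c):
    c.log.append("install packages")
    c.states.add("installed")


@charm.handler(when=["installed", "db.connected"], when_not=["configured"])
def configure(c):
    c.log.append("write config")
    c.states.add("configured")


def on_hook(name):
    """Hooks stay minimal: translate the event into state, then dispatch."""
    if name == "db-relation-joined":
        charm.states.add("db.connected")
    charm.dispatch()
```

Running on_hook("config-changed") twice in a row does no extra work the
second time, which is where the deployment speedup comes from: only the
steps needed for the current state are triggered.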


-- 
Stuart Bishop 



Re: Automatic retries of hooks

2016-01-20 Thread William Reade
On Wed, Jan 20, 2016 at 3:22 PM, Rick Harding 
wrote:

> +1 retries are great, with backoff, when you know you're doing it because
> you have experience that certain api requests to clouds, or to other known
> failure points.
>

If you're thinking about it in terms of "known failure points" you already
understand that you need a wide net to catch all the retryable errors that
could come out of a given operation. What makes hook execution different
from any other code that we want to be reliable?

> Blindly just saying "if at first you don't succeed, go go go" isn't a
> better UX. It adds another layer of complexity in debugging, and doesn't
> really improve the product. Only the charm author knows enough about what
> it's trying to achieve to do intelligent retry.
>

Empirically, it seems that the retries caused jamespage's charm to succeed
where it would have failed; and we have happy results from Gabriel's
Windows charms as well. That seems to me to be evidence that the product is
improved...

> In this case, if there's something about unexpected reboots of machines,
> perhaps there's some specific case that Juju can grow some intelligence and
> hint at the charm author what happened. The charm can then react to that
> information as it deems necessary.
>

It's not really about reboots. It's that we can't reliably distinguish
between all the cases that could cause us to record the start of a hook
execution but not its completion -- hook errors, context-flush-failure,
oom-killed-jujud, reboots, plain ol' bugs -- and that most of those don't
deserve a freak-out stop-the-world no-more-hooks reaction [0]. And even
when they *do* represent real problems with the deployment, the right thing
to do is to set status and move on *without* hook error, because a hook error
prevents the unit from reacting to changes and fixing itself when it can.

Helpful?
Cheers
William

[0] and ofc that is not a comprehensive list, there will always be more
ways we might fail -- adding heuristics and special handling for the
various cases will never be perfect, and will just make us less predictable
and less reliable.



> On Wed, Jan 20, 2016 at 8:42 AM Dean Henrichsmeyer 
> wrote:
>
>> Hi,
>>
>> It seems the original point James was making is getting missed. No one is
>> arguing over the value of being able to retry and/or idempotent hooks.
>> Yes, you should be able to retry them and yes nothing should break if you
>> run them over and over.
>>
>> The point made is that Juju shouldn't be automatically retrying them. The
>> argument of "no one knows what went wrong so Juju automatically retrying
>> them is a better experience" doesn't work. The intelligence of the stack in
>> question, regardless of what it is, goes in the charms. If you start
>> conflating and mixing up where the intelligence goes then creating,
>> running, and debugging those distributed systems will be a nightmare.
>>
>> The magic should only be in Juju's ability to effectively drive the
>> models and intelligence encoded in the charms. It shouldn't make
>> assumptions about what that intelligence is or what those models require.
>>
>> Thanks.
>>
>>
>> -Dean


Re: Automatic retries of hooks

2016-01-20 Thread Dean Henrichsmeyer
On Wed, Jan 20, 2016 at 11:41 AM, William Reade  wrote:

> On Wed, Jan 20, 2016 at 3:22 PM, Rick Harding 
> wrote:
>
>> +1 retries are great, with backoff, when you know you're doing it because
>> you have experience that certain api requests to clouds, or to other known
>> failure points.
>>
>
> If you're thinking about it in terms of "known failure points" you already
> understand that you need a wide net to catch all the retryable errors that
> could come out of a given operation. What makes hook execution different
> from any other code that we want to be reliable?
>
>> Blindly just saying "if at first you don't succeed, go go go" isn't a
>> better UX. It adds another layer of complexity in debugging, and doesn't
>> really improve the product. Only the charm author knows enough about what
>> it's trying to achieve to do intelligent retry.
>>
>
> Empirically, it seems that the retries caused jamespage's charm to succeed
> where it would have failed; and we have happy results from Gabriel's
> windows charms as well. That STM to be evidence that the product is
> improved...
>

You realize James was complaining and not celebrating the "success" ? The
fact that we can have a discussion trying to determine whether something is
a bug or a feature indicates a problem.

-D


Re: Automatic retries of hooks

2016-01-20 Thread William Reade
On Wed, Jan 20, 2016 at 8:01 PM, Dean Henrichsmeyer 
wrote:
>
> You realize James was complaining and not celebrating the "success" ? The
> fact that we can have a discussion trying to determine whether something is
> a bug or a feature indicates a problem.
>

Sorry, I didn't intend to disparage his experience; I took it as legitimate
and reasonable surprise at a change we evidently didn't communicate
adequately. But I don't think it's a misfeature; I think it's a necessary
approach, in service of global reliability in challenging environments.

But: if there are times it's inconvenient and not just surprising, we
should surely be able to disable it. Gabriel/Bogdan, would you be able to
address this?

Cheers
William


Re: Automatic retries of hooks

2016-01-20 Thread David Britton
On Wed, Jan 20, 2016 at 09:31:32PM +0100, William Reade wrote:
> But: if there are times it's inconvenient and not just surprising, we
> should surely be able to disable it. Gabriel/Bogdan, would you be able to
> address this?

I'm +1 on the feature if it can be turned off, especially in a dev/test
environment.  It certainly does cross the line onto the inconvenient
side when it can cover over race conditions.

For production cases, like deploying openstack in a customer data
center, a sensible retry limit with backoff makes sense to me, mostly
agreeing with what William has already said.
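
For reference, the schedule announced at the start of this thread (20
second minimum, 20 minute maximum, doubling on each failure, small
jitter) combined with a retry limit could look roughly like the sketch
below. The limit of 10 attempts and the +/-10% jitter are assumptions
chosen for illustration, and retry_delays is a hypothetical helper, not
the code that was merged:

```python
import random

MIN_DELAY = 20.0        # seconds, from the original announcement
MAX_DELAY = 20 * 60.0   # seconds, from the original announcement
FACTOR = 2.0            # delay doubles after every failure
JITTER = 0.1            # assumed: spread each wait by up to +/-10%
MAX_RETRIES = 10        # assumed: the "sensible retry limit"


def retry_delays(rng=random.random):
    """Yield the wait before each retry, stopping after MAX_RETRIES."""
    delay = MIN_DELAY
    for _ in range(MAX_RETRIES):
        spread = delay * JITTER
        # rng() is in [0, 1), so the offset lies in [-spread, +spread).
        yield delay + spread * (2 * rng() - 1)
        delay = min(delay * FACTOR, MAX_DELAY)
```

Ignoring jitter, the waits are 20, 40, 80, 160, 320 and 640 seconds and
then 1200 seconds for every remaining attempt. Note that in this sketch
the jitter is applied after the cap, so an individual wait can land
slightly above 20 minutes.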

As long as it's discoverable what happened and when -- referencing
#1530840, we should be able to satisfy both use cases (dev &
production).

-- 
David Britton 



Re: Automatic retries of hooks

2016-01-20 Thread Nate Finch
To anyone who is not the author/maintainer/deeply knowledgeable about a
charm, a hook error is doom.  There is only one hammer an average user of
Juju can use in that case, and it's juju resolved --retry.  If we can do
that automatically for our users, then it can be a big help to cover up for
an imperfect charm (and no charm is perfect).  The range of transient
things that can go wrong is infinite, and it's unrealistic to assume every
single charm will handle every single problem.  Not to mention that some
problems are simply not possible to handle (OOM, reboot, etc). This can
only make Juju more reliable and more user-friendly.

The knowledge level of people like Dean and James is *far* beyond that of
an average user of Juju.  When they see a hook error, they can go analyze
the logs and status output and figure out whether an error is likely to get
resolved by a retry, and know if the charm should be updated to account for
this case.  But we shouldn't expect average users to have that depth of
knowledge.

Yes, we should absolutely be able to turn off hook retries in the
environment config, specifically for dev, testing, and expert users like
those in our organization.  But by default, they should retry.  To do
otherwise would be a disservice to our users.

On Wed, Jan 20, 2016 at 3:31 PM William Reade 
wrote:

> On Wed, Jan 20, 2016 at 8:01 PM, Dean Henrichsmeyer 
> wrote:
>>
>> You realize James was complaining and not celebrating the "success" ? The
>> fact that we can have a discussion trying to determine whether something is
>> a bug or a feature indicates a problem.
>>
>
> Sorry, I didn't intend to disparage his experience; I took it as
> legitimate and reasonable surprise at a change we evidently didn't
> communicate adequately. But I don't think it's a misfeature; I think it's a
> necessary approach, in service of global reliability in challenging
> environments.
>
> But: if there are times it's inconvenient and not just surprising, we
> should surely be able to disable it. Gabriel/Bogdan, would you be able to
> address this?
>
> Cheers
> William


Re: Automatic retries of hooks

2016-01-20 Thread Gabriel Samfira
On Mi, 2016-01-20 at 21:31 +0100, William Reade wrote:
> On Wed, Jan 20, 2016 at 8:01 PM, Dean Henrichsmeyer <
> d...@canonical.com> wrote:
> > You realize James was complaining and not celebrating the "success"
> > ? The fact that we can have a discussion trying to determine
> > whether something is a bug or a feature indicates a problem.
> > 
> Sorry, I didn't intend to disparage his experience; I took it as
> legitimate and reasonable surprise at a change we evidently didn't
> communicate adequately. But I don't think it's a misfeature; I think
> it's a necessary approach, in service of global reliability in
> challenging environments.
> 
> But: if there are times it's inconvenient and not just surprising, we
> should surely be able to disable it. Gabriel/Bogdan, would you be
> able to address this?

Prioritizing it ASAP. Should be a simple change.

> 
> Cheers
> William


Re: Automatic retries of hooks

2016-01-21 Thread James Page
On Wed, 20 Jan 2016 at 20:31 William Reade 
wrote:

> On Wed, Jan 20, 2016 at 8:01 PM, Dean Henrichsmeyer 
> wrote:
>>
>> You realize James was complaining and not celebrating the "success" ? The
>> fact that we can have a discussion trying to determine whether something is
>> a bug or a feature indicates a problem.
>>
>
> Sorry, I didn't intend to disparage his experience; I took it as
> legitimate and reasonable surprise at a change we evidently didn't
> communicate adequately. But I don't think it's a misfeature; I think it's a
> necessary approach, in service of global reliability in challenging
> environments.
>

You didn't - don't worry!


> But: if there are times it's inconvenient and not just surprising, we
> should surely be able to disable it. Gabriel/Bogdan, would you be able to
> address this?
>

I agree with David's +1 on this feature with the condition that it can be
disabled so that charm authors actually understand the behaviour of the
software they are deploying.

Please let's also ensure the retry limit is sensible - otherwise we might
end up with end-users waiting a loong time to understand that something
is not recoverable, which could be equally damaging.


Re: Automatic retries of hooks

2016-01-21 Thread roger peppe
On 21 January 2016 at 09:51, James Page  wrote:
> On Wed, 20 Jan 2016 at 20:31 William Reade 
> wrote:
>>
>> On Wed, Jan 20, 2016 at 8:01 PM, Dean Henrichsmeyer 
>> wrote:
>>>
>>> You realize James was complaining and not celebrating the "success" ? The
>>> fact that we can have a discussion trying to determine whether something is
>>> a bug or a feature indicates a problem.
>>
>>
>> Sorry, I didn't intend to disparage his experience; I took it as
>> legitimate and reasonable surprise at a change we evidently didn't
>> communicate adequately. But I don't think it's a misfeature; I think it's a
>> necessary approach, in service of global reliability in challenging
>> environments.
>
>
> You didn't - don't worry!
>
>>
>> But: if there are times it's inconvenient and not just surprising, we
>> should surely be able to disable it. Gabriel/Bogdan, would you be able to
>> address this?
>
>
> I agree with David's +1 on this feature with the condition that it can be
> disabled so that charm authors actually understand the behaviour of the
> software they are deploying.
>
> Please let's also ensure the retry limit is sensible - otherwise we might end
> up with end-users waiting a loong time to understand that something is
> not recoverable, which could be equally damaging.

It would perhaps be good if the default status showed that
the hook was being retried. On the other hand, if retries
become common, then it could be the basis of any number
of false-alarm support calls.
