Re: Action health checks

2019-11-07 Thread Tyson Norris
Hi - 
As discussed, I have updated the PR to reflect:
> - for prewarm, use the TCP connection for monitoring outside of the
> activation workflow
> - for warm, handle it as a case of retry, where a request *connection*
> failure (only for /run) is handled by rescheduling back to ContainerPool
> (/init should already be handled by retry for a time period).

Please review and provide any feedback.
https://github.com/apache/openwhisk/pull/4698

Thanks!
Tyson

On 10/30/19, 9:03 AM, "Markus Thömmes"  wrote:

Yes, I used the word "retry" here to mean "reschedule to another
container", just like you would if the healthiness probe failed.

A word of caution: TCP probes can behave strangely in a container
setting - they sometimes accept connections even though nothing is
actually listening inside the container.

Am Mi., 30. Okt. 2019 um 16:34 Uhr schrieb Tyson Norris
:

> I don't think "retry" is the right handling for warm connection failures -
> if a connection cannot be made due to container crash/removal, it won't
> suddenly come back. I would instead treat it as a "reschedule", where the
> failure routes the activation back to ContainerPool, to be scheduled to a
> different container. I'm not sure how distinct we can be on detecting
> contrainer failure vs temporary network issue that may or may not resolve
> on its own, so I would treat them the same, and assume the container is
> gone.
>
> So for this PR, is there any objection to:
> - for prewarm, use the TCP connection for monitoring outside of the
> activation workflow
> - for warm, handle it as a case of retry, where a request *connection*
> failure (only for /run) is handled by rescheduling back to ContainerPool
> (/init should already be handled by retry for a time period).
>
> Thanks!
> Tyson
>
> On 10/30/19, 7:03 AM, "Markus Thömmes"  wrote:
>
> Increasing latency would be my biggest concern here as well. With a
> health ping, we can't even be sure that a container is still healthy
> for the "real request". To guarantee that, I'd still propose to have a
> look at the possible failure modes and implement a retry mechanism on
> them. If you get a "connection refused" error, I'm fairly certain that
> it can be retried without harm. In fact, any error where we can
> guarantee that we haven't actually reached the container can be safely
> retried in the described way.

> Pre-warmed containers indeed are somewhat of a different story. A
> health ping as mentioned here would for sure help there, be it just a
> TCP probe or even a full-fledged /health call. I'd be fine with either
> way in this case as it doesn't affect the critical path.
>
> Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
> :
>
> > By "critical path" you mean the path during action invocation?
> > The current PR only introduces latency on that path for the case of 
a
> > Paused container changing to Running state (once per transition from
> Paused
> > -> Running).
> > In case it isn't clear, this change does not affect any retry (or
> lack of
> > retry) behavior.
> >
> > Thanks
> > Tyson
> >
> > On 10/29/19, 9:38 AM, "Rodric Rabbah"  wrote:
> >
> > as a longer term point to consider, i think the current model of
> > "best effort at most once" was the wrong design point - if we
> > embraced failure and just retried (at least once), then failure at
> > this level would lead to retries which is reasonable.
> >
> > if we added a third health route or introduced a health check,
> > would we increase the critical path?
> >
> > -r
> >
> > On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gro...@us.ibm.com>
> > wrote:
> >
> > > Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> > > > I'm curious to know what other
> > > > folks think about "generic active probing from invoker" vs
> > > > "docker/mesos/k8s specific integrations for reacting to
> > > > container failures"?
> > > >
> > >
> > > From a pure maintenance and testing perspective I think a single
> > > common mechanism would be best if we can do it with acceptable
> > > runtime overhead.
> > >
> > > --dave
> > >
> >
> >
> >
>
>
>




Re: Action health checks

2019-10-30 Thread Markus Thömmes
Yes, I used the word "retry" here to mean "reschedule to another
container", just like you would if the healthiness probe failed.

A word of caution: TCP probes can behave strangely in a container
setting - they sometimes accept connections even though nothing is
actually listening inside the container.

Am Mi., 30. Okt. 2019 um 16:34 Uhr schrieb Tyson Norris
:

> I don't think "retry" is the right handling for warm connection failures -
> if a connection cannot be made due to container crash/removal, it won't
> suddenly come back. I would instead treat it as a "reschedule", where the
> failure routes the activation back to ContainerPool, to be scheduled to a
> different container. I'm not sure how distinct we can be on detecting
> contrainer failure vs temporary network issue that may or may not resolve
> on its own, so I would treat them the same, and assume the container is
> gone.
>
> So for this PR, is there any objection to:
> - for prewarm, use the TCP connection for monitoring outside of the
> activation workflow
> - for warm, handle it as a case of retry, where a request *connection*
> failure (only for /run) is handled by rescheduling back to ContainerPool
> (/init should already be handled by retry for a time period).
>
> Thanks!
> Tyson
>
> On 10/30/19, 7:03 AM, "Markus Thömmes"  wrote:
>
> Increasing latency would be my biggest concern here as well. With a
> health ping, we can't even be sure that a container is still healthy
> for the "real request". To guarantee that, I'd still propose to have a
> look at the possible failure modes and implement a retry mechanism on
> them. If you get a "connection refused" error, I'm fairly certain that
> it can be retried without harm. In fact, any error where we can
> guarantee that we haven't actually reached the container can be safely
> retried in the described way.
>
> Pre-warmed containers indeed are somewhat of a different story. A
> health ping as mentioned here would for sure help there, be it just a
> TCP probe or even a full-fledged /health call. I'd be fine with either
> way in this case as it doesn't affect the critical path.
>
> Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
> :
>
> > By "critical path" you mean the path during action invocation?
> > The current PR only introduces latency on that path for the case of a
> > Paused container changing to Running state (once per transition from
> Paused
> > -> Running).
> > In case it isn't clear, this change does not affect any retry (or
> lack of
> > retry) behavior.
> >
> > Thanks
> > Tyson
> >
> > On 10/29/19, 9:38 AM, "Rodric Rabbah"  wrote:
> >
> > as a longer term point to consider, i think the current model of
> > "best effort at most once" was the wrong design point - if we
> > embraced failure and just retried (at least once), then failure at
> > this level would lead to retries which is reasonable.
> >
> > if we added a third health route or introduced a health check,
> > would we increase the critical path?
> >
> > -r
> >
> > On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gro...@us.ibm.com>
> > wrote:
> >
> > > Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> > > > I'm curious to know what other
> > > > folks think about "generic active probing from invoker" vs
> > > > "docker/mesos/k8s specific integrations for reacting to
> > > > container failures"?
> > > >
> > >
> > > From a pure maintenance and testing perspective I think a single
> > > common mechanism would be best if we can do it with acceptable
> > > runtime overhead.
> > >
> > > --dave
> > >
> >
> >
> >
>
>
>


Re: Action health checks

2019-10-30 Thread Tyson Norris
I don't think "retry" is the right handling for warm connection failures - if a 
connection cannot be made due to container crash/removal, it won't suddenly 
come back. I would instead treat it as a "reschedule", where the failure routes 
the activation back to ContainerPool, to be scheduled to a different container. 
I'm not sure how distinct we can be on detecting contrainer failure vs 
temporary network issue that may or may not resolve on its own, so I would 
treat them the same, and assume the container is gone.

So for this PR, is there any objection to:
- for prewarm, use the TCP connection for monitoring outside of the activation
workflow
- for warm, handle it as a case of retry, where a request *connection* failure
(only for /run) is handled by rescheduling back to ContainerPool (/init should
already be handled by retry for a time period); a rough sketch follows below.
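
Sketch of that warm-path handling (all names here are illustrative, not the
PR's actual API):

    sealed trait ProxyEvent
    final case class RunSucceeded(activationId: String) extends ProxyEvent
    final case class RunConnectionFailed(activationId: String) extends ProxyEvent

    // A connection-level failure on /run means the request never reached the
    // container: assume the container is gone, reschedule, and tear it down.
    def onWarmRunEvent(e: ProxyEvent,
                       rescheduleToPool: String => Unit,
                       destroyContainer: () => Unit): Unit = e match {
      case RunConnectionFailed(id) =>
        rescheduleToPool(id) // ContainerPool routes to another prewarm or cold start
        destroyContainer()
      case RunSucceeded(_) => ()
    }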

Thanks!
Tyson

On 10/30/19, 7:03 AM, "Markus Thömmes"  wrote:

Increasing latency would be my biggest concern here as well. With a health
ping, we can't even be sure that a container is still healthy for the "real
request". To guarantee that, I'd still propose to have a look at the
possible failure modes and implement a retry mechanism on them. If you get
a "connection refused" error, I'm fairly certain that it can be retried
without harm. In fact, any error where we can guarantee that we haven't
actually reached the container can be safely retried in the described way.

Pre-warmed containers indeed are somewhat of a different story. A health
ping as mentioned here would for sure help there, be it just a TCP probe or
even a full-fledged /health call. I'd be fine with either way in this case
as it doesn't affect the critical path.

Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
:

> By "critical path" you mean the path during action invocation?
> The current PR only introduces latency on that path for the case of a
> Paused container changing to Running state (once per transition from
> Paused -> Running).
> In case it isn't clear, this change does not affect any retry (or lack of
> retry) behavior.
>
> Thanks
> Tyson
>
> On 10/29/19, 9:38 AM, "Rodric Rabbah"  wrote:
>
> as a longer term point to consider, i think the current model of "best
> effort at most once" was the wrong design point - if we embraced
> failure and just retried (at least once), then failure at this level
> would lead to retries which is reasonable.
>
> if we added a third health route or introduced a health check, would
> we increase the critical path?
>
> -r
>
> On Tue, Oct 29, 2019 at 12:29 PM David P Grove 
> wrote:
>
> > Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> > > I'm curious to know what other
> > > folks think about "generic active probing from invoker" vs
> > > "docker/mesos/k8s specific integrations for reacting to container
> > > failures"?
> > >
> >
> > From a pure maintenance and testing perspective I think a single
> > common mechanism would be best if we can do it with acceptable
> > runtime overhead.
> >
> > --dave
> >
>
>
>




Re: Action health checks

2019-10-30 Thread Markus Thömmes
Increasing latency would be my biggest concern here as well. With a health
ping, we can't even be sure that a container is still healthy for the "real
request". To guarantee that, I'd still propose to have a look at the
possible failure modes and implement a retry mechanism on them. If you get
a "connection refused" error, I'm fairly certain that it can be retried
without harm. In fact, any error where we can guarantee that we haven't
actually reached the container can be safely retried in the described way.
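
As a sketch of that failure-mode classification (the exception list here is
an assumption and certainly not exhaustive):

    import java.net.{ConnectException, NoRouteToHostException}

    // Errors where we can be sure the request never reached the container,
    // so rescheduling cannot cause a double execution.
    def neverReachedContainer(t: Throwable): Boolean = t match {
      case _: ConnectException       => true  // "connection refused"
      case _: NoRouteToHostException => true
      case _                         => false // e.g. read timeout: the action may have run
    }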

Pre-warmed containers indeed are somewhat of a different story. A health
ping as mentioned here would for sure help there, be it just a TCP probe or
even a full-fledged /health call. I'd be fine with either way in this case
as it doesn't affect the critical path.
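
A minimal sketch of such a TCP probe (tcpPing is a hypothetical helper; the
invoker is assumed to know the container's IP and port):

    import java.net.{InetSocketAddress, Socket}
    import scala.util.Try

    // True if something accepts a TCP connection within timeoutMs.
    def tcpPing(host: String, port: Int, timeoutMs: Int = 100): Boolean =
      Try {
        val s = new Socket()
        try s.connect(new InetSocketAddress(host, port), timeoutMs)
        finally s.close()
      }.isSuccess

Note the caution raised elsewhere in the thread: in some container network
setups a userland proxy may accept the connection even when nothing is
listening inside the container, so a successful connect is a weaker signal
than an HTTP-level check.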

Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
:

> By "critical path" you mean the path during action invocation?
> The current PR only introduces latency on that path for the case of a
> Paused container changing to Running state (once per transition from Paused
> -> Running).
> In case it isn't clear, this change does not affect any retry (or lack of
> retry) behavior.
>
> Thanks
> Tyson
>
> On 10/29/19, 9:38 AM, "Rodric Rabbah"  wrote:
>
> as a longer term point to consider, i think the current model of "best
> effort at most once" was the wrong design point - if we embraced
> failure and just retried (at least once), then failure at this level
> would lead to retries which is reasonable.
>
> if we added a third health route or introduced a health check, would we
> increase the critical path?
>
> -r
>
> On Tue, Oct 29, 2019 at 12:29 PM David P Grove 
> wrote:
>
> > Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> > > I'm curious to know what other
> > > folks think about "generic active probing from invoker" vs "docker/
> > > mesos/k8s specific integrations for reacting to container
> > > failures"?
> > >
> >
> > From a pure maintenance and testing perspective I think a single
> > common mechanism would be best if we can do it with acceptable
> > runtime overhead.
> >
> > --dave
> >
>
>
>


Re: Action health checks

2019-10-29 Thread Tyson Norris
By "critical path" you mean the path during action invocation? 
The current PR only introduces latency on that path for the case of a Paused 
container changing to Running state (once per transition from Paused -> 
Running). 
In case it isn't clear, this change does not affect any retry (or lack of 
retry) behavior.

Thanks
Tyson

On 10/29/19, 9:38 AM, "Rodric Rabbah"  wrote:

as a longer term point to consider, i think the current model of "best
effort at most once" was the wrong design point - if we embraced failure
and just retried (at least once), then failure at this level would lead to
retries which is reasonable.

if we added a third health route or introduced a health check, would we
increase the critical path?

-r

On Tue, Oct 29, 2019 at 12:29 PM David P Grove  wrote:

> Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> > I'm curious to know what other
> > folks think about "generic active probing from invoker" vs "docker/
> > mesos/k8s specific integrations for reacting to container failures"?
> >
>
> From a pure maintenance and testing perspective I think a single common
> mechanism would be best if we can do it with acceptable runtime overhead.
>
> --dave
>




Re: Action health checks

2019-10-29 Thread Rodric Rabbah
as a longer term point to consider, i think the current model of "best
effort at most once" was the wrong design point - if we embraced failure
and just retried (at least once), then failure at this level would lead to
retries which is reasonable.

if we added a third health route or introduced a health check, would we
increase the critical path?

-r

On Tue, Oct 29, 2019 at 12:29 PM David P Grove  wrote:

> Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> > I'm curious to know what other
> > folks think about "generic active probing from invoker" vs "docker/
> > mesos/k8s specific integrations for reacting to container failures"?
> >
>
> From a pure maintenance and testing perspective I think a single common
> mechanism would be best if we can do it with acceptable runtime overhead.
>
> --dave
>


RE: Action health checks

2019-10-29 Thread David P Grove
Tyson Norris  wrote on 10/28/2019 11:17:50 AM:
> I'm curious to know what other
> folks think about "generic active probing from invoker" vs "docker/
> mesos/k8s specific integrations for reacting to container failures"?
>

From a pure maintenance and testing perspective I think a single common
mechanism would be best if we can do it with acceptable runtime overhead.

--dave


Re: Action health checks

2019-10-28 Thread Tyson Norris
Hi Markus - 
The failures are generic and we haven't seen a real cause as of yet, on mesos 
we get an error of "Container exited with status 125". We continue to 
investigate that of course, but containers may die for any number of reasons so 
we should just plan on them dying. We do get an event from mesos already on 
these failures, and I'm sure we can integrate with Kubernetes to react as well, 
but I thought it might be better to make this probing simpler and consistent 
e.g. where DockerContainerFactory can be treated the same way. If nothing else, 
it is certainly easier to test. I'm curious to know what other folks think 
about "generic active probing from invoker" vs "docker/mesos/k8s specific 
integrations for reacting to container failures"?

RE HTTP requests - For prewarm, we cannot add this check there, since e.g. if 
20 prewarms fail for this invoker, a single activation might try each of those 
twenty before getting a working container, which seems like bad behavior 
compared to preemptively validating the container and replacing it outside the 
HTTP workflow for prewarms. For warm containers, it would be more feasible to 
do this but we would need to distinguish "/run after resume" from "/run before 
pause", and provide a special error case for connection failure after resume 
since we cannot treat all warm container failures as retriable - only once 
after resume.  This seemed more complicated than explicitly checking it once 
after resume inside ContainerProxy.  One possible change would be to move the 
checking logic inside either Container or ContainerClient, but I would keep it 
separate from /init and /run, and consider revisiting it if we change the HTTP 
protocol to include some more sophisticated checking via HTTP (add a /health
endpoint, etc.).
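
For illustration, a sketch of that "check once after resume" flow (the types
and names are invented here; tcpPing is the probe sketched elsewhere in the
thread):

    import scala.concurrent.{ExecutionContext, Future}

    final case class ContainerAddr(host: String, port: Int)

    // Illustrative: unpause, then probe once before any /run is sent; a failed
    // probe surfaces as an error the pool treats as "container gone".
    // tcpPing(host, port) is the TCP probe from the earlier sketch.
    def resumeAndVerify(resume: () => Future[Unit], addr: ContainerAddr)
                       (implicit ec: ExecutionContext): Future[Unit] =
      resume().flatMap { _ =>
        if (tcpPing(addr.host, addr.port)) Future.unit
        else Future.failed(new IllegalStateException(
          s"container at $addr unresponsive after resume"))
      }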

Thanks
Tyson


On 10/28/19, 2:21 AM, "Markus Thömmes"  wrote:

Heya,

thanks for the elaborate proposal.

Do you have any more information on why these containers are dying off in
the first place? In the case of Kubernetes/Mesos I could imagine we might
want to keep the Invoker's state consistent by checking it against the
respective API repeatedly. On Kubernetes for instance, you could setup an
informer that'd inform you about any state changes on the pods that this
Invoker has spawned. If a prewarm container dies this way, we can simply
remove it from the Invoker's bookkeeping and trigger a backfill.

Secondly, could we potentially fold this check into the HTTP requests
themselves? If we get a "connection refused" on an action that we knew
worked before, we can safely retry. There should be a set of exceptions
that our HTTP clients should surface that should be safe for us to retry in
the invoker anyway. The only addition you'd need in this case is an
enhancement on the ContainerProxy's state machine I believe, that allows
for such a retrying use-case. The "connection refused" use-case I mentioned
should be equivalent to the TCP probe you're doing now.

WDYT?

Cheers,
Markus

Am So., 27. Okt. 2019 um 02:56 Uhr schrieb Tyson Norris
:

> Hi Whiskers –
> We periodically have an unfortunate problem where a docker container (or
> worse, many of them) dies off unexpectedly, outside of HTTP usage from
> invoker. In these cases, prewarm or warm containers may still have
> references at the Invoker, and eventually if an activation arrives that
> matches those container references, the HTTP workflow starts and fails
> immediately since the node is not listening anymore, resulting in failed
> activations. Or, an even worse situation can occur when a container failed
> earlier, and a new container, initialized with a different action, is
> started on the same host and port (more likely a problem for k8s/mesos
> cluster usage).
>
> To mitigate these issues, I put together a health check process [1] from
> invoker to action containers, where we can test
>
>   *   prewarm containers periodically to verify they are still
> operational, and
>   *   warm containers immediately after resuming them (before HTTP
> requests are sent)
> In case of prewarm failure, we should backfill the prewarms to the
> specified config count.
> In case of warm failure, the activation is rescheduled to ContainerPool,
> which typically would either route to a different prewarm, or start a new
> cold container.
>
> The test ping is in the form of tcp connection only, since we otherwise
> need to update the HTTP protocol implemented by all runtimes. This test is
> good enough for the worst case of “container has gone missing”, but cannot
> test for more subtle problems like “/run endpoint is broken”. There could
> be other checks to increase the quality of test we add in the future, but
> most of this I think requires expanding the HTTP protocol and state managed
> at the container, and I wanted to get something working for basic
> functionality to start with.
>
> Let me know if you have opinions about this, and we can discuss here or
> in the PR.
> Thanks
> Tyson
>
> [1] https://github.com/apache/openwhisk/pull/4698
>

Re: Action health checks

2019-10-28 Thread Markus Thömmes
Heya,

thanks for the elaborate proposal.

Do you have any more information on why these containers are dying off in
the first place? In the case of Kubernetes/Mesos I could imagine we might
want to keep the Invoker's state consistent by checking it against the
respective API repeatedly. On Kubernetes for instance, you could setup an
informer that'd inform you about any state changes on the pods that this
Invoker has spawned. If a prewarm container dies this way, we can simply
remove it from the Invoker's bookkeeping and trigger a backfill.
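
For instance, assuming the fabric8 Java client (signatures vary across client
versions, and the label and callback here are hypothetical), such an informer
could look roughly like:

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.Watcher.Action
    import io.fabric8.kubernetes.client.{DefaultKubernetesClient, Watcher, WatcherException}

    val client = new DefaultKubernetesClient()
    client
      .pods()
      .inNamespace("openwhisk")
      .withLabel("invoker", "invoker0") // assumed label on pods this invoker spawned
      .watch(new Watcher[Pod] {
        override def eventReceived(action: Action, pod: Pod): Unit =
          if (action == Action.DELETED || pod.getStatus.getPhase == "Failed") {
            // hypothetical hook: drop the container from the invoker's
            // bookkeeping and start a replacement prewarm
            println(s"pod ${pod.getMetadata.getName} gone; triggering backfill")
          }
        override def onClose(cause: WatcherException): Unit = ()
      })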

Secondly, could we potentially fold this check into the HTTP requests
themselves? If we get a "connection refused" on an action that we knew
worked before, we can safely retry. There should be a set of exceptions
that our HTTP clients should surface that should be safe for us to retry in
the invoker anyway. The only addition you'd need in this case is an
enhancement on the ContainerProxy's state machine I believe, that allows
for such a retrying use-case. The "connection refused" use-case I mentioned
should be equivalent to the TCP probe you're doing now.

WDYT?

Cheers,
Markus

Am So., 27. Okt. 2019 um 02:56 Uhr schrieb Tyson Norris
:

> Hi Whiskers –
> We periodically have an unfortunate problem where a docker container (or
> worse, many of them) dies off unexpectedly, outside of HTTP usage from
> invoker. In these cases, prewarm or warm containers may still have
> references at the Invoker, and eventually if an activation arrives that
> matches those container references, the HTTP workflow starts and fails
> immediately since the node is not listening anymore, resulting in failed
> activations. Or, an even worse situation can occur when a container failed
> earlier, and a new container, initialized with a different action, is
> started on the same host and port (more likely a problem for k8s/mesos
> cluster usage).
>
> To mitigate these issues, I put together a health check process [1] from
> invoker to action containers, where we can test
>
>   *   prewarm containers periodically to verify they are still
> operational, and
>   *   warm containers immediately after resuming them (before HTTP
> requests are sent)
> In case of prewarm failure, we should backfill the prewarms to the
> specified config count.
> In case of warm failure, the activation is rescheduled to ContainerPool,
> which typically would either route to a different prewarm, or start a new
> cold container.
>
> The test ping is in the form of tcp connection only, since we otherwise
> need to update the HTTP protocol implemented by all runtimes. This test is
> good enough for the worst case of “container has gone missing”, but cannot
> test for more subtle problems like “/run endpoint is broken”. There could
> be other checks to increase the quality of test we add in the future, but
> most of this I think requires expanding the HTTP protocol and state managed
> at the container, and I wanted to get something working for basic
> functionality to start with.
>
> Let me know if you have opinions about this, and we can discuss here or
> in the PR.
> Thanks
> Tyson
>
> [1] https://github.com/apache/openwhisk/pull/4698
>
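
A compact sketch of the proposal's periodic prewarm check plus backfill loop
(the pool bookkeeping and backfill hook are assumptions, not the PR's
implementation; tcpPing is the probe sketched earlier in the thread):

    import java.util.concurrent.{Executors, TimeUnit}
    import scala.collection.concurrent.TrieMap

    final case class Prewarm(id: String, host: String, port: Int)

    val prewarms = TrieMap.empty[String, Prewarm] // invoker-side bookkeeping
    val configuredCount = 2

    def backfill(n: Int): Unit =
      (1 to n).foreach(_ => () /* start a replacement prewarm container */)

    // Periodically drop unresponsive prewarms and refill to the configured count.
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() => {
      val dead = prewarms.values.filterNot(p => tcpPing(p.host, p.port))
      dead.foreach(p => prewarms.remove(p.id))
      backfill(configuredCount - prewarms.size)
    }, 10, 10, TimeUnit.SECONDS)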