Re: Action health checks

Tyson Norris Wed, 30 Oct 2019 08:35:09 -0700

I don't think "retry" is the right handling for warm connection failures - if a 
connection cannot be made due to container crash/removal, it won't suddenly 
come back. I would instead treat it as a "reschedule", where the failure routes 
the activation back to ContainerPool, to be scheduled to a different container. 
I'm not sure how reliably we can distinguish container failure from a 
temporary network issue that may or may not resolve on its own, so I would 
treat them the same, and assume the container is gone.
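
To illustrate the difference between "retry" and "reschedule", here is a 
minimal Python sketch. It is not OpenWhisk code: Container, ContainerPool, 
ContainerGone and max_attempts are hypothetical names, and the /run call is 
only simulated. On a connection failure the container is assumed gone and the 
activation is handed to a different container rather than retried on the same 
one.

```python
class ContainerGone(Exception):
    """Raised when a connection to the container could not be established."""


class Container:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def run(self, activation):
        # Hypothetical stand-in for a POST to /run. A crashed or removed
        # container fails at connect time, before any action code executes.
        if not self.alive:
            raise ContainerGone(self.name)
        return f"{activation} ran on {self.name}"


class ContainerPool:
    def __init__(self, containers):
        self.free = list(containers)

    def schedule(self, activation, max_attempts=3):
        """Run the activation, rescheduling on connection failure."""
        for _ in range(max_attempts):
            if not self.free:
                break
            container = self.free.pop(0)
            try:
                result = container.run(activation)
            except ContainerGone:
                # Assume the container is gone for good: drop it from the
                # pool and reschedule onto a different container.
                continue
            self.free.append(container)  # still healthy, keep it warm
            return result
        raise RuntimeError("no healthy container available")
```

The failed container is never returned to the pool, which reflects the 
assumption that a connection failure means the container will not come back.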


So for this PR, is there any objection to:
- for prewarm, use the TCP connection for monitoring outside of the 
activation workflow
- for warm, handle it as a special case of retry: a request *connection* 
failure, for /run only, is handled by rescheduling the activation back to 
ContainerPool (/init should already be handled by retry for a time period).
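
For the prewarm case, a TCP-level check could look roughly like the Python 
sketch below. This is an assumption-laden illustration, not the PR's 
implementation: tcp_probe and prune_prewarm are hypothetical helpers, and 
prewarm containers are represented simply as (host, port) pairs.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def prune_prewarm(prewarm, probe=tcp_probe):
    """Keep only the prewarm (host, port) entries that still accept connections."""
    return [addr for addr in prewarm if probe(*addr)]
```

Because this runs outside the activation workflow (for example on a timer), 
it adds no latency to action invocation.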

Thanks!
Tyson

On 10/30/19, 7:03 AM, "Markus Thömmes" <[email protected]> wrote:

    Increasing latency would be my biggest concern here as well. With a health
    ping, we can't even be sure that a container is still healthy for the "real
    request". To guarantee that, I'd still propose to have a look at the
    possible failure modes and implement a retry mechanism on them. If you get
    a "connection refused" error, I'm fairly certain that it can be retried
    without harm. In fact, any error where we can guarantee that we haven't
    actually reached the container can be safely retried in the described way.
    
    Pre-warmed containers indeed are somewhat of a different story. A health
    ping as mentioned here would for sure help there, be it just a TCP probe or
    even a full-fledged /health call. I'd be fine with either way in this case
    as it doesn't affect the critical path.
    
    On Tue, Oct 29, 2019 at 6:00 PM, Tyson Norris
    <[email protected]> wrote:
    
    > By "critical path" you mean the path during action invocation?
    > The current PR only introduces latency on that path for the case of a
    > Paused container changing to Running state (once per transition from
    > Paused -> Running).
    > In case it isn't clear, this change does not affect any retry (or lack of
    > retry) behavior.
    >
    > Thanks
    > Tyson
    >
    > On 10/29/19, 9:38 AM, "Rodric Rabbah" <[email protected]> wrote:
    >
    >     as a longer term point to consider, i think the current model of "best
    >     effort at most once" was the wrong design point - if we embraced
    >     failure and just retried (at least once), then failure at this level
    >     would lead to retries which is reasonable.
    >
    >     if we added a third health route or introduced a health check, would
    >     we increase the critical path?
    >
    >     -r
    >
    >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <[email protected]> wrote:
    >
    >     > Tyson Norris <[email protected]> wrote on 10/28/2019 11:17:50 AM:
    >     > > I'm curious to know what other
    >     > > folks think about "generic active probing from invoker" vs
    >     > > "docker/mesos/k8s specific integrations for reacting to container
    >     > > failures"?
    >     > >
    >     >
    >     > From a pure maintenance and testing perspective I think a single
    >     > common mechanism would be best if we can do it with acceptable
    >     > runtime overhead.
    >     >
    >     > --dave
    >     >
    >
    >
    >
