Yes, I used the word "retry" here to mean "reschedule to another
container", just as you would if a health probe failed.
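For illustration, that "retry as reschedule" handling could be sketched roughly as follows. All names here (ConnectionFailed, pool.acquire/remove, container.run) are hypothetical stand-ins, not the actual ContainerPool API:

```python
# Rough sketch of "retry" meaning "reschedule to another container".
# ConnectionFailed, Pool.acquire/remove, and Container.run are
# hypothetical illustrations, not real OpenWhisk code.

class ConnectionFailed(Exception):
    """Raised when a connection to a warm container cannot be made."""

def invoke(pool, activation):
    """Run an activation; on connection failure, discard the container
    and reschedule the activation to a different one."""
    while True:
        container = pool.acquire()
        try:
            return container.run(activation)
        except ConnectionFailed:
            # Assume the container is gone; never retry against it.
            pool.remove(container)
```

The point of the sketch: the failed container is removed rather than retried in place, and the activation simply goes back through scheduling.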

A word of caution: TCP probes can behave strangely in a container
setting. They sometimes accept connections even though nothing is
listening inside the container (for example, a userland proxy may accept
the connection on the container's behalf).
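One way around that failure mode is to require an actual response instead of trusting connect() alone. A minimal sketch (hypothetical helper names, not OpenWhisk code):

```python
import socket

def tcp_probe(host, port, timeout=1.0):
    """Bare TCP probe: a successful connect() is only weak evidence of
    health, since a proxy in front of the container may accept the
    connection even when nothing is listening inside it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_health_probe(host, port, path="/health", timeout=1.0):
    """Stronger probe: send a minimal HTTP request and require at least
    one response byte, so a proxy that merely accepts connections does
    not pass. The /health path is an assumption for illustration."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            req = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            s.sendall(req.encode())
            return s.recv(1) != b""  # require an actual response byte
    except OSError:
        return False
```

The trade-off is the usual one: the application-level probe is more trustworthy but touches the container's HTTP stack, while the bare TCP probe is cheaper but can report false positives behind a proxy.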

On Wed., Oct. 30, 2019 at 16:34, Tyson Norris
<tnor...@adobe.com.invalid> wrote:

> I don't think "retry" is the right handling for warm connection failures -
> if a connection cannot be made due to container crash/removal, it won't
> suddenly come back. I would instead treat it as a "reschedule", where the
> failure routes the activation back to ContainerPool, to be scheduled to a
> different container. I'm not sure how reliably we can distinguish
> container failure from a temporary network issue that may or may not
> resolve on its own, so I would treat them the same and assume the
> container is gone.
>
> So for this PR, is there any objection to:
> - for prewarm, use the tcp connection for monitoring outside of activation
> workflow
> - for warm, handle it as a case of retry, where a request *connection*
> failure (only for /run) will be handled by rescheduling back to
> ContainerPool (/init should already be handled by retry for a time period).
>
> Thanks!
> Tyson
>
> On 10/30/19, 7:03 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:
>
>     Increasing latency would be my biggest concern here as well. With a
>     health ping, we can't even be sure that a container is still healthy
>     for the "real request". To guarantee that, I'd still propose to have
>     a look at the possible failure modes and implement a retry mechanism
>     on them. If you get a "connection refused" error, I'm fairly certain
>     that it can be retried without harm. In fact, any error where we can
>     guarantee that we haven't actually reached the container can be
>     safely retried in the described way.
>
>     Pre-warmed containers indeed are somewhat of a different story. A
>     health ping as mentioned here would for sure help there, be it just a
>     TCP probe or even a full-fledged /health call. I'd be fine with
>     either way in this case as it doesn't affect the critical path.
>
>     On Tue., Oct. 29, 2019 at 18:00, Tyson Norris
>     <tnor...@adobe.com.invalid> wrote:
>
>     > By "critical path" you mean the path during action invocation?
>     > The current PR only introduces latency on that path for the case of
>     > a Paused container changing to Running state (once per transition
>     > from Paused -> Running).
>     > In case it isn't clear, this change does not affect any retry (or
>     > lack of retry) behavior.
>     >
>     > Thanks
>     > Tyson
>     >
>     > On 10/29/19, 9:38 AM, "Rodric Rabbah" <rod...@gmail.com> wrote:
>     >
>     >     as a longer term point to consider, i think the current model
>     >     of "best effort at most once" was the wrong design point - if
>     >     we embraced failure and just retried (at least once), then
>     >     failure at this level would lead to retries, which is
>     >     reasonable.
>     >
>     >     if we added a third health route or introduced a health check,
>     >     would we increase the critical path?
>     >
>     >     -r
>     >
>     >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gro...@us.ibm.com>
>     >     wrote:
>     >
>     >     > Tyson Norris <tnor...@adobe.com.INVALID> wrote on 10/28/2019
>     >     > 11:17:50 AM:
>     >     > > I'm curious to know what other folks think about "generic
>     >     > > active probing from invoker" vs "docker/mesos/k8s specific
>     >     > > integrations for reacting to container failures"?
>     >     > >
>     >     >
>     >     > From a pure maintenance and testing perspective I think a
>     >     > single common mechanism would be best if we can do it with
>     >     > acceptable runtime overhead.
>     >     >
>     >     > --dave
>     >     >
>     >
>     >
>     >
>
>
>
