Hi -
As discussed, I have updated the PR to reflect:
> - for prewarm, use the TCP connection for monitoring outside of the
>   activation workflow
> - for warm, handle it as a retry case, where a request *connection*
>   failure (only for /run) is handled by rescheduling back to
>   ContainerPool, as sketched below (/init should already be handled by
>   retry for a time period).
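For illustration, the warm /run handling amounts roughly to the
following (a sketch with stand-in types, not the PR's actual code):

    import java.net.ConnectException
    import scala.concurrent.{ExecutionContext, Future}

    // Stand-ins for the invoker's real types.
    final case class Run(activationId: String)
    sealed trait RunOutcome
    case object Completed extends RunOutcome
    final case class RescheduleToPool(job: Run) extends RunOutcome

    // A /run whose *connection* fails never reached the container, so
    // the job goes back to ContainerPool for a different container.
    def runOnWarmContainer(post: Run => Future[Unit], job: Run)(
        implicit ec: ExecutionContext): Future[RunOutcome] =
      post(job).map(_ => Completed: RunOutcome).recover {
        case _: ConnectException => RescheduleToPool(job)
      }
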
Please review and provide any feedback.
https://github.com/apache/openwhisk/pull/4698
Thanks!
Tyson
On 10/30/19, 9:03 AM, "Markus Thömmes" <[email protected]> wrote:
Yes, I used the word "retry" here to mean "reschedule to another
container", just like you would if the health probe failed.
A word of caution: TCP probes can behave strangely in a container
setting. They sometimes accept connections even though nothing is
listening, and similar oddities.
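
Concretely, such a probe is just a connect attempt, along these lines
(a minimal sketch; the caveat above applies to exactly this pattern):

    import java.net.{InetSocketAddress, Socket}
    import scala.util.Try

    // True if a TCP connect to host:port succeeds within the timeout.
    // Caveat: a userland proxy (e.g. docker-proxy) may accept the
    // connect even when nothing is listening inside the container, so
    // success is only weak evidence of health.
    def tcpProbe(host: String, port: Int, timeoutMs: Int = 100): Boolean =
      Try {
        val socket = new Socket()
        try socket.connect(new InetSocketAddress(host, port), timeoutMs)
        finally socket.close()
      }.isSuccess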
On Wed, Oct 30, 2019 at 4:34 PM Tyson Norris
<[email protected]> wrote:
> I don't think "retry" is the right handling for warm connection failures -
> if a connection cannot be made due to container crash/removal, it won't
> suddenly come back. I would instead treat it as a "reschedule", where the
> failure routes the activation back to ContainerPool, to be scheduled to a
> different container. I'm not sure how reliably we can distinguish
> container failure from a temporary network issue that may or may not
> resolve on its own, so I would treat them the same, and assume the
> container is gone.
>
> So for this PR, is there any objection to:
> - for prewarm, use the TCP connection for monitoring outside of the
>   activation workflow
> - for warm, handle it as a retry case, where a request *connection*
>   failure (only for /run) is handled by rescheduling back to
>   ContainerPool (/init should already be handled by retry for a time
>   period).
>
> Thanks!
> Tyson
>
> On 10/30/19, 7:03 AM, "Markus Thömmes" <[email protected]> wrote:
>
> Increasing latency would be my biggest concern here as well. With a
> health ping, we can't even be sure that a container is still healthy
> for the "real request". To guarantee that, I'd still propose to have a
> look at the possible failure modes and implement a retry mechanism on
> them. If you get a "connection refused" error, I'm fairly certain that
> it can be retried without harm. In fact, any error where we can
> guarantee that we haven't actually reached the container can be safely
> retried in the described way.
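>
> For illustration, I'd classify errors roughly like this (a sketch,
> hypothetical helper only):
>
>     import java.net.{ConnectException, NoRouteToHostException}
>
>     // Only errors raised before the request reaches the container are
>     // safe to retry; anything after connect is ambiguous, since the
>     // action may already have started executing.
>     def safeToRetry(t: Throwable): Boolean = t match {
>       case _: ConnectException | _: NoRouteToHostException => true
>       case _ => false
>     }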
>
> Pre-warmed containers indeed are somewhat of a different story. A
> health ping as mentioned here would for sure help there, be it just a
> TCP probe or even a full-fledged /health call. I'd be fine with either
> way in this case as it doesn't affect the critical path.
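>
> Off the critical path, that check could be as simple as a periodic
> task per prewarmed container, e.g. (a sketch with a hypothetical
> onUnhealthy callback):
>
>     import java.util.concurrent.{Executors, TimeUnit}
>
>     // Probe each prewarmed container every few seconds, outside any
>     // activation; a container failing the probe is simply replaced.
>     def monitorPrewarm(probe: () => Boolean, onUnhealthy: () => Unit): Unit = {
>       val scheduler = Executors.newSingleThreadScheduledExecutor()
>       scheduler.scheduleWithFixedDelay(
>         () => if (!probe()) onUnhealthy(), 1, 5, TimeUnit.SECONDS)
>     }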
>
> On Tue, Oct 29, 2019 at 6:00 PM Tyson Norris
> <[email protected]> wrote:
>
> > By "critical path" you mean the path during action invocation?
> > The current PR only introduces latency on that path for the case of
> > a Paused container changing to Running state (once per transition
> > from Paused -> Running).
> > In case it isn't clear, this change does not affect any retry (or
> > lack of retry) behavior.
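> >
> > For illustration, the only added step is on resume, e.g. (a sketch
> > with hypothetical names, not the PR's actual code):
> >
> >     import scala.concurrent.{ExecutionContext, Future}
> >
> >     // On Paused -> Running, run one probe before dispatching work;
> >     // this is the only point where latency is added.
> >     def resume(unpause: () => Future[Unit], probe: () => Future[Unit])(
> >         implicit ec: ExecutionContext): Future[Unit] =
> >       unpause().flatMap(_ => probe())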
> >
> > Thanks
> > Tyson
> >
> > On 10/29/19, 9:38 AM, "Rodric Rabbah" <[email protected]> wrote:
> >
> > As a longer-term point to consider, I think the current model of
> > "best effort at most once" was the wrong design point - if we
> > embraced failure and just retried (at least once), then failure at
> > this level would lead to retries, which is reasonable.
> >
> > If we added a third health route or introduced a health check,
> > would we increase the critical path?
> >
> > -r
> >
> > On Tue, Oct 29, 2019 at 12:29 PM David P Grove
> > <[email protected]> wrote:
> >
> > > Tyson Norris <[email protected]> wrote on 10/28/2019
> > > 11:17:50 AM:
> > > > I'm curious to know what other folks think about "generic active
> > > > probing from invoker" vs "docker/mesos/k8s specific integrations
> > > > for reacting to container failures"?
> > >
> > > From a pure maintenance and testing perspective I think a single
> > > common mechanism would be best if we can do it with acceptable
> > > runtime overhead.
> > >
> > > --dave
> > >
> >
> >
> >
>
>
>