Hi -
As discussed, I have updated the PR to reflect:
> - for prewarm, use the tcp connection for monitoring outside of activation
> workflow
> - for warm, handle it as a case of retry, where request *connection*
> failure only for /run, will be handled by way of rescheduling back to
> ContainerPool (/init should already be handled by retry for a time period).

Please review and provide any feedback.
https://github.com/apache/openwhisk/pull/4698

Thanks!
Tyson

On 10/30/19, 9:03 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:

    Yes, I used the word "retry" here to mean "reschedule to another
    container", just like you would if the healthiness probe failed.

    A word of caution: TCP probes might behave strangely in a container
    setting. They sometimes accept connections even though nothing is
    listening and stuff like that.

    On Wed, Oct 30, 2019 at 4:34 PM, Tyson Norris
    <tnor...@adobe.com.invalid> wrote:

    > I don't think "retry" is the right handling for warm connection failures -
    > if a connection cannot be made due to container crash/removal, it won't
    > suddenly come back. I would instead treat it as a "reschedule", where the
    > failure routes the activation back to ContainerPool, to be scheduled to a
    > different container. I'm not sure how well we can distinguish a
    > container failure from a temporary network issue that may or may not
    > resolve on its own, so I would treat them the same, and assume the
    > container is gone.
    >
    > So for this PR, is there any objection to:
    > - for prewarm, use the tcp connection for monitoring outside of activation
    > workflow
    > - for warm, handle it as a case of retry, where request *connection*
    > failure only for /run, will be handled by way of rescheduling back to
    > ContainerPool (/init should already be handled by retry for a time period).
    >
    > Thanks!
    > Tyson
    >
    > On 10/30/19, 7:03 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:
    >
    >     Increasing latency would be my biggest concern here as well. With a
    >     health ping, we can't even be sure that a container is still healthy
    >     for the "real request". To guarantee that, I'd still propose to have
    >     a look at the possible failure modes and implement a retry mechanism
    >     on them. If you get a "connection refused" error, I'm fairly certain
    >     that it can be retried without harm.
    >     In fact, any error where we can guarantee that we haven't actually
    >     reached the container can be safely retried in the described way.
    >
    >     Pre-warmed containers indeed are somewhat of a different story. A
    >     health ping as mentioned here would for sure help there, be it just
    >     a TCP probe or even a full-fledged /health call. I'd be fine with
    >     either way in this case as it doesn't affect the critical path.
    >
    >     On Tue, Oct 29, 2019 at 6:00 PM, Tyson Norris
    >     <tnor...@adobe.com.invalid> wrote:
    >
    >     > By "critical path" you mean the path during action invocation?
    >     > The current PR only introduces latency on that path for the case of a
    >     > Paused container changing to Running state (once per transition from
    >     > Paused -> Running).
    >     > In case it isn't clear, this change does not affect any retry (or
    >     > lack of retry) behavior.
    >     >
    >     > Thanks
    >     > Tyson
    >     >
    >     > On 10/29/19, 9:38 AM, "Rodric Rabbah" <rod...@gmail.com> wrote:
    >     >
    >     >     as a longer term point to consider, i think the current model of
    >     >     "best effort at most once" was the wrong design point - if we
    >     >     embraced failure and just retried (at least once), then failure
    >     >     at this level would lead to retries which is reasonable.
    >     >
    >     >     if we added a third health route or introduced a health check,
    >     >     would we increase the critical path?
    >     >
    >     >     -r
    >     >
    >     >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gro...@us.ibm.com>
    >     >     wrote:
    >     >
    >     >     > Tyson Norris <tnor...@adobe.com.INVALID> wrote on 10/28/2019
    >     >     > 11:17:50 AM:
    >     >     > > I'm curious to know what other folks think about "generic
    >     >     > > active probing from invoker" vs "docker/mesos/k8s specific
    >     >     > > integrations for reacting to container failures"?
    >     >     >
    >     >     > From a pure maintenance and testing perspective I think a single
    >     >     > common mechanism would be best if we can do it with acceptable
    >     >     > runtime overhead.
    >     >     >
    >     >     > --dave
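
For reference, the two behaviors settled on at the top of this thread can be sketched roughly as below. This is only an illustrative Python sketch, not the code in the PR (OpenWhisk itself is written in Scala). The names tcp_probe and handle_run_failure, the timeout value, and the "reschedule"/"fail" return values are all hypothetical; the only ideas taken from the thread are "probe prewarm containers over TCP outside the activation path" and "send a /run connection failure back to the pool instead of retrying the same container".

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 0.1) -> bool:
    """Prewarm health check: True if a TCP connection can be opened.

    Caveat raised in the thread: in container setups a TCP connect may
    succeed even though nothing is listening, so True is only a hint.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable hosts, etc.
        return False


def handle_run_failure(error: Exception) -> str:
    """Decide what to do when a /run request to a warm container fails.

    Assumption for this sketch: only "connection refused" is treated as
    proof that the request never reached the container.
    """
    if isinstance(error, ConnectionRefusedError):
        # The action was never started, so hand the activation back to
        # the pool to be scheduled on a different container.
        return "reschedule"
    # Any other error may have reached the container; running the
    # activation again could execute it twice.
    return "fail"
```

A periodic task could call tcp_probe for each prewarm container and remove those that return False, while the /run path would call handle_run_failure only after a request has already failed, so the invocation path pays no extra cost in the healthy case.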