I don't think "retry" is the right handling for warm connection failures: if a connection cannot be made because the container crashed or was removed, it won't suddenly come back. I would instead treat it as a "reschedule", where the failure routes the activation back to ContainerPool to be scheduled to a different container. I'm not sure how reliably we can distinguish a container failure from a temporary network issue that may or may not resolve on its own, so I would treat them the same and assume the container is gone.
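
To make the "reschedule" idea concrete, here is a minimal sketch. It is written in Python for brevity and is not the OpenWhisk code: `Container`, `ContainerPool.schedule`, `ConnectionFailed`, and `max_reschedules` are all hypothetical names for illustration. It also assumes the pool simply gives up when no container is left, whereas a real pool would presumably start a new one.

```python
from dataclasses import dataclass
from typing import List


class ConnectionFailed(Exception):
    """Hypothetical error: the /run request never reached the container."""


@dataclass
class Container:
    name: str
    reachable: bool = True  # stand-in for "the container still exists"

    def run(self, activation: str) -> str:
        # A crashed or removed container refuses the connection outright.
        if not self.reachable:
            raise ConnectionFailed(self.name)
        return f"{activation} ran on {self.name}"


@dataclass
class ContainerPool:
    containers: List[Container]
    max_reschedules: int = 3  # assumed bound so a bad pool cannot loop forever

    def schedule(self, activation: str) -> str:
        for _ in range(self.max_reschedules + 1):
            if not self.containers:
                break
            container = self.containers[0]
            try:
                return container.run(activation)
            except ConnectionFailed:
                # Do not retry the same container: assume it is gone,
                # drop it, and reschedule the activation to the next one.
                self.containers.pop(0)
        raise RuntimeError(f"no reachable container for {activation}")
```

The point of the sketch is that a connection failure is never retried against the same container; the container is dropped and the activation goes to a different one. Failures that occur after the request has reached the container are deliberately not handled here.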
So for this PR, is there any objection to:
- for prewarm, use the TCP connection for monitoring outside of the activation workflow
- for warm, handle it as a case of retry, where a request *connection* failure, for /run only, is handled by rescheduling back to ContainerPool (/init should already be handled by retrying for a time period)

Thanks!
Tyson

On 10/30/19, 7:03 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:

Increasing latency would be my biggest concern here as well. With a health ping, we can't even be sure that a container is still healthy for the "real request". To guarantee that, I'd still propose to have a look at the possible failure modes and implement a retry mechanism on them. If you get a "connection refused" error, I'm fairly certain that it can be retried without harm. In fact, any error where we can guarantee that we haven't actually reached the container can be safely retried in the described way.

Pre-warmed containers indeed are somewhat of a different story. A health ping as mentioned here would for sure help there, be it just a TCP probe or even a full-fledged /health call. I'd be fine with either way in this case as it doesn't affect the critical path.

On Tue, Oct 29, 2019 at 6:00 PM Tyson Norris <tnor...@adobe.com.invalid> wrote:

> By "critical path" you mean the path during action invocation?
> The current PR only introduces latency on that path for the case of a Paused container changing to Running state (once per transition from Paused -> Running).
> In case it isn't clear, this change does not affect any retry (or lack of retry) behavior.
>
> Thanks
> Tyson
>
> On 10/29/19, 9:38 AM, "Rodric Rabbah" <rod...@gmail.com> wrote:
>
> as a longer term point to consider, i think the current model of "best effort at most once" was the wrong design point - if we embraced failure and just retried (at least once), then failure at this level would lead to retries which is reasonable.
>
> if we added a third health route or introduced a health check, would we increase the critical path?
>
> -r
>
> On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gro...@us.ibm.com> wrote:
>
> > Tyson Norris <tnor...@adobe.com.INVALID> wrote on 10/28/2019 11:17:50 AM:
> > > I'm curious to know what other folks think about "generic active probing from invoker" vs "docker/mesos/k8s specific integrations for reacting to container failures"?
> >
> > From a pure maintenance and testing perspective I think a single common mechanism would be best if we can do it with acceptable runtime overhead.
> >
> > --dave