Hi Markus -

The failures are generic and we haven't identified a root cause yet; on Mesos we get "Container exited with status 125". We continue to investigate that, of course, but containers may die for any number of reasons, so we should plan on them dying. We already get an event from Mesos on these failures, and I'm sure we could integrate with Kubernetes to react as well, but I thought it would be better to make this probing simpler and consistent across container factories, e.g. so that DockerContainerFactory can be treated the same way. If nothing else, it is certainly easier to test. I'm curious what other folks think about "generic active probing from invoker" vs. "docker/mesos/k8s-specific integrations for reacting to container failures".
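To make the "generic active probing" option concrete, the probe I have in mind is roughly the sketch below (names are illustrative, not the exact code in the PR): a plain TCP connect against the container's (host, port), which is cheap and works the same way regardless of which container factory spawned the container.

    object ContainerPing {
      import java.net.{InetSocketAddress, Socket}
      import scala.util.Try

      /** True if (host, port) accepts a TCP connection within timeoutMs. */
      def tcpAlive(host: String, port: Int, timeoutMs: Int = 100): Boolean =
        Try {
          val socket = new Socket()
          try socket.connect(new InetSocketAddress(host, port), timeoutMs)
          finally socket.close()
        }.isSuccess
    }

Prewarm containers would be probed on a schedule (and backfilled on failure); warm containers would be probed exactly once, right after resume and before any HTTP request is sent.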
RE HTTP requests - For prewarm containers we cannot add this check there: if, say, 20 prewarms have failed for this invoker, a single activation might try each of those twenty before getting a working container, which seems like bad behavior compared to preemptively validating the container and replacing it outside the HTTP workflow. For warm containers it would be more feasible, but we would need to distinguish "/run after resume" from "/run before pause" and provide a special error case for connection failure after resume, since we cannot treat all warm container failures as retriable - only the first failure after resume is safe to retry. That seemed more complicated than explicitly checking once after resume inside ContainerProxy.

One possible change would be to move the checking logic inside either Container or ContainerClient, but I would keep it separate from /init and /run, and revisit it if we change the HTTP protocol to include more sophisticated checking (e.g. add a /health endpoint).
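To illustrate what "checking once after resume" means for the warm path, here is a rough sketch (helper names like reschedule and dispatchRun are placeholders, not the real ContainerProxy API; it reuses tcpAlive from the sketch above). The connection check sits between resume() and /run, so only a failure at that point is treated as retriable:

    import scala.concurrent.{ExecutionContext, Future}

    sealed trait WarmRunOutcome
    // container died while paused; /run was never sent, so the job is safe to retry
    case object RescheduledToPool extends WarmRunOutcome
    // /run was dispatched; failures after this point are NOT retried
    case object RunDispatched extends WarmRunOutcome

    def runAfterResume(host: String, port: Int,
                       resume: () => Future[Unit],
                       reschedule: () => Unit,
                       dispatchRun: () => Unit)(implicit ec: ExecutionContext): Future[WarmRunOutcome] =
      resume().map { _ =>
        if (ContainerPing.tcpAlive(host, port)) { dispatchRun(); RunDispatched }
        else { reschedule(); RescheduledToPool }
      }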
Thanks
Tyson

On 10/28/19, 2:21 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:

Heya,

thanks for the elaborate proposal. Do you have any more information on why these containers are dying off in the first place?

In the case of Kubernetes/Mesos I could imagine we might want to keep the Invoker's state consistent by checking it against the respective API repeatedly. On Kubernetes for instance, you could set up an informer that would inform you about any state changes on the pods that this Invoker has spawned. If a prewarm container dies this way, we can simply remove it from the Invoker's bookkeeping and trigger a backfill.

Secondly, could we potentially fold this check into the HTTP requests themselves? If we get a "connection refused" on an action that we knew worked before, we can safely retry. There should be a set of exceptions that our HTTP clients surface that should be safe for us to retry in the invoker anyway. The only addition you'd need in this case is an enhancement to the ContainerProxy's state machine, I believe, that allows for such a retrying use-case. The "connection refused" use-case I mentioned should be equivalent to the TCP probe you're doing now.

WDYT?

Cheers,
Markus

On Sun., Oct 27, 2019 at 02:56, Tyson Norris <tnor...@adobe.com.invalid> wrote:

> Hi Whiskers –
>
> We periodically have an unfortunate problem where a docker container (or
> worse, many of them) dies off unexpectedly, outside of HTTP usage from the
> invoker. In these cases, prewarm or warm containers may still have
> references at the Invoker, and eventually, if an activation arrives that
> matches those container references, the HTTP workflow starts and fails
> immediately since the node is not listening anymore, resulting in failed
> activations. An even worse situation can occur when a container failed
> earlier and a new container, initialized with a different action, is
> started on the same host and port (more likely a problem for k8s/mesos
> cluster usage).
>
> To mitigate these issues, I put together a health check process [1] from
> the invoker to action containers, where we can test
>
> * prewarm containers periodically, to verify they are still operational, and
> * warm containers immediately after resuming them (before HTTP requests are sent).
>
> In case of prewarm failure, we should backfill the prewarms to the
> specified config count. In case of warm failure, the activation is
> rescheduled to ContainerPool, which typically would either route to a
> different prewarm or start a new cold container.
>
> The test ping is in the form of a TCP connection only, since we would
> otherwise need to update the HTTP protocol implemented by all runtimes.
> This test is good enough for the worst case of "container has gone
> missing", but cannot test for more subtle problems like "/run endpoint is
> broken". There could be other checks to increase the quality of the test
> in the future, but most of that I think requires expanding the HTTP
> protocol and the state managed at the container, and I wanted to get
> something working for basic functionality to start with.
>
> Let me know if you have opinions about this, and we can discuss here or
> in the PR.
>
> Thanks
> Tyson
>
> [1] https://github.com/apache/openwhisk/pull/4698
>
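P.S. For the Kubernetes-specific route you describe (an informer/watch on the pods this invoker spawned), I believe something roughly like the sketch below would work with the fabric8 client; the namespace, label selector, and reaction hooks are made up for illustration and untested:

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}
    import io.fabric8.kubernetes.client.Watcher.Action

    object PodWatchSketch {
      def main(args: Array[String]): Unit = {
        val client = new DefaultKubernetesClient()
        // watch only the pods this invoker spawned (label selector is hypothetical)
        client.pods()
          .inNamespace("openwhisk")
          .withLabel("invoker", "invoker0")
          .watch(new Watcher[Pod] {
            override def eventReceived(action: Action, pod: Pod): Unit = action match {
              case Action.DELETED | Action.ERROR =>
                // hypothetical reaction: drop the container reference and backfill prewarms
                println(s"pod ${pod.getMetadata.getName} is gone: remove from pool, backfill prewarm")
              case _ => ()
            }
            override def onClose(cause: KubernetesClientException): Unit = ()
          })
      }
    }

That said, my earlier point still applies: a per-container probe from the invoker is factory-agnostic, also covers DockerContainerFactory, and is easier to test.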