Hi Whiskers,

We periodically hit an unfortunate problem where a docker container (or worse, many of them) dies unexpectedly, outside of any HTTP usage from the invoker. In these cases the Invoker may still hold references to those prewarm or warm containers, and if an activation later arrives that matches one of those references, the HTTP workflow starts and fails immediately because the node is no longer listening, resulting in failed activations. An even worse situation arises when a container fails and a new container, initialized with a different action, comes up on the same host and port (more likely a problem for k8s/mesos cluster usage).
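The core symptom in the first case is simply that nothing is listening on the container's port anymore, which can be detected with a bare TCP connect attempt, without speaking the runtimes' HTTP protocol. A minimal sketch of that idea in Java (the `tcpPing` name and the timeout are illustrative, not the actual PR code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch (not the PR's actual implementation): detect the
// "node is no longer listening" case with a bare TCP connect, without
// speaking the action runtimes' HTTP protocol.
class ContainerPing {
    // Returns true if something accepts a TCP connection at host:port
    // within timeoutMs; the timeout value used is an arbitrary choice.
    static boolean tcpPing(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Note this only tells us the port is reachable; it says nothing about whether the runtime behind it is healthy.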
To mitigate these issues, I put together a health check process [1] from the invoker to action containers, where we can test:

* prewarm containers periodically, to verify they are still operational, and
* warm containers immediately after resuming them (before HTTP requests are sent).

On a prewarm failure, we should backfill the prewarms to the configured count. On a warm failure, the activation is rescheduled to the ContainerPool, which typically either routes it to a different prewarm or starts a new cold container.

The test ping is a bare TCP connection only, since anything more would require updating the HTTP protocol implemented by all runtimes. This test is good enough for the worst case of "container has gone missing", but cannot detect more subtle problems like "the /run endpoint is broken". We could add other checks in the future to increase the quality of the test, but most of that, I think, requires expanding the HTTP protocol and the state managed at the container, and I wanted to get basic functionality working to start with.

Let me know if you have opinions about this, and we can discuss here or in the PR.

Thanks
Tyson

[1] https://github.com/apache/openwhisk/pull/4698