Hi Whiskers –
We periodically run into an unfortunate problem where a docker container (or 
worse, many of them) dies unexpectedly, outside of any HTTP interaction with 
the invoker. In these cases, the Invoker may still hold references to those 
prewarm or warm containers, and if an activation eventually arrives that 
matches one of those container references, the HTTP workflow starts and fails 
immediately because nothing is listening anymore, resulting in failed 
activations. An even worse situation arises when a container failed earlier 
and a new container, initialized with a different action, comes up on the same 
host and port (more likely a problem for k8s/mesos cluster usage).

To mitigate these issues, I put together a health check process [1] from 
invoker to action containers, where we can test

  *   prewarm containers periodically to verify they are still operational, and
  *   warm containers immediately after resuming them (before HTTP requests are 
sent)

In case of prewarm failure, the prewarm pool should be backfilled to the 
configured count.
In case of warm failure, the activation is rescheduled to ContainerPool, which 
will typically either route it to a different prewarm container or start a new 
cold container.
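
For illustration, here is a rough sketch of the pool-side behavior described 
above. This is not the actual ContainerPool/ContainerProxy code touched by the 
PR; all of the types and names below are invented for the example.

case class ContainerRef(host: String, port: Int)

sealed trait PoolEntry
case class Prewarm(ref: ContainerRef) extends PoolEntry
case class Warm(ref: ContainerRef, action: String) extends PoolEntry

trait HealthCheck { def isAlive(ref: ContainerRef): Boolean }

class PoolSketch(check: HealthCheck, prewarmTarget: Int) {
  private var prewarms = List.empty[Prewarm]

  // Periodic pass over prewarm containers: drop any that fail the check and
  // backfill the pool back up to the configured count.
  def pruneAndBackfillPrewarms(startPrewarm: () => Prewarm): Unit = {
    prewarms = prewarms.filter(p => check.isAlive(p.ref))
    while (prewarms.size < prewarmTarget) prewarms = startPrewarm() :: prewarms
  }

  // Before routing an activation to a resumed warm container, verify it is
  // still reachable; otherwise hand it back so it can be rescheduled to a
  // different prewarm or a new cold container.
  def routeToWarm(warm: Warm, run: Warm => Unit, reschedule: () => Unit): Unit =
    if (check.isAlive(warm.ref)) run(warm) else reschedule()
}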

The test ping is a TCP connection only, since anything more would require 
updating the HTTP protocol implemented by all runtimes. This test is good 
enough for the worst case of “container has gone missing”, but cannot detect 
more subtle problems like “/run endpoint is broken”. We could add other checks 
in the future to improve the quality of the test, but I think most of those 
would require expanding the HTTP protocol and the state managed at the 
container, so I wanted to get something working for basic functionality to 
start with.
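
As a concrete sketch of what the TCP-only ping looks like (the object/method 
names and the timeout value here are illustrative, not the PR's actual code or 
configuration):

import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object HealthPing {
  // Plain TCP connect to the container's action proxy: succeeds if something
  // is listening on host:port, fails fast if the container has gone missing.
  def tcpPing(host: String, port: Int, timeoutMs: Int = 100): Boolean = {
    val socket = new Socket()
    val connected =
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs))
    Try(socket.close()) // best-effort close, ignore errors
    connected.isSuccess
  }
}

A failed connect here only tells us the container is unreachable; it says 
nothing about whether /init or /run would still behave correctly.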

Let me know if you have opinions about this, and we can discuss here or in the 
PR.
Thanks
Tyson

[1] https://github.com/apache/openwhisk/pull/4698
