Hi Markus -

The failures are generic and we haven't identified a root cause yet; on Mesos we get "Container exited with status 125". We continue to investigate that, of course, but containers may die for any number of reasons, so we should plan on them dying. We already get an event from Mesos on these failures, and I'm sure we could integrate with Kubernetes to react as well, but I thought it would be better to make this probing simpler and consistent across container factories, e.g. so that DockerContainerFactory can be treated the same way. If nothing else, it is certainly easier to test. I'm curious what other folks think about "generic active probing from invoker" vs. "docker/mesos/k8s-specific integrations for reacting to container failures".
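To make the "generic active probing" option concrete, the probe I have in mind is roughly the sketch below (names are illustrative, not the exact code in the PR): a plain TCP connect against the container's (host, port), which is cheap and works the same way regardless of which container factory spawned the container.

    object ContainerPing {
      import java.net.{InetSocketAddress, Socket}
      import scala.util.Try

      /** True if (host, port) accepts a TCP connection within timeoutMs. */
      def tcpAlive(host: String, port: Int, timeoutMs: Int = 100): Boolean =
        Try {
          val socket = new Socket()
          try socket.connect(new InetSocketAddress(host, port), timeoutMs)
          finally socket.close()
        }.isSuccess
    }

Prewarm containers would be probed on a schedule (and backfilled on failure); warm containers would be probed exactly once, right after resume and before any HTTP request is sent.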
RE HTTP requests - For prewarm containers we cannot add this check there: if, say, 20 prewarms have failed for this invoker, a single activation might try each of those twenty before getting a working container, which seems like bad behavior compared to preemptively validating the container and replacing it outside the HTTP workflow. For warm containers it would be more feasible, but we would need to distinguish "/run after resume" from "/run before pause" and provide a special error case for connection failure after resume, since we cannot treat all warm container failures as retriable - only the first failure after resume is safe to retry. That seemed more complicated than explicitly checking once after resume inside ContainerProxy.

One possible change would be to move the checking logic inside either Container or ContainerClient, but I would keep it separate from /init and /run, and revisit it if we change the HTTP protocol to include more sophisticated checking (e.g. add a /health endpoint).
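To illustrate what "checking once after resume" means for the warm path, here is a rough sketch (helper names like reschedule and dispatchRun are placeholders, not the real ContainerProxy API; it reuses tcpAlive from the sketch above). The connection check sits between resume() and /run, so only a failure at that point is treated as retriable:

    import scala.concurrent.{ExecutionContext, Future}

    sealed trait WarmRunOutcome
    // container died while paused; /run was never sent, so the job is safe to retry
    case object RescheduledToPool extends WarmRunOutcome
    // /run was dispatched; failures after this point are NOT retried
    case object RunDispatched extends WarmRunOutcome

    def runAfterResume(host: String, port: Int,
                       resume: () => Future[Unit],
                       reschedule: () => Unit,
                       dispatchRun: () => Unit)(implicit ec: ExecutionContext): Future[WarmRunOutcome] =
      resume().map { _ =>
        if (ContainerPing.tcpAlive(host, port)) { dispatchRun(); RunDispatched }
        else { reschedule(); RescheduledToPool }
      }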
Thanks
Tyson

On 10/28/19, 2:21 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:

Heya,

thanks for the elaborate proposal. Do you have any more information on why these containers are dying off in the first place?

In the case of Kubernetes/Mesos I could imagine we might want to keep the Invoker's state consistent by checking it against the respective API repeatedly. On Kubernetes for instance, you could set up an informer that would inform you about any state changes on the pods that this Invoker has spawned. If a prewarm container dies this way, we can simply remove it from the Invoker's bookkeeping and trigger a backfill.

Secondly, could we potentially fold this check into the HTTP requests themselves? If we get a "connection refused" on an action that we knew worked before, we can safely retry. There should be a set of exceptions that our HTTP clients surface that should be safe for us to retry in the invoker anyway. The only addition you'd need in this case is an enhancement to the ContainerProxy's state machine, I believe, that allows for such a retrying use-case. The "connection refused" use-case I mentioned should be equivalent to the TCP probe you're doing now.

WDYT?

Cheers,
Markus

On Sun., Oct 27, 2019 at 02:56, Tyson Norris <tnor...@adobe.com.invalid> wrote:

> Hi Whiskers –
>
> We periodically have an unfortunate problem where a docker container (or
> worse, many of them) dies off unexpectedly, outside of HTTP usage from the
> invoker. In these cases, prewarm or warm containers may still have
> references at the Invoker, and eventually, if an activation arrives that
> matches those container references, the HTTP workflow starts and fails
> immediately since the node is not listening anymore, resulting in failed
> activations. An even worse situation can occur when a container failed
> earlier and a new container, initialized with a different action, is
> started on the same host and port (more likely a problem for k8s/mesos
> cluster usage).
>
> To mitigate these issues, I put together a health check process [1] from
> the invoker to action containers, where we can test
>
> * prewarm containers periodically, to verify they are still operational, and
> * warm containers immediately after resuming them (before HTTP requests are sent).
>
> In case of prewarm failure, we should backfill the prewarms to the
> specified config count. In case of warm failure, the activation is
> rescheduled to ContainerPool, which typically would either route to a
> different prewarm or start a new cold container.
>
> The test ping is in the form of a TCP connection only, since we would
> otherwise need to update the HTTP protocol implemented by all runtimes.
> This test is good enough for the worst case of "container has gone
> missing", but cannot test for more subtle problems like "/run endpoint is
> broken". There could be other checks to increase the quality of the test
> in the future, but most of that I think requires expanding the HTTP
> protocol and the state managed at the container, and I wanted to get
> something working for basic functionality to start with.
>
> Let me know if you have opinions about this, and we can discuss here or
> in the PR.
>
> Thanks
> Tyson
>
> [1] https://github.com/apache/openwhisk/pull/4698
>
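P.S. For the Kubernetes-specific route you describe (an informer/watch on the pods this invoker spawned), I believe something roughly like the sketch below would work with the fabric8 client; the namespace, label selector, and reaction hooks are made up for illustration and untested:

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}
    import io.fabric8.kubernetes.client.Watcher.Action

    object PodWatchSketch {
      def main(args: Array[String]): Unit = {
        val client = new DefaultKubernetesClient()
        // watch only the pods this invoker spawned (label selector is hypothetical)
        client.pods()
          .inNamespace("openwhisk")
          .withLabel("invoker", "invoker0")
          .watch(new Watcher[Pod] {
            override def eventReceived(action: Action, pod: Pod): Unit = action match {
              case Action.DELETED | Action.ERROR =>
                // hypothetical reaction: drop the container reference and backfill prewarms
                println(s"pod ${pod.getMetadata.getName} is gone: remove from pool, backfill prewarm")
              case _ => ()
            }
            override def onClose(cause: KubernetesClientException): Unit = ()
          })
      }
    }

That said, my earlier point still applies: a per-container probe from the invoker is factory-agnostic, also covers DockerContainerFactory, and is easier to test.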