On Wed, Apr 26, 2017 at 10:55 AM Peter Herth <he...@peter-herth.de> wrote:

>
> No, panic certainly does not do that. It prints the stack trace. A proper
> logger could add additional information about the program state at the
> point of the panic, which is not visible from the stack trace. It also
> might at least be reasonable to perform an auto-save before quitting.
>
>
Additional comments in a haphazard order:

It makes sense to accept that a panic() in a Go program will take some
collateral requests down with it. This argument can be extended, however:
since the operating system kernel might also be wrong, it would be better
to halt the operating system whenever a goroutine panics. After all, the
logic goes, who can be sure the operating system didn't forget to release
a mutex lock? And why stop there? The hardware on which you are running
may have a failure. Better to replace that whenever a goroutine panics!

In practice---I believe this goes back to work by Peter J. Denning---we
use process isolation at the OS level to guard against such failures. We
ought to use a layered model, where each layer guards the layers below it.
I wrote a blog post on the subject seven years ago, using an onion as a
metaphor for the model[0], and it remains one of my most-read posts.

In general, failure is something you ought to capture for post-mortem
analysis. Get the core dump, push it into your blob store, restart the
process, and later attach a debugger to the dump to figure out what went
wrong. In my experience, it is also important to have access to the memory
state of the program, in addition to the backtrace, when the problem is
complex.

What Erlang people acutely understand is that the granularity of failure
matters. A single request failing in a system is usually localized to that
single request. If, however, we have a situation as Roger Pepper mentions,
where a mutex is left locked, the failure of single requests should at
some point escalate to larger parts of the system. This is where the
concept of a "restart strategy" in Erlang systems becomes necessary: more
than K failures within a time window of W increases the granularity and
resets larger parts of the system. Eventually the whole node() resets,
which is akin to a Go panic() that isn't recovered. The advantage is that
the size of the failure determines its impact: small errors have small
impact; large errors have large impact.

Dave Cheney touches on another important point: if you care about requests,
and a panic in one Go process can make other, collateral requests fail,
then you should build your load balancer so that it retries those requests
on another worker[1].

Yet another point worth mentioning is that a panic() can imply a long
recovery time for a process. If you have a large heap of several hundred
gigabytes of data, reestablishing such a heap after a failure might take a
long time. Thus, it can be beneficial to restart parts of the system at a
finer granularity first, before resorting to rebooting the full process.
Likewise, if a system knows it is in a bad state, it is often faster to
report said state to the load balancer rather than relying on the balancer
eventually figuring it out. Depending on your SLA, you may fail many
requests in the meantime, and this may affect your reliability measure.
This is especially true if your system processes requests at a high rate,
which makes it far more sensitive to latency fluctuations.
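One common way to report bad state proactively is a health endpoint the
balancer probes. A sketch, assuming the balancer probes a /healthz path
(the path and port are assumptions, not part of the original discussion):

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// healthy flags whether this process believes it is in a good state.
// Flipping it to false tells the balancer immediately, instead of
// letting the balancer infer badness from failed requests.
var healthy atomic.Bool

// healthStatus maps the flag to an HTTP status and body.
func healthStatus() (int, string) {
	if healthy.Load() {
		return http.StatusOK, "ok"
	}
	return http.StatusServiceUnavailable, "unhealthy"
}

func healthz(w http.ResponseWriter, r *http.Request) {
	code, body := healthStatus()
	w.WriteHeader(code)
	fmt.Fprintln(w, body)
}

func main() {
	healthy.Store(true)
	http.HandleFunc("/healthz", healthz)
	// Elsewhere, on detecting a bad state: healthy.Store(false).
	// The balancer then stops routing here on its next probe.
	http.ListenAndServe(":8080", nil)
}
```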

So what is a Go programmer to do? The solution, at least in my view, is
to use the 'context' package to establish a tree of work areas for the
different parts of the Go program. A failure in one subtree can then be
handled by cancelling the corresponding context, and if the system can
clean up, you can continue operating. The question, then, is what to do
with cross-cutting concerns, where one context talks to the goroutines of
another context in the tree. My guess is that you signal this by closing
channels appropriately, but I'm not sure. Erlang systems provide a monitor
concept for this very situation, in which you can subscribe to the
lifetime of another part of the system. If it fails, a message is sent to
you about the failure so you can take action.

[0]
http://jlouisramblings.blogspot.dk/2010/11/on-erlang-state-and-crashes.html

[1] Beware of a "poisonous" request, however! A single bad request that
panics a system and is then retried by the load balancer can easily take
down your entire backend worker pool.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.