Re: [go-nuts] Go program getting SIGSEGV as it panics and runs down after a SIGTERM or a SIGINT signal

2020-07-06 Thread estess
Ian,

Thank you VERY much for your reply. Your analysis was spot-on. There is a 
third way that a goroutine could have the stack ripped out from under it: if 
it is running a signal handler. Go uses the SA_ONSTACK (alternate signal 
stack) facility, which we had adopted for YottaDB as well. Back when we were 
still trying to field signals in YottaDB (C code) and pass them to Go 
(which never worked, for reasons I was never clear on), we noted that the 
altstacks Go was using were too small for our signal handler routines, so we 
would re-allocate the altstack to a larger size if we saw it was 
insufficient. Later, when we ran our cleanup handler, we would restore the 
stacks Go had been using and free what we had allocated. It was this free, 
occurring while processes were shutting down asymmetrically because of a 
signal, that occasionally pulled the stacks out from under running signal 
routines. Removing the code that did the "restore" of Go's altstack got rid 
of these failures.

Again, much thanks for the insight that enabled us to figure out what was 
wrong.

Steve



Re: [go-nuts] Go program getting SIGSEGV as it panics and runs down after a SIGTERM or a SIGINT signal

2020-07-04 Thread Ian Lance Taylor

[go-nuts] Go program getting SIGSEGV as it panics and runs down after a SIGTERM or a SIGINT signal

2020-07-02 Thread estess
We have developed a Go API (which we call a Go wrapper - 
https://pkg.go.dev/lang.yottadb.com/go/yottadb?tab=doc) to the YottaDB 
hierarchical key-value database (https://yottadb.com). It works well for 
the most part, but there are some edge cases during process shutdown that 
are standing between us and a full production-grade API. We have a test 
program called threeenp1C2 that is an implementation of the classic 3n+1 
problem that does some database-intensive activity. It comes in two 
flavors: a multi-process version (1 [application] goroutine per process) 
and a single process version (multiple [application] goroutines in one 
process). The latter runs fine; the discussion below is about the 
multi-process version.

A Go main spawns 15 copies of itself as workers, each of which runs an 
internal routine. Each process computes the lengths of 3n+1 sequences for a 
block of integers, and then returns for another block of integers. The 
results are stored in the database so that processes can use the results of 
other processes' work. When they finish computing the results for the 
problem, they shut down. So, for example, the overall problem may be to 
compute the 3n+1 sequences for all integers from one through a million, 
with processes working on blocks of 10,000 initial integers. There is 
nothing special about the computation other than that it generates a lot of 
database activity.
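
To make the setup concrete, here is a minimal sketch of a main program that 
re-runs its own binary as worker processes and waits for them. The worker 
flag, the block handling, and the placeholder work loop are assumptions for 
illustration; this is not the actual threeenp1C2 test, and the database 
calls are omitted.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// collatzLen returns the length of the 3n+1 sequence starting at n.
func collatzLen(n uint64) int {
	steps := 0
	for n != 1 {
		if n%2 == 0 {
			n /= 2
		} else {
			n = 3*n + 1
		}
		steps++
	}
	return steps
}

func workerLoop() {
	// Placeholder for the real work: claim a block of starting integers,
	// compute 3n+1 sequence lengths, store the results in the database
	// (so other workers can reuse them), then ask for another block.
	for n := uint64(1); n <= 10; n++ {
		fmt.Printf("pid %d: length(%d) = %d\n", os.Getpid(), n, collatzLen(n))
	}
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "-worker" {
		workerLoop()
		return
	}
	const workers = 15
	procs := make([]*exec.Cmd, 0, workers)
	for i := 0; i < workers; i++ {
		// Spawn a copy of this binary in worker mode.
		cmd := exec.Command(os.Args[0], "-worker")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			fmt.Fprintln(os.Stderr, "start worker:", err)
			os.Exit(1)
		}
		procs = append(procs, cmd)
	}
	for _, cmd := range procs {
		cmd.Wait() // wait for all workers to finish
	}
}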

This version runs fine, and always shuts down cleanly if allowed to run to 
completion. But since the database is used for mission-critical 
applications, we have a number of stress tests. The threeenp1C2 test 
involves starting 15 cooperative worker processes and then sending each 
process a SIGTERM or SIGINT, depending on the test. Sporadically, one of 
the processes receiving the signal crashes with a SIGSEGV and generates a 
core file instead of shutting down cleanly. That's the 10,000 ft view. Here 
are more details:

   - The database engine is daemonless, and runs in the address space of 
   each process, with processes cooperating with one another to manage the 
   database using a variety of semaphores and data structures in shared memory 
   segments, as well as OS semaphores and mutexes. The database engine is 
   single-threaded, and when multiple threads in a process call the database 
   engine, there are mechanisms to ensure that non-reentrant code is called 
   without reentrancy.
   - The database engine is written in C and has traditionally relied 
   heavily on signals, but with the Go wrapper calling the engine through 
   cgo, things were a bit dicey. So the code was reworked for use with Go 
   such that Go now handles the signals and lets the YottaDB engine know 
   about them. To that end, a goroutine is started for each signal type we 
   want to know about (around 17 of them), and each of them calls into a 
   "signal dispatcher" in the C code to drive the signal handlers for those 
   signals we are notified of (a minimal sketch of this arrangement follows 
   this list).
   - When a fatal signal such as a SIGTERM occurs, the goroutines started 
   for signal handling are all told to close down, and we wait for them to 
   shut down before driving a panic to stop the world (doing this reduced 
   the failures from their previous rate of nearly 100%). The core dumps 
   now occur 3-10% of the time. These strange failures are in the bowels of 
   Go (usually in either a futex call or something called newstack?). 
   Empirical evidence suggests the failure rate increases when the system is 
   loaded - presumably because load affects the timing of how and when 
   things shut down, though proof is hard to come by.
   - The database engine uses optimistic concurrency control to implement 
   ACID transactions. What this means is that Go application code (say 
   Routine A) calls the database engine through cgo, passing it a Go entry 
   point (say Routine B) that the database engine calls one or more times 
   until it completes the transaction (to avoid live-locks, in a final 
   retry, should one be required, the engine locks out all other accesses, 
   essentially single-threading the database). Routine B itself calls into 
   the YottaDB API through cgo, so mixed stacks of C and Go code are common 
   (the second sketch following this list illustrates these mixed stacks).
   - To avoid endangering database integrity, the engine attempts to shut 
   down at “safe” points. If certain fatal signals are received at an unsafe 
   spot, we defer handling the signal until the process is in a safe place. 
   To ensure that everything stops when it reaches a safe place, the engine 
   calls back into Go and drives a panic of choice to shut down the entire 
   process. I find myself wondering if the "sandwich stack" of C and Go 
   routines is somehow causing the panic to generate cores.
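
For what it's worth, here is a first minimal sketch of the signal 
arrangement described in the list above: one goroutine per signal forwarding 
to a C "signal dispatcher", all of them told to close down when a fatal 
signal arrives, and a panic driven only after they have stopped. The 
dispatcher, channel names, and the set of signals are assumptions for 
illustration, not the actual wrapper code.

package main

/*
#include <stdio.h>

// Stand-in for the C "signal dispatcher" that drives the engine's handlers.
static void dispatch_signal(int signum) {
	printf("C dispatcher handling signal %d\n", signum);
}
*/
import "C"

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// One goroutine per signal of interest (the real wrapper watches ~17).
	watched := []os.Signal{syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT}

	var wg sync.WaitGroup
	done := make(chan struct{})      // closed to tell signal goroutines to stop
	fatal := make(chan os.Signal, 1) // first fatal signal seen

	for _, sig := range watched {
		ch := make(chan os.Signal, 1)
		signal.Notify(ch, sig)
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case s := <-ch:
					// Forward the notification to the C dispatcher.
					C.dispatch_signal(C.int(s.(syscall.Signal)))
					if s == syscall.SIGTERM || s == syscall.SIGINT {
						select {
						case fatal <- s:
						default:
						}
					}
				case <-done:
					return
				}
			}
		}()
	}

	s := <-fatal // a fatal signal arrived
	close(done)  // tell all signal goroutines to close down
	wg.Wait()    // wait for them to shut down...
	// ...and only then drive the panic that stops the world.
	panic(fmt.Sprintf("shutting down on %v", s))
}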
   
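And a second minimal sketch of the mixed C/Go stacks from the last two items 
above: Go calls a stand-in C engine through cgo, the engine calls an 
exported Go transaction callback one or more times, and at a "safe point" it 
calls back into Go to drive the shutdown panic. The engine functions and 
exported names here are invented stand-ins, not YottaDB's API.

package main

/*
extern int  goTxnCallback(int attempt);
extern void goDriveShutdownPanic(void);

// Stand-in "engine": retries the Go transaction callback until it reports
// success, the way an optimistic-concurrency commit loop would.
static void run_transaction(void) {
	int attempt = 0;
	while (goTxnCallback(attempt) != 0)
		attempt++;
}

// Stand-in for "a deferred fatal signal was seen and the engine has now
// reached a safe point": call back into Go to stop the whole process.
static void engine_at_safe_point(void) {
	goDriveShutdownPanic();
}
*/
import "C"

import "fmt"

//export goTxnCallback
func goTxnCallback(attempt C.int) C.int {
	// "Routine B": Go code invoked by the C engine. In the real wrapper it
	// would itself call back into the YottaDB API through cgo.
	fmt.Println("transaction callback, attempt", attempt)
	if attempt < 2 {
		return 1 // ask the engine to retry
	}
	return 0 // committed
}

//export goDriveShutdownPanic
func goDriveShutdownPanic() {
	// The engine calls this once it is safe to stop; the unrecovered panic
	// terminates the process, which is the shutdown mechanism described above.
	panic("fatal signal received; shutting down at a safe point")
}

func main() {
	C.run_transaction()      // "Routine A": Go -> C -> Go stacks build up here
	C.engine_at_safe_point() // simulate the deferred-signal shutdown path
}
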
These failures are like nothing I've seen. Sometimes one of the threads 
that we create in the C code is still around (it's been told to shut down 
but evidently hasn't gotten around to it yet - like thread 11 in the gdb 
trace below). I