Re: [go-nuts] Go program getting SIGSEGV as it panics and runs down after a SIGTERM or a SIGINT signal
Ian, thank you VERY much for your reply. Your analysis was spot on. There is a third way that a goroutine can have the stack ripped out from under it: if it is running a signal handler. Go uses the SA_ONSTACK (alternate signal stack) facility, which we had adopted for YottaDB as well.

Back when we were still trying to field signals in YottaDB (C code) and pass them to Go (which never worked, for reasons I was never clear on), we noted that the small altstacks Go was using were too small for our signal handler routines, so we would reallocate the altstack to a larger size if we saw it was insufficient. Later, when our cleanup handler ran, we would restore the stacks that Go had been using and free what we had allocated. It was this free, occurring while processes were shutting down in an asymmetric way due to a signal, that was occasionally pulling the stacks out from under running signal routines. Removing the code that did the "restore" of Go's altstack got rid of these failures.

Again, much thanks for the insight that enabled us to figure out what was wrong.

Steve
Re: [go-nuts] Go program getting SIGSEGV as it panics and runs down after a SIGTERM or a SIGINT signal
On Thu, Jul 2, 2020 at 2:18 PM wrote:
> [snip; original message quoted in full below]
[go-nuts] Go program getting SIGSEGV as it panics and runs down after a SIGTERM or a SIGINT signal
We have developed a Go API (which we call a Go wrapper: https://pkg.go.dev/lang.yottadb.com/go/yottadb?tab=doc) to the YottaDB hierarchical key-value database (https://yottadb.com). It works well for the most part, but there are some edge cases during process shutdown that stand between us and a full production-grade API.

We have a test program called threeenp1C2, an implementation of the classic 3n+1 problem that does some database-intensive activity. It comes in two flavors: a multi-process version (one [application] goroutine per process) and a single-process version (multiple [application] goroutines in one process). The latter runs fine; the discussion below is about the multi-process version.

A Go main spawns 15 copies of itself as workers, each of which runs an internal routine. Each process computes the lengths of 3n+1 sequences for a block of integers, and then returns for another block of integers. The results are stored in the database so that processes can use the results of other processes' work. When they finish computing the results for the problem, they shut down. So, for example, the overall problem may be to compute the 3n+1 sequences for all integers from one through a million, with processes working on blocks of 10,000 initial integers. There is nothing special about the computation other than that it generates a lot of database activity.

This version runs fine, and always shuts down cleanly if allowed to run to completion. But since the database is used for mission-critical applications, we have a number of stress tests. The threeenp1C2 test involves starting 15 cooperative worker processes and then sending each process a SIGTERM or SIGINT, depending on the test. Sporadically, one of the processes receiving the signal shuts down with a SIGSEGV and generates a core file instead of shutting down cleanly. That's the 10,000-foot view.
Here are more details:

- The database engine is daemonless, and runs in the address space of each process, with processes cooperating with one another to manage the database using a variety of semaphores and data structures in shared memory segments, as well as OS semaphores and mutexes. The database engine is single-threaded, and when multiple threads in a process call the database engine, there are mechanisms to ensure that non-reentrant code is never entered reentrantly.

- The database engine is written in C and has traditionally relied heavily on signals, but with the Go wrapper calling the engine through cgo, things were a bit dicey. So the code was reworked for use with Go such that Go now handles the signals and lets the YottaDB engine know about them. To that end, a goroutine is started for each signal type we want to know about (around 17 of them), each of which then calls into a "signal dispatcher" in the C code to drive the signal handlers for those signals we are notified of.

- When a fatal signal such as a SIGTERM occurs, the goroutines started for signal handling are all told to close down, and we wait for them to shut down before driving a panic to stop the world (doing this reduced the failures from their previous rate of nearly 100%). The failures with core dumps now occur 3-10% of the time. These strange failures are in the bowels of Go (usually in either a futex call or something called newstack?). Empirical evidence suggests the failure rate increases when the system is loaded, presumably affecting the timing of how and when things shut down, though proof is hard to come by.

- The database engine uses optimistic concurrency control to implement ACID transactions.
What this means is that Go application code (say Routine A) calls the database engine through cgo, passing it a Go entry point (say Routine B) that the database engine calls one or more times until it completes the transaction (to avoid live-locks, in a final retry, should one be required, the engine locks out all other accesses, essentially single-threading the database). Routine B itself calls into the YottaDB API through cgo, so mixed stacks of C and Go code are common.

- To avoid endangering database integrity, the engine attempts to shut down at "safe" points. If certain fatal signals are received at an unsafe spot, we defer handling the signal until the engine is in a safe place. To ensure that everything stops when it reaches a safe place, the engine calls back into Go and drives a panic of choice to shut down the entire process. I find myself wondering if the "sandwich stack" of C and Go routines is somehow causing the panic to generate cores.

These failures are like nothing I've seen. Sometimes one of the threads that we create in the C code is still around (it's been told to shut down but evidently hasn't gotten around to it yet, like thread 11 in the gdb trace below).