On Thu, Jul 2, 2020 at 2:18 PM <est...@yottadb.com> wrote:
>
> We have developed a Go API (which we call a Go wrapper - 
> https://pkg.go.dev/lang.yottadb.com/go/yottadb?tab=doc) to the YottaDB 
> hierarchical key-value database (https://yottadb.com). It works well for the 
> most part, but there are some edge cases during process shutdown that stand 
> between us and a full production-grade API. We have a test program 
> called threeenp1C2, an implementation of the classic 3n+1 problem that 
> does some database-intensive activity. It comes in two flavors: a 
> multi-process version (1 [application] goroutine per process) and a single 
> process version (multiple [application] goroutines in one process). The 
> latter runs fine; the discussion below is about the multi-process version.
>
> A Go main spawns 15 copies of itself as workers, each of which runs an 
> internal routine. Each process computes the lengths of 3n+1 sequences for a 
> block of integers, and then returns for another block of integers. The 
> results are stored in the database so that processes can use the results of 
> other processes' work. When they finish computing the results for the 
> problem, they shut down. So, for example, the overall problem may be to 
> compute the 3n+1 sequences for all integers from one through a million, with 
> processes working on blocks of 10,000 initial integers. There is nothing 
> special about the computation other than that it generates a lot of database 
> activity.
>
> This version runs fine, and always shuts down cleanly if allowed to run to 
> completion. But since the database is used for mission critical applications, 
> we have a number of stress tests. The threeenp1C2 test involves starting 15 
> cooperative worker processes, and then sending each process a SIGTERM or 
> SIGINT, depending on the test. Sporadically, one of the processes receiving 
> the signal shuts down generating a core file due to a SIGSEGV instead of 
> shutting down cleanly. That's the 10,000 ft view. Here are more details:
>
> The database engine is daemonless, and runs in the address space of each 
> process, with processes cooperating with one another to manage the database 
> using a variety of semaphores and data structures in shared memory segments, 
> as well as OS semaphores and mutexes. The database engine is single-threaded, 
> and when multiple threads in a process call the database engine, there are 
> mechanisms to ensure that non-reentrant code is called without reentrancy.
> The database engine is written in C, and traditionally has had a heavy 
> reliance on signals but with the Go wrapper calling the engine through cgo, 
> things were a bit dicey. So the code was reworked for use with Go such that 
> Go now handles the signals and lets the YottaDB engine know about them. To 
> that end, a goroutine is started for each signal type we want to know about 
> (around 17 of them), each of which then calls into a "signal dispatcher" in 
> the C code to drive signal handlers for the signals we are notified of.
> When a fatal signal such as a SIGTERM occurs, the goroutines started for 
> signal handling are all told to shut down, and we wait for them to do so 
> before driving a panic to stop the world (doing this reduced the failures 
> from their previous rate of nearly 100%). The current failure rate with the 
> core dumps is 3-10%. These strange failures are in the bowels of Go (usually 
> in either a futex call or something called newstack?). Empirical evidence 
> suggests the failure rate increases when the system is loaded - presumably 
> affecting the timing of how/when things shut down - though proof is hard to 
> come by.
> The database engine uses optimistic concurrency control to implement ACID 
> transactions. What this means is that Go application code (say Routine A) 
> calls the database engine through CGO, passing it a Go entry point (say 
> Routine B) that the database engine calls one or more times till it completes 
> the transaction (to avoid live-locks, in a final retry, should one be 
> required, the engine locks out all other accesses, essentially 
> single-threading the database). Routine B itself calls into the YottaDB API 
> through cgo. So mixed stacks of C and Go code are common.
> To avoid endangering database integrity, the engine attempts to shut down at 
> “safe” points. If certain fatal signals are received at an unsafe spot, we 
> defer handling the signal until execution reaches a safe place. To ensure 
> that everything stops when it reaches a safe place, the engine calls back 
> into Go and drives a panic of choice to shut down the entire process. I find 
> myself wondering if the "sandwich stack" of C and Go routines is somehow 
> causing the panic to generate cores.
>
> These failures are like nothing I've seen. Sometimes one of the threads that 
> we create in the C code is still around (it's been told to shut down but 
> evidently hasn't gotten around to it yet - like thread 11 in the gdb trace 
> below). It has a stack trace with ydb_stm_thread() in it and is generally 
> asleep on a timer. Note I'm using Go 1.14.4 on Ubuntu 18.04.
>
> Here is the list of goroutines from delve (which I've only just started 
> using, so I'm not well versed in it):
>
> [3:01pm] [estess@flyingv] : 
> /testarea/estess/tst_V994_R129_dbg_04_200701_144712/go_0_6/sigint/go/src/threeenp1C2
>  > dlv core threeenp1C2 core.88642 --check-go-version=false
> Type 'help' for list of commands.
> (dlv) goroutines
>   Goroutine 1 - User: /snap/go/5830/src/time/sleep.go:84 time.NewTimer 
> (0x4ed308) (thread 89352)
>   Goroutine 2 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 3 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 4 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 18 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 34 - User: /snap/go/5830/src/runtime/sigqueue.go:147 
> os/signal.signal_recv (0x45a74c) (thread 88678)
>   Goroutine 35 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 36 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 37 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 38 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 53 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
>   Goroutine 54 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
> * Goroutine 66 - User: /snap/go/5830/src/runtime/sys_linux_amd64.s:568 
> runtime.futex (0x476273) (thread 88642)
>   Goroutine 67 - User: /snap/go/5830/src/runtime/mgcmark.go:1241 
> runtime.scanobject (0x42fa96) (thread 88718)
>   Goroutine 68 - User: /snap/go/5830/src/runtime/proc.go:305 runtime.gopark 
> (0x445270)
> [15 goroutines]
> (dlv) bt
> 0  0x0000000000476273 in runtime.futex
>    at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> 1  0x0000000000471b50 in runtime.systemstack_switch
>    at /snap/go/5830/src/runtime/asm_amd64.s:330
> 2  0x000000000042b2db in runtime.gcMarkDone
>    at /snap/go/5830/src/runtime/mgc.go:1449
> 3  0x000000000042c38e in runtime.gcBgMarkWorker
>    at /snap/go/5830/src/runtime/mgc.go:2000
> 4  0x0000000000473c61 in runtime.goexit
>    at /snap/go/5830/src/runtime/asm_amd64.s:1373
> (dlv)
>
> And here's the thread list and traceback done in gdb:
>
> [3:04pm] [estess@flyingv] : 
> /testarea/estess/tst_V994_R129_dbg_04_200701_144712/go_0_6/sigint/go/src/threeenp1C2
>  > gdb threeenp1C2 core.88642
> GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
> Copyright (C) 2018 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from threeenp1C2...done.
> [New LWP 88642]
> [New LWP 88678]
> [New LWP 88718]
> [New LWP 89351]
> [New LWP 88681]
> [New LWP 89352]
> [New LWP 88682]
> [New LWP 88684]
> [New LWP 89203]
> [New LWP 88666]
> [New LWP 88740]
> [New LWP 88783]
> [New LWP 88738]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by 
> `/extra3/testarea1/estess/tst_V994_R129_dbg_04_200701_144712/go_0_6/sigint/go/sr'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> 568             MOVL    AX, ret+40(FP)
> [Current thread is 1 (Thread 0x7fc4a58621c0 (LWP 88642))]
> Loading Go Runtime support.
> (gdb) thread apply all where
>
> Thread 13 (Thread 0x7fc44ffff700 (LWP 88738)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0xc0001804c8, val=0, 
> ns=-1) at /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c25f in runtime.notesleep (n=0xc0001804c8) at 
> /snap/go/5830/src/runtime/lock_futex.go:151
> #3  0x0000000000448ce0 in runtime.stopm () at 
> /snap/go/5830/src/runtime/proc.go:1834
> #4  0x000000000044a2fd in runtime.findrunnable (gp=0xc00004b000, 
> inheritTime=false) at /snap/go/5830/src/runtime/proc.go:2366
> #5  0x000000000044ae3c in runtime.schedule () at 
> /snap/go/5830/src/runtime/proc.go:2526
> #6  0x000000000044b3bd in runtime.park_m (gp=0xc00008a900) at 
> /snap/go/5830/src/runtime/proc.go:2696
> #7  0x0000000000471b3b in runtime.mcall () at 
> /snap/go/5830/src/runtime/asm_amd64.s:318
> #8  0x0000000000000000 in ?? ()
>
> Thread 12 (Thread 0x7fc44bffd700 (LWP 88783)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0xc0001032c8, val=0, 
> ns=-1) at /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c25f in runtime.notesleep (n=0xc0001032c8) at 
> /snap/go/5830/src/runtime/lock_futex.go:151
> #3  0x0000000000448ce0 in runtime.stopm () at 
> /snap/go/5830/src/runtime/proc.go:1834
> #4  0x000000000044a2fd in runtime.findrunnable (gp=0xc00003e800, 
> inheritTime=false) at /snap/go/5830/src/runtime/proc.go:2366
> #5  0x000000000044ae3c in runtime.schedule () at 
> /snap/go/5830/src/runtime/proc.go:2526
> #6  0x000000000044b3bd in runtime.park_m (gp=0xc000105c80) at 
> /snap/go/5830/src/runtime/proc.go:2696
> #7  0x0000000000471b3b in runtime.mcall () at 
> /snap/go/5830/src/runtime/asm_amd64.s:318
> #8  0x0000000000000000 in ?? ()
>
> Thread 11 (Thread 0x7fc44dffe700 (LWP 88740)):
> #0  __clock_nanosleep (clock_id=1, flags=1, req=0x7fc44dffdec0, rem=0x0) at 
> ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
> #1  0x00007fc4a49b70f0 in ydb_stm_thread (dummy_parm=0x0) at 
> /Distrib/YottaDB/V994_R129/sr_unix/ydb_stm_thread.c:109
> #2  0x00007fc4a45796db in start_thread (arg=0x7fc44dffe700) at 
> pthread_create.c:463
> #3  0x00007fc4a42a288f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>
> Thread 10 (Thread 0x7fc47bfee700 (LWP 88666)):
> #0  runtime.usleep () at /snap/go/5830/src/runtime/sys_linux_amd64.s:146
> #1  0x000000000044fbfd in runtime.sysmon () at 
> /snap/go/5830/src/runtime/proc.go:4479
> #2  0x0000000000447793 in runtime.mstart1 () at 
> /snap/go/5830/src/runtime/proc.go:1097
> #3  0x00000000004476ae in runtime.mstart () at 
> /snap/go/5830/src/runtime/proc.go:1062
> #4  0x000000000053291c in crosscall_amd64 () at gcc_amd64.S:35
> #5  0x00007ffe8a519cd0 in ?? ()
> #6  0x0000000001c952a0 in ?? ()
> #7  0x0000000000000000 in ?? ()
>
> Thread 9 (Thread 0x7fc443fff700 (LWP 89203)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0xc000180848, val=0, 
> ns=-1) at /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c25f in runtime.notesleep (n=0xc000180848) at 
> /snap/go/5830/src/runtime/lock_futex.go:151
> #3  0x0000000000448ce0 in runtime.stopm () at 
> /snap/go/5830/src/runtime/proc.go:1834
> #4  0x0000000000449622 in runtime.startlockedm (gp=0xc000000180) at 
> /snap/go/5830/src/runtime/proc.go:2007
> #5  0x000000000044aba9 in runtime.schedule () at 
> /snap/go/5830/src/runtime/proc.go:2563
> #6  0x000000000044be36 in runtime.goexit0 (gp=0xc000183080) at 
> /snap/go/5830/src/runtime/proc.go:2855
> #7  0x0000000000471b3b in runtime.mcall () at 
> /snap/go/5830/src/runtime/asm_amd64.s:318
> #8  0x0000000000000000 in ?? ()
>
> Thread 8 (Thread 0x7fc45ffff700 (LWP 88684)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0x12ee618 
> <runtime.newmHandoff+24>, val=0, ns=-1) at 
> /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c25f in runtime.notesleep (n=0x12ee618 
> <runtime.newmHandoff+24>) at /snap/go/5830/src/runtime/lock_futex.go:151
> #3  0x0000000000448c02 in runtime.templateThread () at 
> /snap/go/5830/src/runtime/proc.go:1812
> #4  0x0000000000447793 in runtime.mstart1 () at 
> /snap/go/5830/src/runtime/proc.go:1097
> #5  0x00000000004476ae in runtime.mstart () at 
> /snap/go/5830/src/runtime/proc.go:1062
> #6  0x000000000053291c in crosscall_amd64 () at gcc_amd64.S:35
> #7  0x00007ffe8a519d70 in ?? ()
> #8  0x0000000001c956c0 in ?? ()
> #9  0x0000000000000000 in ?? ()
>
> Thread 7 (Thread 0x7fc46fffd700 (LWP 88682)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0xc000088148, val=0, 
> ns=-1) at /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c25f in runtime.notesleep (n=0xc000088148) at 
> /snap/go/5830/src/runtime/lock_futex.go:151
> #3  0x0000000000449378 in runtime.stoplockedm () at 
> /snap/go/5830/src/runtime/proc.go:1977
> #4  0x000000000044afe6 in runtime.schedule () at 
> /snap/go/5830/src/runtime/proc.go:2460
> #5  0x000000000044b3bd in runtime.park_m (gp=0xc00008a480) at 
> /snap/go/5830/src/runtime/proc.go:2696
> #6  0x0000000000471b3b in runtime.mcall () at 
> /snap/go/5830/src/runtime/asm_amd64.s:318
> #7  0x0000000000000000 in ?? ()
>
> Thread 6 (Thread 0x7fc435ffe700 (LWP 89352)):
> #0  0x0000000000545d84 in __tsan::MemoryRangeSet(__tsan::ThreadState*, 
> unsigned long, unsigned long, unsigned long, unsigned long long) [clone 
> .isra.176] [clone .part.177] ()
> #1  0x0000000000475a73 in racecall () at 
> /snap/go/5830/src/runtime/race_amd64.s:381
> #2  0x0000000000000000 in ?? ()
>
> Thread 5 (Thread 0x7fc471ffe700 (LWP 88681)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0xc000058bc8, val=0, 
> ns=-1) at /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c25f in runtime.notesleep (n=0xc000058bc8) at 
> /snap/go/5830/src/runtime/lock_futex.go:151
> #3  0x0000000000448ce0 in runtime.stopm () at 
> /snap/go/5830/src/runtime/proc.go:1834
> #4  0x000000000044a2fd in runtime.findrunnable (gp=0xc000046000, 
> inheritTime=false) at /snap/go/5830/src/runtime/proc.go:2366
> #5  0x000000000044ae3c in runtime.schedule () at 
> /snap/go/5830/src/runtime/proc.go:2526
> #6  0x000000000044b3bd in runtime.park_m (gp=0xc000105b00) at 
> /snap/go/5830/src/runtime/proc.go:2696
> #7  0x0000000000471b3b in runtime.mcall () at 
> /snap/go/5830/src/runtime/asm_amd64.s:318
> #8  0x0000000000000000 in ?? ()
>
> Thread 4 (Thread 0x7fc437fff700 (LWP 89351)):
> #0  runtime.epollwait () at /snap/go/5830/src/runtime/sys_linux_amd64.s:705
> #1  0x000000000043ed52 in runtime.netpoll (delay=9999949504, ~r1=...) at 
> /snap/go/5830/src/runtime/netpoll_epoll.go:119
> #2  0x000000000044a01b in runtime.findrunnable (gp=0xc00003e800, 
> inheritTime=false) at /snap/go/5830/src/runtime/proc.go:2329
> #3  0x000000000044ae3c in runtime.schedule () at 
> /snap/go/5830/src/runtime/proc.go:2526
> #4  0x000000000044b3bd in runtime.park_m (gp=0xc000105b00) at 
> /snap/go/5830/src/runtime/proc.go:2696
> #5  0x0000000000471b3b in runtime.mcall () at 
> /snap/go/5830/src/runtime/asm_amd64.s:318
> #6  0x0000000000000000 in ?? ()
>
> Thread 3 (Thread 0x7fc457fff700 (LWP 88718)):
> #0  runtime.scanobject (b=824635064320, gcw=0xc000042698) at 
> /snap/go/5830/src/runtime/mgcmark.go:1241
> #1  0x000000000042f30b in runtime.gcDrain (gcw=0xc000042698, flags=3) at 
> /snap/go/5830/src/runtime/mgcmark.go:1032
> #2  0x000000000046ec40 in runtime.gcBgMarkWorker.func2 () at 
> /snap/go/5830/src/runtime/mgc.go:1940
> #3  0x0000000000471bc6 in runtime.systemstack () at 
> /snap/go/5830/src/runtime/asm_amd64.s:370
> #4  0x0000000000447640 in ?? () at <autogenerated>:1
> #5  0x0000000000000000 in ?? ()
>
> Thread 2 (Thread 0x7fc473fff700 (LWP 88678)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f076 in runtime.futexsleep (addr=0x12ee720 <runtime.sig>, 
> val=0, ns=-1) at /snap/go/5830/src/runtime/os_linux.go:45
> #2  0x000000000041c336 in runtime.notetsleep_internal (n=0x12ee720 
> <runtime.sig>, ns=-1, ~r2=<optimized out>)
>     at /snap/go/5830/src/runtime/lock_futex.go:174
> #3  0x000000000041c53c in runtime.notetsleepg (n=0x12ee720 <runtime.sig>, 
> ns=-1, ~r2=<optimized out>) at /snap/go/5830/src/runtime/lock_futex.go:228
> #4  0x000000000045a74c in os/signal.signal_recv (~r0=<optimized out>) at 
> /snap/go/5830/src/runtime/sigqueue.go:147
> #5  0x0000000000515ce0 in os/signal.loop () at 
> /snap/go/5830/src/os/signal/signal_unix.go:23
> #6  0x0000000000473c61 in runtime.goexit () at 
> /snap/go/5830/src/runtime/asm_amd64.s:1373
> #7  0x0000000000000000 in ?? ()
>
> Thread 1 (Thread 0x7fc4a58621c0 (LWP 88642)):
> #0  runtime.futex () at /snap/go/5830/src/runtime/sys_linux_amd64.s:568
> #1  0x000000000043f0f4 in runtime.futexsleep (addr=0x8b35d0 
> <runtime.sched+304>, val=0, ns=100000) at 
> /snap/go/5830/src/runtime/os_linux.go:51
> #2  0x000000000041c3de in runtime.notetsleep_internal (n=0x8b35d0 
> <runtime.sched+304>, ns=100000, ~r2=<optimized out>)
>     at /snap/go/5830/src/runtime/lock_futex.go:193
> #3  0x000000000041c4b1 in runtime.notetsleep (n=0x8b35d0 <runtime.sched+304>, 
> ns=100000, ~r2=<optimized out>)
>     at /snap/go/5830/src/runtime/lock_futex.go:216
> #4  0x0000000000447dcc in runtime.forEachP (fn={void (runtime.p *)} 
> 0x7ffe8a519ef8) at /snap/go/5830/src/runtime/proc.go:1292
> #5  0x000000000046e7fe in runtime.gcMarkDone.func1 () at 
> /snap/go/5830/src/runtime/mgc.go:1456
> #6  0x0000000000471bc6 in runtime.systemstack () at 
> /snap/go/5830/src/runtime/asm_amd64.s:370
> #7  0x0000000000447640 in ?? () at <autogenerated>:1
> #8  0x0000000000471a54 in runtime.rt0_go () at 
> /snap/go/5830/src/runtime/asm_amd64.s:220
> #9  0x00000000005644d0 in ?? ()
> #10 0x0000000000471a5b in runtime.rt0_go () at 
> /snap/go/5830/src/runtime/asm_amd64.s:225
> #11 0x0000000000000003 in ?? ()
> #12 0x00007ffe8a51a058 in ?? ()
> #13 0x0000000000000003 in ?? ()
> #14 0x00007ffe8a51a058 in ?? ()
> #15 0x0000000000000000 in ?? ()
> (gdb) i goroutines
> * 1 running  time.NewTimer
>   2 waiting  runtime.gopark
>   3 waiting  runtime.gopark
>   4 waiting  runtime.gopark
>   18 waiting  runtime.gopark
> * 34 syscall  runtime.notetsleepg
>   35 waiting  runtime.gopark
> * 66 waiting  runtime.systemstack_switch
>   36 waiting  runtime.gopark
> * 67 waiting  runtime.systemstack_switch
>   53 waiting  runtime.gopark
>   68 waiting  runtime.gopark
>   37 waiting  runtime.gopark
>   38 waiting  runtime.gopark
>   54 waiting  runtime.gopark
>
> (gdb)
>
> In the testing I've done with tracing turned on, I have seen the panic begin, 
> seen the deferred yottadb.Exit() handler set in the main routine start up, 
> and watched the main routine's cleanup handler run - and then the process 
> cores with one of these strange failures on a futex access. And it is ALWAYS 
> the main thread of the process that fails. I really don't know at a low level 
> what is happening, so I have no idea how to fix it. Note this also seems to 
> have occurred after Go took down its signal handlers, as there was ZERO 
> output from this failure, so this was not a typical synchronous signal as Go 
> defines them. This failure seems to have been handled by the default 
> handlers - thus creating the core even though $GOTRACEBACK was not set.
>
> If anyone has any thoughts on what could be causing this, we would really 
> appreciate the suggestions.
>
> In case anyone would like to try to reproduce, below are links to the source 
> for the YottaDB runtime (would need to be built using directions in README), 
> the Go wrapper, and the specific test program I'm referring to. I'm sorry 
> this is not a nice tidy little package but unfortunately a lot of code is 
> involved and I've been unsuccessful trying to create this failure without the 
> full thing:
>
> YottaDB: https://gitlab.com/estess/YDB
> Go Wrapper: https://gitlab.com/estess/YDBGo/-/tree/develop
> threeenp1C2: 
> https://gitlab.com/estess/YDBTest/-/blob/master/go/inref/threeenp1C2.go
>
> Note this facility uses a yottadb.pc config file that is located in the 
> install directory of the YottaDB runtime. To run the test after everything is 
> built and installed, do the following:
>
> $ydb_dist/GDE exit {this runs GDE to create a global directory file [default 
> mumps.gld] for the database in the current directory}
> $ydb_dist/mupip create {this creates the database [default mumps.dat] in the 
> current directory}
> Run threeenp1C2 {It will sit and wait for input - it's just a test so has 
> limited glamor}
> Enter "1000000", which should have it running long enough to allow it to be 
> shot with a signal.
> From another session, do "kill threeenp1C2" to kill all of the spawned 
> processes. Each spawned process creates an output file. If there are no 
> cores, try again. It may be good to script it to run over and over until a 
> core occurs. Our failure rate is once in 20-30 runs, though failures happen 
> more often the more loaded the system is.
>
> If further information from this or any of the other cores would help, I can 
> certainly do that if you let me know what/how and where to do it.


A SIGSEGV while returning from runtime.futex doesn't make a lot of
sense.  The instruction that is showing the SIGSEGV is just writing a
value to the stack.  I can only think of two ways that that could
fail.

One would be if the stack is somehow being unmapped from memory.  This
code is running on what Go calls the g0 stack, which in a cgo program
like yours is allocated by calling the C function pthread_create.  So
it's not completely impossible that, as the program exits, it destroys
threads and unmaps their stacks while the goroutine is somehow still
running.  I don't really see how this could happen, but perhaps there
is a way.

Another thing that could cause this would be if the system call is
interrupted by a signal, which could certainly be happening in your
case, and that the signal handler is changing the stack pointer before
it returns.  Signal handlers are able to do that, but the Go signal
handler doesn't do it.  So again I don't really see how this could
happen but some sort of memory corruption might be happening somehow.

The crash traceback, which I don't think you showed above, should show
the faulting memory address.  That might help you decide whether the
stack pointer is corrupted or whether it might be pointing to memory
that was mapped but has been unmapped.  If gdb or delve can show you
the active memory mappings, that might also help; I think those should
be recorded in a core file, but I'm not sure about that.
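
[Editor's note: core files usually do record the mappings in an NT_FILE note, and newer gdb can display them; something along these lines (command availability varies by gdb version - `maintenance info sections` is a fallback that shows what the core captured):]

```
(gdb) info proc mappings     # mapped ranges recorded in the core
(gdb) p/x $sp                # compare the stack pointer against those ranges
(gdb) x/i $pc                # the faulting instruction
```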

That's what comes to mind, anyhow.  Hope it helps.  It could easily be
something I haven't thought of.

Ian
