I have a fairly complex program that has about 150 goroutines servicing 
requests. Some requests go out to cgo, which does some network IO. There is 
one goroutine that is responsible for a heartbeat (etcd keepalive) and 
sometimes (~every two weeks) this goroutine doesn't get scheduled for long 
periods of time (more than ten seconds), so it misses the heartbeat.
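
For context, the heartbeat goroutine essentially watches an etcd keepalive channel. A minimal sketch of that shape using the clientv3 API (this is not the actual code; names, TTLs, and error handling differ) looks like:

package main

import (
    "context"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    // Grant a lease with a short TTL; if it isn't refreshed in time it
    // expires and the keepalive channel below is closed.
    lease, err := cli.Grant(context.Background(), 10)
    if err != nil {
        log.Fatal(err)
    }
    ch, err := cli.KeepAlive(context.Background(), lease.ID)
    if err != nil {
        log.Fatal(err)
    }
    for range ch {
        // each receive is a successful keepalive
    }
    // The channel closing means the lease could not be kept alive; the
    // real program treats this as fatal.
    log.Panic("ETCD LEASE CHANNEL CLOSED")
}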

I've been debugging this on and off for several months, and am completely 
sure it relates to scheduling, because there is code that looks similar to 
this:

fmt.Printf("before %v\n",time.Now())
<-time.After(500*time.Millisecond)
fmt.Printf("after %v\n",time.Now())

And the print statements show times more than ten seconds apart.

I considered that it might be a very long GC stop-the-world pause (perhaps 
caused by some rogue non-preemptible goroutine), but I don't see any output 
from gctrace=1 that indicates that (though I could be wrong).
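
For what it's worth, a quick way to cross-check that (a diagnostic sketch of my own, not something the program currently does) is to read the recent stop-the-world pause history straight out of runtime.MemStats:

package main

import (
    "fmt"
    "runtime"
    "time"
)

// printRecentGCPauses prints the most recent stop-the-world pause
// durations, newest first, so a multi-second pause would stand out even
// if a gctrace line was missed.
func printRecentGCPauses(max int) {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    n := int(ms.NumGC)
    if n > max {
        n = max
    }
    if n > len(ms.PauseNs) {
        n = len(ms.PauseNs)
    }
    for i := 0; i < n; i++ {
        // PauseNs is a circular buffer; the most recent pause lives at
        // index (NumGC+255)%256.
        idx := (int(ms.NumGC) + 255 - i) % 256
        fmt.Printf("GC pause -%d: %v\n", i, time.Duration(ms.PauseNs[idx]))
    }
    fmt.Printf("total GC pause: %v over %d cycles\n",
        time.Duration(ms.PauseTotalNs), ms.NumGC)
}

func main() {
    printRecentGCPauses(10)
}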

The goroutines that call into cgo can take some time to complete their 
operations (normally tens of milliseconds, sometimes tens of seconds), but I 
have never observed more than about 200 OS threads being created in response 
to a blocking M being detached from its P and a new M being started.
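
A rough way to keep an eye on that from inside the process (again just a sketch, not something the program does today) is to poll the standard threadcreate profile:

package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "time"
)

// logThreads periodically prints how many OS threads the runtime has
// created, which should roughly track the extra Ms spawned while cgo
// calls are blocked.
func logThreads(every time.Duration) {
    for {
        fmt.Printf("goroutines=%d threads created=%d\n",
            runtime.NumGoroutine(), pprof.Lookup("threadcreate").Count())
        time.Sleep(every)
    }
}

func main() {
    go logThreads(time.Second)
    time.Sleep(5 * time.Second) // stand-in for the real workload
}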

The total CPU usage doesn't exceed 60%, so there are spare CPU cycles.

I turned on the scheduler trace, and here is the output from around one 
problematic event:
# this is where we select on <-time.After(500ms)
[etcddebug] start of select 2019-08-01 06:29:55.873680674 +0000 UTC 
m=+132838.232219823
SCHED 132838304ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=102 runqueue=2 [0 0 3 0 0 0 0 0]
[INFO] 2019-08-01T06:29:56.084Z <some log output from another goroutine>
SCHED 132839312ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=105 runqueue=9 [0 0 0 0 5 0 0 0]
<some output from another goroutine>
SCHED 132840318ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=106 runqueue=12 [0 0 0 0 5 0 0 0]
SCHED 132841325ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=106 runqueue=12 [0 0 0 0 5 0 0 0]
SCHED 132842331ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=107 runqueue=13 [0 0 0 0 5 0 0 0]
SCHED 132843338ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=107 runqueue=13 [0 0 0 0 5 0 0 0]
SCHED 132844344ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=107 runqueue=13 [0 0 0 0 5 0 0 0]
SCHED 132845351ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=107 runqueue=14 [0 0 0 0 5 0 0 0]
SCHED 132846357ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=107 runqueue=14 [0 0 0 0 5 0 0 0]
SCHED 132847364ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=107 runqueue=15 [0 0 0 0 5 0 0 0]
# This is where the program gets killed because the <-time.After goroutine 
never ran
[ERROR] 2019-08-01T06:30:05.472Z clustering.go:259 ETCD LEASE CHANNEL CLOSED


The program panics soon after that message (on purpose), and I get this 
printout from the scheduler trace. I'm not sure what triggers it.

SCHED 132847848ms: gomaxprocs=8 idleprocs=0 threads=123 spinningthreads=0 
idlethreads=97 runqueue=2 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
P0: status=2 schedtick=714575223 syscalltick=286704429 m=-1 runqsize=0 
gfreecnt=41
# this is the <-time.After goroutine finally running
[etcddebug] time.After 2019-08-01 06:30:05.473046143 +0000 UTC 
m=+132847.831585336 (227)
P1: status=1 schedtick=714230235 syscalltick=286072819 m=36 runqsize=0 
gfreecnt=11
P2: status=2 schedtick=710948022 syscalltick=284960027 m=-1 runqsize=0 
gfreecnt=53
P3: status=1 schedtick=707003392 syscalltick=283311908 m=88 runqsize=1 
gfreecnt=53
P4: status=2 schedtick=698758214 syscalltick=280825500 m=-1 runqsize=0 
gfreecnt=23
P5: status=2 schedtick=690236090 syscalltick=278745170 m=-1 runqsize=0 
gfreecnt=38
P6: status=1 schedtick=685940734 syscalltick=276797704 m=77 runqsize=0 
gfreecnt=51
P7: status=2 schedtick=681779831 syscalltick=275239769 m=-1 runqsize=0 
gfreecnt=50



I've tried several times to create a simple reproducing program, and have 
failed. I'm reaching out to ask if anyone could help me interpret the output 
from the scheduler trace, or perhaps even suggest what might be causing this, 
so that I can create a standalone reproducer.
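
For reference, the kind of standalone reproducer I have in mind looks roughly like the sketch below; the blocking cgo work is stood in for by a blocking nanosleep syscall (Linux only), which may well be why my attempts so far haven't reproduced the stall:

package main

import (
    "fmt"
    "syscall"
    "time"
)

// worker stands in for a request-serving goroutine that calls into cgo.
// The blocking nanosleep syscall keeps the M in a syscall long enough
// that the runtime hands the P off, roughly like a blocking cgo call.
func worker() {
    for {
        ts := syscall.Timespec{Sec: 0, Nsec: 50 * 1000 * 1000} // ~50ms
        syscall.Nanosleep(&ts, nil)
    }
}

// heartbeat mimics the etcd keepalive goroutine: it expects to run every
// 500ms and complains if it is not scheduled for several seconds.
func heartbeat() {
    for {
        before := time.Now()
        <-time.After(500 * time.Millisecond)
        if d := time.Since(before); d > 5*time.Second {
            fmt.Printf("heartbeat delayed by %v\n", d)
        }
    }
}

func main() {
    for i := 0; i < 150; i++ {
        go worker()
    }
    heartbeat()
}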

Thanks
Michael
