We do have a convenient sampler that tells us the state of all of the 
goroutines, Ps, and OS threads: schedtrace+scheddetail.  The traces in my 
first post were one example, but I think I can find/repro a simpler case where 
we are stuck harder, with nothing changing until the SIGTERM comes in.
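For reference, both samplers are turned on through GODEBUG (binary name and log path here are placeholders):

```shell
# Dump scheduler state to stderr every 1000 ms; scheddetail=1 adds
# per-goroutine, per-P, and per-M detail to each schedtrace dump.
GODEBUG=schedtrace=1000,scheddetail=1 ./myapp 2>sched.log
```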


> On Dec 12, 2018, at 3:54 PM, robert engels <reng...@ix.netcom.com> wrote:
> 
> I think the allocation triggers the GC, but it is stuck in gcStart because it 
> is waiting for all other goroutines to come to a safe point. I would grab 
> a trace (though I’m not sure you can capture a trace if things are “stuck”) 
> and review that. Or maybe use an external sampler like ‘Instruments’ on OS X, 
> or similar. It certainly seems like some goroutine is spinning, preventing GC 
> from happening and pausing everything else.
> 
>> On Dec 12, 2018, at 5:20 PM, Eric Hamilton <e...@kasten.io> wrote:
>> 
>> Thanks for the various possibilities.  It’s helpful to know things to look 
>> out for.
>> 
>> I don’t think there’s a tight polling loop in the code— at least I haven’t 
>> found it yet.
>> 
>> We’re using some open source packages, notably Kanister 
>> (https://github.com/kanisterio), which Kasten contributes, Stow 
>> (https://github.com/graymeta/stow), and in this case some Google cloud APIs 
>> through Stow.
>> 
>> On more than one occasion it has looked like the application goroutine gets 
>> stuck allocating an 8MB byte slice.  (8MB is the default chunk size in one 
>> of the Google cloud storage APIs and Stow lets it default.)
>> 
>> I don’t know whether an 8MB malloc is a problem per se or perhaps that a few 
>> of those trigger GC and take the shouldhelpgc path.
>> 
>> It’s most likely that the problem is in our code somewhere, or in settings 
>> that interact with our usage pattern of growing and shrinking the live 
>> heap.  I have wondered whether we might be hitting a corner case in the 
>> runtime (missed wakeup?).  Given the wide usage of Go across a variety of 
>> different types of programs, I doubt it, but something sure looks funny 
>> with long StopTheWorld times.
>> 
>> Any additional thoughts on diagnosis or experiments?  If I can’t figure out 
>> exactly what’s happening I’ll see about a workaround such as explicit calls 
>> to runtime.GC() at safe spots or periodically, etc.
>> 
>> 
>> Eric
>> 
>> 
>> p.s. Here’s the top of a stack trace triggered with a panic in the SIGTERM 
>> handler.  It shows the make([]byte, 0, 8*1024*1024) triggering gcStart().
>> 
>> 
>> goroutine 15 [runnable]:
>> runtime.gcStart(0x0, 0x1, 0x0, 0x0)
>>      /usr/local/go/src/runtime/mgc.go:1236 +0x583 fp=0xc000cd7540 sp=0xc000cd7538 pc=0x419e63
>> runtime.mallocgc(0x800000, 0x1c04f80, 0x1, 0x600)
>>      /usr/local/go/src/runtime/malloc.go:996 +0x40e fp=0xc000cd75e0 sp=0xc000cd7540 pc=0x40bb0e
>> runtime.makeslice(0x1c04f80, 0x0, 0x800000, 0xc00022ad50, 0xc000cd7678, 0x15cfcdc)
>>      /usr/local/go/src/runtime/slice.go:70 +0x77 fp=0xc000cd7610 sp=0xc000cd75e0 pc=0x4414f7
>> vendor/google.golang.org/api/gensupport.NewMediaBuffer(...)
>>      /repo/go/src/vendor/google.golang.org/api/gensupport/buffer.go:26
>> vendor/google.golang.org/api/gensupport.PrepareUpload(0x23aa900, 0xc000039b30, 0x800000, 0x0, 0x23aa900, 0xc000039b30)
>>      /repo/go/src/vendor/google.golang.org/api/gensupport/media.go:188 +0x50 fp=0xc000cd7688 sp=0xc000cd7610 pc=0x15d03e0
>> vendor/google.golang.org/api/storage/v1.(*ObjectsInsertCall).Media(0xc000b2e780, 0x23a5a80, 0xc00022ad50, 0x0, 0x0, 0x0, 0xe)
>>      /repo/go/src/vendor/google.golang.org/api/storage/v1/storage-gen.go:8622 +0xb8 fp=0xc000cd7718 sp=0xc000cd7688 pc=0x195c188
>> vendor/github.com/graymeta/stow/google.(*Container).Put(0xc000de4be0, 0xc00058a871, 0x86, 0x23a5a80, 0xc00022ad50, 0x45, 0xc00022ad80, 0x50, 0xc000cd7840, 0x40c278, ...)
>>      /repo/go/src/vendor/github.com/graymeta/stow/google/container.go:144 +0x186 fp=0xc000cd77d0 sp=0xc000cd7718 pc=0x1969d36
>> github.com/kanisterio/kanister/pkg/objectstore.(*directory).Put(0xc000de5b20, 0x23c9740, 0xc00090b200, 0x20fdcf9, 0xe, 0x23a5a80, 0xc00022ad50, 0x45, 0x0, 0xc000171ad0, ...)
>>      /repo/go/src/github.com/kanisterio/kanister/pkg/objectstore/directory.go:206 +0xe9 fp=0xc000cd7850 sp=0xc000cd77d0 pc=0x1973769
>> github.com/kanisterio/kanister/pkg/objectstore.(*directory).PutBytes(0xc000de5b20, 0x23c9740, 0xc00090b200, 0x20fdcf9, 0xe, 0xc000e4e410, 0x45, 0x50, 0x0, 0x9d895a, ...)
>>      /repo/go/src/github.com/kanisterio/kanister/pkg/objectstore/directory.go:212 +0xd5 fp=0xc000cd78b8 sp=0xc000cd7850 pc=0x19739a5
>> 
>> 
>> 
>> 
>>> On Dec 11, 2018, at 8:21 PM, robert engels <reng...@ix.netcom.com> wrote:
>>> 
>>> You might want to review this https://github.com/golang/go/issues/10958
>>> 
>>>> On Dec 11, 2018, at 10:16 PM, robert engels <reng...@ix.netcom.com> wrote:
>>>> 
>>>> Reviewing the code, the 5s of CPU time is technically the stop-the-world 
>>>> (STW) sweep termination.
>>>> 
>>>> So, I think the cause of your problem is that you have a tight / 
>>>> “infinite” loop that is not calling runtime.Gosched(), so it is taking a 
>>>> very long time for the sweep termination to complete.
>>>> 
>>>> 
>>>>> On Dec 11, 2018, at 9:43 PM, robert engels <reng...@ix.netcom.com> wrote:
>>>>> 
>>>>> Well, your pause is clearly related to the GC - the first phase, the 
>>>>> mark, is 5s in #17. Are you certain you don’t have an incorrect highly 
>>>>> recursive loop that is causing the stack marking to take a really long 
>>>>> time… ?
>>>>> 
>>>>> 
>>>>>> On Dec 11, 2018, at 8:45 PM, Eric Hamilton <e...@kasten.io> wrote:
>>>>>> 
>>>>>> Of course.  (I forgot about that option and I'd collected those traces 
>>>>>> at a time when I thought I'd ruled out GC-- probably misread MemStats 
>>>>>> values).
>>>>>> 
>>>>>> Here's the gctrace output for a repro with a 5.6 second delay ending 
>>>>>> with finish of gc17 and a 30+ second delay ending in SIGTERM (which 
>>>>>> coincides with completion of gc18):
>>>>>> 
>>>>>> gc 1 @0.022s 1%: 0.94+10+0.19 ms clock, 0.94+0.65/0.60/5.3+0.19 ms cpu, 
>>>>>> 4->4->1 MB, 5 MB goal, 4 P
>>>>>> gc 2 @0.048s 3%: 0.007+5.9+0.17 ms clock, 0.007+0.24/5.7/1.0+0.17 ms 
>>>>>> cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>>>> gc 3 @0.062s 3%: 0.004+3.7+0.18 ms clock, 0.004+0.19/0.78/2.8+0.18 ms 
>>>>>> cpu, 4->4->1 MB, 5 MB goal, 4 P
>>>>>> gc 4 @0.075s 13%: 0.003+3.4+69 ms clock, 0.003+0.069/0.99/2.3+69 ms cpu, 
>>>>>> 4->4->2 MB, 5 MB goal, 4 P
>>>>>> gc 5 @0.154s 12%: 0.007+7.4+0.032 ms clock, 0.007+0.15/0/5.1+0.032 ms 
>>>>>> cpu, 4->5->3 MB, 5 MB goal, 4 P
>>>>>> gc 6 @0.234s 10%: 0.004+34+0.025 ms clock, 0.004+0/33/6.8+0.025 ms cpu, 
>>>>>> 6->7->5 MB, 7 MB goal, 4 P
>>>>>> gc 7 @0.296s 8%: 0.005+28+0.031 ms clock, 0.005+0/0.99/50+0.031 ms cpu, 
>>>>>> 10->10->5 MB, 11 MB goal, 4 P
>>>>>> gc 8 @0.416s 6%: 0.007+49+1.1 ms clock, 0.007+0.13/4.5/32+1.1 ms cpu, 
>>>>>> 11->11->6 MB, 12 MB goal, 4 P
>>>>>> gc 9 @0.591s 5%: 0.006+23+0.032 ms clock, 0.006+0/6.5/31+0.032 ms cpu, 
>>>>>> 12->12->7 MB, 13 MB goal, 4 P
>>>>>> gc 10 @0.640s 5%: 0.006+10+0.032 ms clock, 0.006+0/3.6/12+0.032 ms cpu, 
>>>>>> 13->13->8 MB, 14 MB goal, 4 P
>>>>>> gc 11 @120.657s 0%: 0.012+11+0.030 ms clock, 0.012+0/1.7/23+0.030 ms 
>>>>>> cpu, 13->13->9 MB, 16 MB goal, 4 P
>>>>>> gc 12 @240.674s 0%: 0.011+12+0.037 ms clock, 0.011+0/6.1/27+0.037 ms 
>>>>>> cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>>>> gc 13 @360.691s 0%: 0.011+10+0.022 ms clock, 0.011+0/2.7/25+0.022 ms 
>>>>>> cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>>>> gc 14 @480.711s 0%: 0.018+11+0.021 ms clock, 0.018+0/4.1/27+0.021 ms 
>>>>>> cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>>>> gc 15 @600.727s 0%: 0.016+14+0.030 ms clock, 0.016+0/7.0/30+0.030 ms 
>>>>>> cpu, 11->11->9 MB, 18 MB goal, 4 P
>>>>>> gc 16 @720.746s 0%: 0.016+13+0.023 ms clock, 0.016+0/5.7/27+0.023 ms 
>>>>>> cpu, 10->10->9 MB, 18 MB goal, 4 P
>>>>>> gc 17 @816.508s 0%: 5560+26+0.067 ms clock, 5560+7.3/26/22+0.067 ms cpu, 
>>>>>> 18->19->11 MB, 19 MB goal, 4 P
>>>>>> gc 18 @824.133s 1%: 40566+16+0.035 ms clock, 40566+6.8/15/10+0.035 ms 
>>>>>> cpu, 23->24->18 MB, 24 MB goal, 4 P
>>>>>> 
>>>>>> I've attached a file that also includes the schedtrace=1000, health 
>>>>>> check log messages, key app logs, and the runtime.MemStats collected at 
>>>>>> the end of the same run.
>>>>>> 
>>>>>> Thanks again for any insights,
>>>>>> Eric
>>>>>> 
>>>>>> 
>>>>>> On Tuesday, December 11, 2018 at 4:32:33 PM UTC-8, robert engels wrote:
>>>>>> I think it would be more helpful if you used gctrace=1 to report on the 
>>>>>> GC activity.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "golang-nuts" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>>> an email to golang-nuts+unsubscr...@googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>> <trace.medium.txt>
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
