Did you try running on an older release of Go, like 1.10?

> On Jul 2, 2019, at 11:53 AM, 'Yunchi Luo' via golang-nuts 
> <golang-nuts@googlegroups.com> wrote:
> 
> I'm not so much pointing my finger at GC as I am hoping GC logs could help 
> tell the story, and that someone with a strong understanding of GC in Go 
> could weigh in here. In the last 4 seconds before OOM, "TotalAlloc" increased 
> by only 80M, yet "HeapIdle" increased from 50M to 240M and RSS increased by 
> 810M. The numbers don't add up for me. A running sum of only 80M of heap 
> objects was allocated in the time RSS increased by 10X that. If GC were 
> completely off, I still wouldn't expect this to happen, which makes me want to 
> rule out
> GC being blocked as a problem. Maybe there was runaway heap reservation 
> because something in my code caused a ton of fragmentation? Is that sane? The 
> heap profile lacking clues is also strange.
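> 
> (For anyone reading along: the MemStats numbers I'm quoting are collected with 
> something like the sketch below. It's simplified from the real code, and 
> rssBytes is just my stand-in for however RSS actually gets read out of /proc 
> on Linux.)
> 
>     package main
> 
>     import (
>         "fmt"
>         "io/ioutil"
>         "os"
>         "runtime"
>         "strconv"
>         "strings"
>     )
> 
>     // rssBytes reads field 2 of /proc/self/statm (resident pages) and
>     // multiplies by the page size. Linux-only; errors collapsed for brevity.
>     func rssBytes() uint64 {
>         data, err := ioutil.ReadFile("/proc/self/statm")
>         if err != nil {
>             return 0
>         }
>         pages, _ := strconv.ParseUint(strings.Fields(string(data))[1], 10, 64)
>         return pages * uint64(os.Getpagesize())
>     }
> 
>     func main() {
>         var m runtime.MemStats
>         runtime.ReadMemStats(&m)
>         // If RSS tracks Sys but HeapInuse stays small, the gap is memory the
>         // runtime still owns (idle spans, MADV_FREE'd pages the kernel hasn't
>         // reclaimed). If RSS far exceeds Sys, it isn't the Go heap at all.
>         fmt.Printf("RSS=%dMB Sys=%dMB HeapSys=%dMB HeapInuse=%dMB HeapIdle=%dMB HeapReleased=%dMB\n",
>             rssBytes()>>20, m.Sys>>20, m.HeapSys>>20, m.HeapInuse>>20,
>             m.HeapIdle>>20, m.HeapReleased>>20)
>     }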
> 
> Regarding the possibility of a race, I forgot I do have goroutine profiles 
> captured along with the heap profiles at the time memory exploded. There are 
> only 10 goroutines running on the serving path, which rules out too many 
> concurrent requests being served (please correct me if I'm wrong). Those fan 
> out to 13 goroutines talking to the db, all of which are in 
> http.Transport.roundTrip (blocked on runtime.gopark, so I assume they are 
> waiting on the TCP connection). All other goroutines that don't originate 
> in the stdlib are also blocked on `select` or sleeping. Our CI does run with 
> `go test -race`, but I'll also try some load testing with a 
> race-detector-enabled binary.
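> 
> (For reference, the capture is roughly the following -- simplified, and the 
> file names and trigger here are made up, not what the service actually uses:)
> 
>     import (
>         "os"
>         "runtime/pprof"
>     )
> 
>     func dumpProfiles() {
>         if f, err := os.Create("goroutine.pprof"); err == nil {
>             // debug=2 writes full stacks for every goroutine, like a panic dump.
>             pprof.Lookup("goroutine").WriteTo(f, 2)
>             f.Close()
>         }
>         if h, err := os.Create("heap.pprof"); err == nil {
>             pprof.Lookup("heap").WriteTo(h, 0)
>             h.Close()
>         }
>     }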
> 
> Bakul, that is sound advice. I've actually been debugging this on and off for 
> a couple months now, with the help of several people, a few of which have 
> peer reviewed the code. I agree it's most likely some runaway code in my own 
> logic, but we haven't been able to pinpoint the cause, and I've run out of 
> hypotheses to test at the moment. This is why I've started asking on go-nuts@. 
> The circuit breaker code was one of the first things I checked; it has been 
> unit tested and verified working under load tests. Now that
> you mention it, I actually did uncover a Go stdlib bug in http2 while doing 
> the load tests... but that's unrelated.
> 
> 
>> On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah <ba...@bitblocks.com> wrote:
>> Before assuming it is the GC or something system related, you may wish to 
>> verify it is *not your own logic*. Larger RSS could also be due to your own 
>> logic touching more and more memory due to some runaway effect. The 
>> probability that this has to do with the GC is very low, given Go's very 
>> widespread use; the probability of a bug in your new code is much higher, 
>> given that it is exercised on a much, much smaller scale.
>> 
>> This has the "smell" of a concurrency bug. If I were you I'd test the code 
>> for any races, I'd review the code thoroughly with someone who doesn't know 
>> the code so that I'm forced to explain it, and I'd add plenty of assertions. 
>> I'd probably first look at the circuit breaker code -- things like how does 
>> it know how many concurrent connections exist?
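>> 
>> For example, the simplest breaker of that kind is just a semaphore (a sketch 
>> of the general idea, not your actual code), and its invariant is easy to 
>> assert on:
>> 
>>     // imports: net/http
>>     var sem = make(chan struct{}, 200) // capacity = allowed concurrent requests
>> 
>>     func handler(w http.ResponseWriter, r *http.Request) {
>>         select {
>>         case sem <- struct{}{}:
>>             defer func() { <-sem }()
>>         default:
>>             // Breaker trips: shed load instead of queueing.
>>             http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
>>             return
>>         }
>>         // ... serve the request ...
>>     }
>> 
>> If the real counter can under-count -- say it is decremented twice on an error 
>> path, or a request leaks a goroutine that is no longer counted -- the breaker 
>> will never trip even while work piles up.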
>> 
>> In general, for any hypothesis you come up with, you should have a test that 
>> *catches* the bug that hypothesis predicts. Elusive bugs tend to become more
>> elusive as you are on the hunt and as you may fix other problems you 
>> discover on the way.
>> 
>> I even suspect you're looking at GC logs a bit too early. Instrument your 
>> own code and look at what patterns emerge. [Not to mention that any time you 
>> spend understanding your own code will help improve your service, while a 
>> better understanding of (and ability to debug) the GC won't necessarily help you!]
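>> 
>> Even something as crude as the following (names made up) gives you an 
>> application-level signal to line up against the GC trace timestamps:
>> 
>>     // imports: log, net/http, runtime, sync/atomic, time
>>     var inFlight int64
>> 
>>     func countInFlight(next http.Handler) http.Handler {
>>         return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
>>             atomic.AddInt64(&inFlight, 1)
>>             defer atomic.AddInt64(&inFlight, -1)
>>             next.ServeHTTP(w, r)
>>         })
>>     }
>> 
>>     func logLoad() { // run in its own goroutine
>>         for range time.Tick(time.Second) {
>>             log.Printf("inflight=%d goroutines=%d",
>>                 atomic.LoadInt64(&inFlight), runtime.NumGoroutine())
>>         }
>>     }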
>> 
>>> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts 
>>> <golang-nuts@googlegroups.com> wrote:
>>> 
>>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>> 
>>> I'm trying to debug OOM on a service we are running in k8s. The service is 
>>> just a CRUD server hitting a database (DynamoDB). Each replica serves about 
>>> 300 qps of traffic. There are no memory leaks. On occasion (seemingly 
>>> correlated with small latency spikes on the backend), the service OOMs. 
>>> This is surprising because it has a circuit breaker that drops requests 
>>> after 200 concurrent connections yet never trips, and goroutine profiles 
>>> confirm that there are nowhere near 200 active goroutines.
>>> 
>>> GC logs are pasted below, interlaced with dumps of runtime.MemStats 
>>> (the RSS number comes from /proc/<pid>/stats). Go version is 1.12.5, 
>>> running in an Alpine 3.10 container on an Amazon kernel 
>>> 4.14.123-111.109.amzn2.x86_64.
>>> 
>>> The service happily serves requests using ~50MB of RSS for hours, until the 
>>> last 2 seconds, where GC mark&sweep time starts to grow 2-4X per cycle 
>>> (43+489/158/0.60+0.021 ms cpu => 86+990/401/0+0.14 ms cpu), and RSS and 
>>> Sys blow up. It's also interesting that the last log line shows `Sys=995MB 
>>> RSS=861MB HeapSys=199MB`. If I'm reading this correctly, there's at least 
>>> `662MB` of memory in RSS that is not assigned to the heap. This might be 
>>> due to the change in 1.12 to use MADV_FREE, so the pages are freeable but 
>>> not yet reclaimed by the kernel.
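>>> 
>>> (For completeness: the GC trace comes from running with GODEBUG=gctrace=1, 
>>> and, in case it matters for the MADV_FREE theory, the pre-1.12 MADV_DONTNEED 
>>> behavior can be restored with madvdontneed=1, e.g. -- "./service" is just a 
>>> placeholder name:)
>>> 
>>>     GODEBUG=gctrace=1 ./service
>>>     GODEBUG=gctrace=1,madvdontneed=1 ./service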
>>> 
>>> I don’t understand how heap can be so small across gc cycles (28->42->30MB 
>>> on the last line means heap doesn't grow past 42MB?), yet RSS keeps 
>>> growing. I'm assuming the increased RSS is causing the kernel to OOM the 
>>> service, but that should only happen if the RSS is not freeable (i.e. not 
>>> marked MADV_FREE). There doesn't seem to be any indication of that in the GC
>>> logs. I guess this all comes down to me not having a good understanding of 
>>> how the GC algorithm works and how to read these logs. I'd really 
>>> appreciate it if anyone can explain what's happening and why.
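>>> 
>>> (For reference, my reading of the gctrace line format, per the runtime 
>>> package docs:
>>> 
>>>     gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
>>> 
>>> where the clock/cpu terms are sweep termination + concurrent mark/scan + 
>>> mark termination, the cpu mark/scan term is split into assist/background/idle 
>>> time, and the three heap sizes are heap at GC start, heap at GC end, and live 
>>> heap after marking. So 28->42->30 below means the heap was 28MB when that 
>>> cycle started, 42MB when it finished, with 30MB live -- the heap itself never 
>>> gets much past ~42MB even while RSS climbs toward 861MB.)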
>>> 
>>> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
>>> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>>> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>>> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
>>> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
>>> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
>>> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>>> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
>>> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB 
>>> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
>>> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 0.035+0.38/2.0/5.7+0.017 
>>> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>>> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms cpu, 
>>> 11->11->6 MB, 12 MB goal, 4 P
>>> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB 
>>> Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB 
>>> HeapReleased=35MB
>>> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020 
>>> ms cpu, 11->12->9 MB, 12 MB goal, 4 P
>>> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024 
>>> ms cpu, 17->18->8 MB, 18 MB goal, 4 P
>>> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock, 0.035+0.13/8.5/177+0.022 
>>> ms cpu, 15->17->10 MB, 16 MB goal, 4 P
>>> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024 
>>> ms cpu, 19->23->15 MB, 20 MB goal, 4 P
>>> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB 
>>> Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB 
>>> HeapReleased=164MB
>>> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms 
>>> cpu, 26->30->17 MB, 30 MB goal, 4 P
>>> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu, 
>>> 28->42->30 MB, 34 MB goal, 4 P
>>> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB 
>>> Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB 
>>> HeapReleased=54MB
>>> 
>>> I also captured the OOM log from dmesg here 
>>> https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3.
>>> 
>>> -- 
>>> Yunchi
>>> 
>> 
> 
> 
> -- 
> Yunchi
