(I don't suspect GC either, it was a rule it out suggestion)

On Mon, Jul 1, 2019 at 10:20 PM robert engels <reng...@ix.netcom.com> wrote:

> One other thing to think about - 500 qps is a lot for a single server if
> things start blocking and you have a low memory cap. Imagine the scenario
> where the DynamoDB “locks up / heavy delays due to IO contention /
> something”, depending on how your server is written/configured it may still
> keep accepting inbound connections, each potentially causing a thread to be
> created (if they are doing blocking IO). If there is not a low cap on the
> number of inbound connections you could quickly run out of memory. You
> might want to see if there is a configuration option on grpc-gateway to
> limit the maximum number of simultaneous inbound connections, set that to a
> low number, and see if it still runs out of memory.
>
> This scenario is unlikely if you didn’t see the TCPConn instances in the
> heap dump.
>
> On Jul 1, 2019, at 11:28 PM, Mighty Guava <mightygu...@gmail.com> wrote:
>
> Oh now that you mention it, I'm using grpc-gateway with the stdlib http
> server. Though all it really does is convert json <=> protobuf and is
> pretty mature so it's unlikely to be the culprit.
>
> Running with race detector is a good idea! I'll try that tomorrow.
>
> Thanks,
> Yunchi Luo
> On Jul 2, 2019, 12:25 AM -0400, robert engels <reng...@ix.netcom.com>,
> wrote:
>
> Are you using anything like fast-http, or such? Have you tried running the
> server with the race-detector enabled? I could see a race condition causing
> a rare failure that causes the code to go into a ‘spinning loop’ preventing
> GC to work properly. It’s a guess but I would try that to rule out the
> possibility.
>
> On Jul 1, 2019, at 11:06 PM, Mighty Guava <mightygu...@gmail.com> wrote:
>
> I don't think GC is being blocked. GC ran multiple times during the time
> the service was exploding in memory.
>
> I'm not using mmap in the service. It's just a straightforward CRUD
> webserver backed by DynamoDB. The only thing non-standard I can think of is
> that it connects to its backing database over HTTP/1.1 (using AWS Go SDK)
> instead of a custom database protocol. It's serving a fair amount of
> requests (500 qps per replica), so I initially assumed it was
> under-provisioned to handle latency spikes. But I tripled the memory and
> it's still OOMing on occasion, even though RSS is 9% of allocated memory
> just seconds before.
>
> Yunchi
> On Jul 2, 2019, 12:00 AM -0400, Robert Engels <reng...@ix.netcom.com>,
> wrote:
>
> Does your process use mmap? Maybe you are leaking there, as this counts
> against process memory size.
>
> On Jul 1, 2019, at 9:11 PM, Mighty Guava <mightygu...@gmail.com> wrote:
>
> I don't understand. What would adding runtime.Goscheduled() do here? I
> don't have any explicit loops in this service.
> On Jul 1, 2019, 9:11 PM -0400, Michael Jones <michael.jo...@gmail.com>,
> wrote:
>
> Does adding runtime.GoSched() calls make any difference?
>
> On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
> Following that logic, a leak of TCP connections should manifest as a file
> descriptor leak. We have the process_open_fds metric from Prometheus, that
> is the number of open file descriptors as found in /proc/<pid>/fd. The
> number of descriptors overtime correlates well with the amount of traffic,
> pretty cyclic. There doesn't appear to be a leak.
>
> We don't do our own memory management and the binary is compiled with
> CGO_ENABLED=0.
>
> I still think the issue I'm seeing should be GC (or heap) related, given
> the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased
> just before the process dies. But I'm lacking ideas on how to track down
> the cause of the increase.
>
> On Mon, Jul 1, 2019 at 5:39 PM Robert Engels <reng...@ix.netcom.com>
> wrote:
>
> I think don't think you are going to find it in the 'heap', rather it
> would be in native memory.
>
> I would use the monitor the /proc/[pid] for the process, and pay attention
> to the 'fd','net' and 'statm' - if my theory is correct you will see growth
> here long before the process is killed. Since you are running under k8s and
> cgroups, you will need to do this along side the Go process (unless you
> have root access to the server).
>
> I 'think' depending on kernel version, that kernel memory used goes
> against the process for OOM purposes, so this is a likely candidate if
> pprof is showing nothing.
>
> Do you by chance do any of your own memory management (via malloc/CGO)? If
> so, this is not going to show in pprof either.
>
> -----Original Message-----
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 4:26 PM
> To: Robert Engels
> Cc: golang-nuts@googlegroups.com, Alec Thomas
> Subject: Re: [go-nuts] OOM occurring with a small heap
>
> I actually have a heap profile (pasted at the bottom) from about 1 second
> before the service died (the goroutine that is logging "[Memory]" triggers
> heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe
> it's too few to be sampled. How would I verify your theory? That the
> service dies within 2 seconds after several hours makes it very hard to
> debug.
>
> The top thing in the heap profile is from the reflect package. I initially
> found that suspect, but it turns out this comes from a use of
> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>
>  tracer := &httptrace.ClientTrace{
>    ConnectStart: func(_, _ string) {
>      newConns.Inc()
>    },
> }
>
> `newConns` is just a prometheus counter. The `tracer` object itself is
> created once and re-used with every client request context. On request,
> `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the
> trace functions under-the-hood and uses reflection to invoke it, hence the
> reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for
> about 50% of heap in terms of size, and does seem to grow as the service is
> running out of memory, but that's only 12MB so I thought it was a red
> herring.
>
> Heap profile:
> Type: inuse_space
> Time: Jun 30, 2019 at 4:46am (EDT)
> Entering interactive mode (type "help" for commands, "o" for options)
> (pprof) inuse_objects
> (pprof) top
> Showing nodes accounting for 414485, 100% of 414485 total
> Showing top 10 nodes out of 81
>       flat  flat%   sum%        cum   cum%
>     344074 83.01% 83.01%     344074 83.01%  reflect.funcLayout.func1
>      32768  7.91% 90.92%     376842 90.92%  reflect.callReflect
>      16384  3.95% 94.87%      16384  3.95%
> github.com/json-iterator/go.processTags
>      10923  2.64% 97.51%      10923  2.64%  context.WithValue
>       8192  1.98% 99.48%       8192  1.98%  crypto/hmac.New
>       1260   0.3% 99.79%       1260   0.3%
> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>        820   0.2%   100%        820   0.2%
> github.com/stripe/veneur/tdigest.NewMerging
>         64 0.015%   100%         64 0.015%  reflect.addReflectOff
>          0     0%   100%        820   0.2%
> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
>          0     0%   100%        820   0.2%
> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
> (pprof) cum
> (pprof) top
> Showing nodes accounting for 376842, 90.92% of 414485 total
> Showing top 10 nodes out of 81
>       flat  flat%   sum%        cum   cum%
>          0     0%     0%     376842 90.92%
>  net/http/httptrace.(*ClientTrace).compose.func1
>          0     0%     0%     376842 90.92%  reflect.Value.Call
>          0     0%     0%     376842 90.92%  reflect.Value.call
>      32768  7.91%  7.91%     376842 90.92%  reflect.callReflect
>          0     0%  7.91%     376842 90.92%  reflect.makeFuncStub
>     344074 83.01% 90.92%     344074 83.01%  reflect.funcLayout.func1
>          0     0% 90.92%     344074 83.01%  sync.(*Pool).Get
>          0     0% 90.92%      16448  3.97%
> github.com/json-iterator/go._createDecoderOfType
>          0     0% 90.92%      16448  3.97%
> github.com/json-iterator/go.createDecoderOfType
>          0     0% 90.92%      16448  3.97%
> github.com/json-iterator/go.decoderOfStruct
>
>
> On Mon, Jul 1, 2019 at 4:32 PM Robert Engels <reng...@ix.netcom.com>
> wrote:
>
>
> A leak of the TCP connections (maybe not properly closed)? Each TCP
> connection will use kernel memory and process memory (local buffers), that
> won't be on the heap (the reference to the TCP connection will be in the Go
> heap, but is probably much smaller than the buffer allocation).
>
> That would be my guess - but just a guess.
>
> -----Original Message-----
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 2:14 PM
> To: golang-nuts@googlegroups.com
> Cc: Alec Thomas
> Subject: [go-nuts] OOM occurring with a small heap
>
> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>
> I'm trying to debug OOM on a service we are running in k8s. The service is
> just a CRUD server hitting a database (DynamoDB). Each replica serves about
> 300 qps of traffic. There are no memory leaks. On occasion (seemingly
> correlated to small latency spikes on the backend), the service would OOM.
> This is surprising because it has a circuit breaker that drops requests
> after 200 concurrent connections that has never trips, and goroutine
> profiles confirm that there are nowhere 200 active goroutines.
>
> GC logs are pasted below. It's interlaced with dumps of runtime.Memstats
> (the RSS number is coming from /proc/<pid>/stats). Go version is 1.12.5,
> running an Alpine 3.10 container in an Amazon kernel
> 4.14.123-111.109.amzn2.x86_64.
>
> The service happily serves requests using ~50MB of RSS for hours, until
> the last 2 seconds, where GC mark&sweep time starts to 2-4X per cycle 
> (43+489/158/0.60+0.021
> ms cpu => 43+489/158/0.60+0.021 ms cpu), and RSS and Sys blow up. It’s
> also interesting that in the last log line: `Sys=995MB RSS=861MB
> HeapSys=199MB`. If I’m reading this correctly, there’s at least `662MB` of
> memory in RSS that is not assigned to the heap. Though this might be due to
> the change in 1.125 to use MADV_FREE, so the pages are freeable not yet
> reclaimed by the kernel.
>
> I don’t understand how heap can be so small across gc cycles (28->42->30MB
> on the last line means heap doesn't grow past 42MB?), yet RSS keeps
> growing. I'm assuming the increased RSS is causing the kernel to OOM the
> service, but that should only happen if the RSS is not freeable as marked
> by MADV_FREE. There doesn't seem to be any indication of that from the GC
> logs. I guess this all comes down to me not having a good understanding of
> how the GC algorithm works and how to read these logs. I'd really
> appreciate it if anyone can explain what's happening and why.
>
> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock,
> 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock,
> 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016
> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock,
> 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock,
> 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock,
> 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock,
> 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014
> ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock,
> 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock,
> 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock,
> 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB
> Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock,
> 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms
> cpu, 11->11->6 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB
> Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB
> HeapReleased=35MB
> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020
> ms cpu, 11->12->9 MB, 12 MB goal, 4 P
> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024
> ms cpu, 17->18->8 MB, 18 MB goal, 4 P
> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock,
> 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024
> ms cpu, 19->23->15 MB, 20 MB goal, 4 P
> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB
> Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB
> HeapReleased=164MB
> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms
> cpu, 26->30->17 MB, 30 MB goal, 4 P
> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu,
> 28->42->30 MB, 34 MB goal, 4 P
> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB
> Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB
> HeapReleased=54MB
>
> I also captured the OOM log from dmesg here
> https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3.
>
> --
> Yunchi
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/golang-nuts/CANnT9sj1_sZCKDkGbkzarwcn8DYEX9OS6Ack%2B71613eyLQ7y6w%40mail.gmail.com
> <https://groups.google.com/d/msgid/golang-nuts/CANnT9sj1_sZCKDkGbkzarwcn8DYEX9OS6Ack%2B71613eyLQ7y6w%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
>
>
>
>
> --
> Yunchi
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/golang-nuts/CANnT9siMD4L5oC0nvmGYYM3Qo4rVSSCT%2BK5__fR9%3DzuyXV6S0Q%40mail.gmail.com
> <https://groups.google.com/d/msgid/golang-nuts/CANnT9siMD4L5oC0nvmGYYM3Qo4rVSSCT%2BK5__fR9%3DzuyXV6S0Q%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
>
>
>
>
> --
> Yunchi
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/golang-nuts/CANnT9sjEo%2Bp2HYOVDjB_EzaGE474QuBy-9Yp_HaCn95wHSFsHQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/golang-nuts/CANnT9sjEo%2Bp2HYOVDjB_EzaGE474QuBy-9Yp_HaCn95wHSFsHQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
> --
>
> *Michael T. Jones michael.jo...@gmail.com <michael.jo...@gmail.com>*
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/golang-nuts/CALoEmQz%3DsJeq%3DpiDKQaa-uyTFXUsN69MsUUB8mQ8qbqVrOtCng%40mail.gmail.com
> <https://groups.google.com/d/msgid/golang-nuts/CALoEmQz%3DsJeq%3DpiDKQaa-uyTFXUsN69MsUUB8mQ8qbqVrOtCng%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
>
>

-- 

*Michael T. jonesmichael.jo...@gmail.com <michael.jo...@gmail.com>*

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CALoEmQz_vw_4NF56Wd4yg_Q063Y5Ls%3DPJnUpE9-c47mXcD%3DR-Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to