I actually just managed to track down the root cause of the bug, and it's
quite surprising. It's not the heap blowing up at all; it's a stack overflow,
caused by a bug in the stdlib! Specifically, adding the same
httptrace.ClientTrace twice onto a request context causes infinite recursion
and overflows the stack.

https://github.com/mightyguava/tracereflectbug/blob/480217947e3729a3fa833f6ece8bcf0d8012aaa3/main.go#L20-L22
<https://github.com/mightyguava/tracereflectbug/blob/master/main.go#L36-L38>
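
In case you don't want to click through, here's a minimal sketch of the
failure mode (the linked main.go has the actual repro; the hook body here is
just a placeholder):

package main

import (
	"context"
	"net/http/httptrace"
)

func main() {
	// One trace value, reused for both calls below.
	trace := &httptrace.ClientTrace{
		GetConn: func(hostPort string) {},
	}

	ctx := context.Background()
	ctx = httptrace.WithClientTrace(ctx, trace) // fine: nothing to compose with
	ctx = httptrace.WithClientTrace(ctx, trace) // composes the trace's hooks with themselves

	// Any composed hook now calls itself until the stack overflows. In a
	// real program the HTTP transport fires these hooks during a request.
	httptrace.ContextClientTrace(ctx).GetConn("example.com:443")
}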

This explains everything we've been seeing in the memory/gc logs.

The story goes as follows:

There was a bug in my code for adding an HTTP trace (copied earlier in this
thread). I added the trace to the request in the AWS SDK client's Send
handler, which is triggered on every retry, not just the first attempt. On
rare occasions (every few hours, apparently), a network issue or a DynamoDB
backend issue causes retries in multiple client goroutines at the same time.
Since multiple goroutines are growing runaway stacks simultaneously, the
service gets OOM-killed before any single stack reaches Go's 1000000000-byte
limit, so we never see a stack overflow error.
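
The handler looked roughly like this. This is a hypothetical reconstruction
(the real snippet is earlier in the thread), written against the aws-sdk-go
v1 handler API:

package main

import (
	"net/http/httptrace"

	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// addHTTPTrace is a hypothetical reconstruction of the buggy instrumentation.
func addHTTPTrace(ddb *dynamodb.DynamoDB) {
	// A single trace value shared by every attempt of every request.
	trace := &httptrace.ClientTrace{
		GetConn: func(hostPort string) { /* record connection timing */ },
	}

	// Send handlers run once per attempt, so on a retry the same trace is
	// added again to a request context that already carries it, which is
	// exactly what trips the stdlib recursion above.
	ddb.Handlers.Send.PushFront(func(r *request.Request) {
		r.SetContext(httptrace.WithClientTrace(r.Context(), trace))
	})
}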

This explains the relatively small increase in heap and the much larger
increase in RSS. I bet that if I had actually logged StackInuse and StackSys
from runtime.MemStats, they would have shown an increase much closer to that
of RSS.
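
For posterity, a sketch of the kind of logging that would have made this
jump out (hypothetical, not something we currently run):

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Log stack usage next to heap usage so runaway stack growth shows up
	// alongside RSS in the metrics.
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_inuse=%d MiB stack_inuse=%d MiB stack_sys=%d MiB sys=%d MiB",
			m.HeapInuse>>20, m.StackInuse>>20, m.StackSys>>20, m.Sys>>20)
	}
}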

Bug filed: https://github.com/golang/go/issues/32925. That's the second
stdlib bug I've hit while building my little CRUD HTTP server...

Thanks all for helping with the debugging!

On Wed, Jul 3, 2019 at 11:34 AM Tom Mitchell <mi...@niftyegg.com> wrote:

>
> On Mon, Jul 1, 2019 at 12:42 PM 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>
>> I'm trying to debug OOM on a service we are running in k8s. The service
>> is just a CRUD server hitting a database (DynamoDB). Each replica serves
>> about 300 qps of traffic. There are no memory leaks. On occasion (seemingly
>> correlated to small latency spikes on the backend), the service would OOM.
>> This is surprising because it has a circuit breaker that drops requests
>> after 200 concurrent connections, which has never tripped, and goroutine
>> profiles confirm that there are nowhere near 200 active goroutines.
>>
>
> Just curious about the network connections.
> Is there a chance that the network connections are not getting closed and
> cleaned up for some reason? It was common for sockets to hang around in
> the thousands because a user killed a slow tab or the browser and the full
> socket close never completed. The solution was to allow reliable
> connections to time out and finish closing, freeing up the memory. The
> application has closed the socket, but the protocol has yet to get the
> last packet that completes the handshake. The shell equivalent would be
> zombie processes that still need to return an exit status while no process
> waits on it. Debugging can be interesting in the shell case because of the
> implied waits done by ps.
>
> How many connections does the system kernel think there are, and what
> state are they in? Look both locally and on the DB machine. The latency
> spikes can be a cause or a symptom.
> Look at the connections being made to the CRUD server and make sure they
> are set up with short enough timers that they clean themselves up quickly.
> Is the CRUD server at risk of a denial of service, or a burst of random
> curious probes from an nmap script? Even firewall drops, near or far, can
> leave connections hanging in an incomplete state when an invalid
> connection is detected and blocked and long-timer reliable network
> connections are involved.

-- 
Yunchi
