On Wed, Jun 13, 2018 at 08:01:30PM +0200, Janusz Dziemidowicz wrote:
> 2018-06-13 19:14 GMT+02:00 Willy Tarreau <w...@1wt.eu>:
> > On Wed, Jun 13, 2018 at 07:06:58PM +0200, Janusz Dziemidowicz wrote:
> >> 2018-06-13 14:42 GMT+02:00 Willy Tarreau <w...@1wt.eu>:
> >> > Hi Milan, hi Janusz,
> >> >
> >> > thanks to your respective traces, I may have come up with a possible
> >> > scenario explaining the CLOSE_WAIT you're facing. Could you please
> >> > try the attached patch ?
> >>
> >> Unfortunately there is no change for me. CLOSE_WAIT sockets still
> >> accumulate if I switch native h2 on. Milan should probably double
> >> check this though.
> >> https://pasteboard.co/HpJj72H.png
> >
> > :-(
> >
> > With still the same perfectly straight line, really making me think of
> > either a periodic activity which I'm unable to guess or model, or
> > something related to our timeouts.
> 
> It is not exactly straight. While it looks like this for a short test,
> when I did this earlier, over a much longer period of time, it was
> slowing down during the night, when I have less traffic.

OK but there's definitely something very stable in this.

> >> I'll try move some low traffic site to a separate instance tomorrow,
> >> maybe I'll be able to capture some traffic too.
> >
> > Unfortunately with H2 that will not help much, there's the TLS layer
> > under it that makes it a real pain. TLS is designed to avoid observability
> > and it does it well :-/
> >
> > I've suspected a received shutdown at the TLS layer, which I was not
> > able to model at all. Tools are missing at this point. I even tried
> > to pass the traffic through haproxy in TCP mode to help but I couldn't
> > reproduce the problem.
> 
> When I disable native h2 in haproxy I switch back to tcp mode going
> through nghttpx. The traffic is obviously the same, yet there is no
> problem.

I'm not surprised since it doesn't use our H2 engine anymore :-) In my
case the purpose was to try to abuse haproxy's TCP proxy to modify the
behaviour before the traffic reaches its H2 engine.

> > It could possibly help if you can look for the affected client's IP:port
> > in your logs to see if they are perfectly normal or if you notice they
> > have something in common (eg: always the exact same requests, or they
> > never made a request from the affected connections, etc).
> 
> I'm aware of the problems :) However, if I can get some traffic dumps,
> knowing my application I might be able to reproduce this, which would
> be a huge win. I've already tried some experiments with various tools
> with no luck unfortunately.

Yep, it's really not easy, and once we find it I'll probably be ashamed,
saying "I thought this code was not merged"... By the way, yesterday I
found another suspect, but I'm undecided on it; the current architecture
of the H2 mux complicates the code analysis. If you want to give it a
try on top of the previous one, I'd appreciate it, even if it doesn't
change anything. Please find it attached.

Thanks,
willy
From f055296f7598ba84b08e78f8309ffd7fa0c9522b Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Wed, 13 Jun 2018 09:19:29 +0200
Subject: WIP: h2: try to address possible causes for the close_wait issues

It is uncertain whether certain errors could prevent pending outgoing
data from being emitted, and thus attached streams from being released.
Indeed, for h2_release() to be called, the mux buffer must be empty or
an error must have been met. A clean shutdown does not constitute an
error, and it's likely that refraining from sending may prevent the
buffer from flushing, so we may end up with data stuck in the buffer
forever.

The timeout task should normally take care of this though. It's worth
noting that if there's no more stream and the output buffer is empty
on wake(), the task's timeout is eternity.

This fix should be backported to 1.8.
---
 src/mux_h2.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/src/mux_h2.c b/src/mux_h2.c
index bc33ce2..2f2384d 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -2271,14 +2271,6 @@ static int h2_wake(struct connection *conn)
                        h2_release(conn);
                        return -1;
                }
-               else {
-                       /* some streams still there, we need to signal them all and
-                        * wait for their departure.
-                        */
-                       __conn_xprt_stop_recv(conn);
-                       __conn_xprt_stop_send(conn);
-                       return 0;
-               }
        }
 
        if (!h2c->dbuf->i)
@@ -2294,6 +2286,7 @@ static int h2_wake(struct connection *conn)
 
        /* adjust output polling */
        if (!(conn->flags & CO_FL_SOCK_WR_SH) &&
+           h2c->st0 != H2_CS_ERROR2 && !(h2c->flags & H2_CF_GOAWAY_FAILED) &&
            (h2c->st0 == H2_CS_ERROR ||
             h2c->mbuf->o ||
             (h2c->mws > 0 && !LIST_ISEMPTY(&h2c->fctl_list)) ||
-- 
1.7.12.1
