Hi Milan,

On Tue, Jul 24, 2018 at 12:23:37PM +0200, Milan Petruzelka wrote:
> Hi Willy,
> 
> Do you *think* that you got less CLOSE_WAITs or that the latest fixes
> > didn't change anything ? I suspect that for some reason you might be
> > hit by several bugs, which is what has complicated the diagnostic, but
> > that's just pure guess.
> >
> >
> I'm not sure. I left patched haproxy running over last weekend. I have a
> slight feeling that there was less hanging connections than over last week,
> but it could be because of lower weekend traffic.

OK, it's not that obvious at least.

> Now i'm running latest
> GIT version (1.8.12-5e100b-15) and i'll compare the speed of blocked
> connections increase against vannila 1.8.12 from last week.
> 
> I'll send more extended "show fd" dumps as soon as i catch some more.
> Please let me know If you add more h2 state info into 1.8 git version and
> i'll run it in production to get more info.

So I'm having one update to emit the missing info on "show fd" (patch merged
and pushed already, that I'm attaching here if it's easier for you), and
two other ones which fix a situation which can definitely cause this to
happen, but which I still fail to reproduce using the various tools I
have (curl, h2c, nghttp, ...). The case is the following and derives from
your previous capture :

  - multiple streams fight for the mux and are queued in the send_list
  - at this point the mux has to emit a GOAWAY for a reason that's still
    to be figured (probably it received a bad message but we could guess
    anything)
  - the streams are woken up, notified about the error
  - h2_detach() is called for each of them
  - they are detached from the h2 stream (struct h2s technically
    speaking, which is the internal representation of the h2 stream
    state)
  - since the streams are marked as blocked for some room, they are orphaned
    and nothing more is done on them.
  - at this point, any activity on the connection goes through h2_wake()
    which sees the conneciton in ERROR2 state, tries again to release the
    streams, cannot, and stops polling.

=> from this point, no more events can be received on the connection, and
   the streams remain orphaned forever. This case was partially addressed
   by the patch I sent a while ago but I wasn't completely sure about the
   sequence that could lead to this. I'm attaching this patch (0001-WIP-*).

However this patch alone (0001-WIP-*) cannot prevent the streams from being
orphaned (6th step above) so while it's needed, it's not enough. The third
one is required for this (h2-error.diff). 

And I *think* (and hope) that with these 2 patches on top of latest 1.8
we're OK now. What I would appreciate quite a lot if you're willing to
let me abuse your time is to either git pull or apply
0001-MINOR-h2-add-the-error-code-and-the-max-last-stream-.patch on top
of your up-to-date branch, then apply
0001-WIP-h2-try-to-address-possible-causes-for-the-close_.patch then
apply h2-error.diff and test again. The last two will apply with an
offset but that's not a problem.

If you still see some close_waits (I really hope not), it means they're
caused by something totally different and then "show fd" will help better
than previously thanks to the extra info in the first patch.

I think we're about to nail it down, to be honest ;-)

Thanks!
Willy
>From 12a4b5c69d3eb0cc01bf92df215c7cc8d70ee8bc Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 24 Jul 2018 14:12:42 +0200
Subject: MINOR: h2: add the error code and the max/last stream IDs to "show
 fd"

This is intented to help debugging H2 in field.

(cherry picked from commit 616ac81dec5759990ee600047d8ad900f6eba6e8)
[wt: adapted context a little bit]
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 src/mux_h2.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/mux_h2.c b/src/mux_h2.c
index 7f14de4..95effb9 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -3515,8 +3515,11 @@ static void h2_show_fd(struct chunk *msg, struct 
connection *conn)
                node = eb32_next(node);
        }
 
-       chunk_appendf(msg, " st0=%d flg=0x%08x nbst=%u nbcs=%u fctl_cnt=%d 
send_cnt=%d tree_cnt=%d orph_cnt=%d dbuf=%u/%u mbuf=%u/%u",
-                     h2c->st0, h2c->flags, h2c->nb_streams, h2c->nb_cs, 
fctl_cnt, send_cnt, tree_cnt, orph_cnt, h2c->dbuf->i, h2c->dbuf->size, 
h2c->mbuf->o, h2c->mbuf->size);
+       chunk_appendf(msg, " st0=%d err=%d maxid=%d lastid=%d flg=0x%08x 
nbst=%u nbcs=%u"
+                     " fctl_cnt=%d send_cnt=%d tree_cnt=%d orph_cnt=%d 
dbuf=%u/%u mbuf=%u/%u",
+                     h2c->st0, h2c->errcode, h2c->max_id, h2c->last_sid, 
h2c->flags,
+                     h2c->nb_streams, h2c->nb_cs, fctl_cnt, send_cnt, 
tree_cnt, orph_cnt,
+                     h2c->dbuf->i, h2c->dbuf->size, h2c->mbuf->o, 
h2c->mbuf->size);
 }
 
 /*******************************************************/
-- 
1.7.12.1

>From f055296f7598ba84b08e78f8309ffd7fa0c9522b Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Wed, 13 Jun 2018 09:19:29 +0200
Subject: WIP: h2: try to address possible causes for the close_wait issues

It is uncertain whether certain errors could prevent pending outgoing
data from being emitted, and from releasing attached streams. Indeed,
for h2_release() to be called, the mux buf must be empty or an error
must have been met. A clean shutdown will not constitute an error and
it's likely that refraining from sending may prevent the buffer from
flushing. Thus maybe we can end up with data forever in the buffer.

The timeout task should normally take care of this though. It's worth
noting that if there's no more stream and the output buffer is empty
on wake(), the task's timeout is eternity.

This fix should be backported to 1.8.
---
 src/mux_h2.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/src/mux_h2.c b/src/mux_h2.c
index bc33ce2..2f2384d 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -2271,14 +2271,6 @@ static int h2_wake(struct connection *conn)
                        h2_release(conn);
                        return -1;
                }
-               else {
-                       /* some streams still there, we need to signal them all 
and
-                        * wait for their departure.
-                        */
-                       __conn_xprt_stop_recv(conn);
-                       __conn_xprt_stop_send(conn);
-                       return 0;
-               }
        }
 
        if (!h2c->dbuf->i)
@@ -2294,6 +2286,7 @@ static int h2_wake(struct connection *conn)
 
        /* adjust output polling */
        if (!(conn->flags & CO_FL_SOCK_WR_SH) &&
+           h2c->st0 != H2_CS_ERROR2 && !(h2c->flags & H2_CF_GOAWAY_FAILED) &&
            (h2c->st0 == H2_CS_ERROR ||
             h2c->mbuf->o ||
             (h2c->mws > 0 && !LIST_ISEMPTY(&h2c->fctl_list)) ||
-- 
1.7.12.1

diff --git a/src/mux_h2.c b/src/mux_h2.c
index 7f14de4..a4db89d 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -2512,6 +2512,7 @@ static void h2_detach(struct conn_stream *cs)
         * an ES or RST frame), so orphan it in this case.
         */
        if (!(cs->conn->flags & CO_FL_ERROR) &&
+           (h2c->st0 < H2_CS_ERROR) &&
            (h2s->flags & (H2_SF_BLK_MBUSY | H2_SF_BLK_MROOM | 
H2_SF_BLK_MFCTL)))
                return;
 

Reply via email to