Hi Milan,

On Mon, Jul 23, 2018 at 08:41:03AM +0200, Milan Petruzelka wrote:
> After weekend CLOSE_WAIT connections are still there.

Ah bad :-(

> What
> does cflg=0x80203300 in "show fd" mean?

These are the connection flags. You can figure them with contrib/debug/flags :

$ ./flags 0x80203300 | grep ^conn
conn->flags = CO_FL_XPRT_TRACKED | CO_FL_CONNECTED | CO_FL_ADDR_TO_SET | 
CO_FL_ADDR_FROM_SET | CO_FL_XPRT_READY | CO_FL_CTRL_READY

So basically it says that everything is configured on the connection and
that there is no request for polling (CO_FL_{CURR,SOCK,XPRT}_{WR,RD}_ENA).

> FDs with cflg=0x80203300 are either
> CLOSE_WAIT or "sock - protocol: TCP" - see FDs 14, 15, 16, 18, 19 and 25 in
> following dumps. And - sockets in lsof state "sock - protocol: TCP" can't
> be found in netstat.

That totally makes sense. If the connection is not monitored at all (but
why, this is the question), and has no timeout, definitely it will wander
forever.

Do you *think* that you got less CLOSE_WAITs or that the latest fixes
didn't change anything ? I suspect that for some reason you might be
hit by several bugs, which is what has complicated the diagnostic, but
that's just pure guess.

> 
> SHOW FD 3300
>      14 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x23d0340
> iocb=0x4d4c90(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
> fe=fe-http mux=H2 mux_ctx=0x2494cc0
(...)

If you run this with the latest 1.8 (not just the two patches above), some
extra debug information is provided in show fd :

$ echo show fd | socat - /tmp/sock1 | grep -i mux=H2
     19 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 
owner=0x7ffff7fd8e80 iocb=0x58cb44(conn_fd_handler) tmask=0x1 umask=0x0 
cflg=0x80201306 fe=httpgw mux=H2 mux_ctx=0x7ffff7f8ff10 st0=2 flg=0x00000000 
nbst=1 nbcs=1 fctl_cnt=0 send_cnt=0 tree_cnt=1 orph_cnt=0 dbuf=0/0 mbuf=0/0

In particular, st0, the mux flags, the number of streams and the buffer
states will give important information about what state the connection
is in (and whether it still has streams attached or not).

Oh I'm just seeing you already did that in the next e-mail. Thank you :-)

So we have this :

     25 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x24f0a70     
iocb=0x4d34c0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300              
fe=fe-http mux=H2 mux_ctx=0x258a880 st0=7 flg=0x00001000 nbst=8 nbcs=0          
fctl_cnt=0 send_cnt=8 tree_cnt=8 orph_cnt=8 dbuf=0/0 mbuf=0/16384               

  - st0=7 => H2_CS_ERROR2 : an error was sent, either it succeeded or
    could not be sent and had to be aborted nonetheless ;

  - flg=1000 => H2_CF_GOAWAY_SENT : the GOAWAY frame was sent to the mux
    buffer.

  - nbst=8 => 8 streams still attached

  - nbcs=0 => 0 conn_streams found (application layer detached or not
    attached yet)

  - send_cnt=8 => 8 streams still in the send_list, waiting for the mux
    to pick their contentx.

  - tree_cnt=8 => 8 streams known in the tree (hence they are still valid
    from the H2 protocol perspective)

  - orph_cnt=8 => 8 streams are orphaned : these streams have quit at the
    application layer (very likely a timeout).

  - mbuf=0/16384 : the mux buffer is empty but allocated. It's not very
    common.

At this point what it indicates is that :
  - 8 streams were active on this connection and a response was sent (at
    least partially) and probably waited for the mux buffer to be empty
    due to data from other previous streams. I'm realising it would be
    nice to also report the highest stream index to get an idea of the
    number of past streams on the connection.

  - an error happened (protocol error, network issue, etc, no more info
    at the moment) and caused haproxy to emit a GOAWAY frame. While doing
    so, the pending streams in the send_list were not destroyed.

  - then for an unknown reason the situation doesn't move anymore. I'm
    realising that one case I figured in the past with an error possibly
    blocking the connection at least partially covers one point here, it
    causes the mux buffer to remain allocated, so this patch would have
    caused it to be released, but it's still incomplete.

Now I have some elements to dig through, I'll try to mentally reproduce
the complex sequence of a blocked response with a GOAWAY being sent at
the same time to see what happens.

Thank you very much for all these information!
Willy

Reply via email to