Hi Willy, I'm back at work after 2 weeks on the beach in Dalmatia. I've patched my Haproxy 1.8.11 with all three patches discussed here in last two weeks. It didn't help. Then tried to run Haproxy with debug enabled. The last logs from FD hanging in CLOSE_WAIT looks like this:
00020535:frntend.accept(0006)=000b from [ip:58321] ALPN=h2 00020535:frntend.clireq[000b:ffffffff]: POST /some/uri HTTP/1.1 00020535:backend.srvrep[000b:0015]: HTTP/1.1 200 OK 00020535:backend.srvcls[000b:adfd] 00020536:frntend.clicls[000b:ffffffff] 00020536:frntend.closed[000b:ffffffff] 00020537:frntend.clicls[000b:ffffffff] 00020537:frntend.closed[000b:ffffffff] 00020538:frntend.clicls[000b:ffffffff] 00020538:frntend.closed[000b:ffffffff] 00020516:backend.srvcls[000b:adfd] 0002051a:backend.srvcls[000b:adfd] 00020514:backend.srvcls[000b:adfd] 00020514:backend.clicls[000b:adfd] 00020514:backend.closed[000b:adfd] 00020516:backend.clicls[000b:adfd] 00020516:backend.closed[000b:adfd] 0002051a:backend.clicls[000b:adfd] 0002051a:backend.closed[000b:adfd] 0002051d:backend.srvcls[000b:adfd] 0002051d:backend.clicls[000b:adfd] 0002051d:backend.closed[000b:adfd] 0002051e:backend.clicls[000b:adfd] 0002051e:backend.closed[000b:adfd] 0002051f:backend.srvcls[000b:adfd] 0002051f:backend.clicls[000b:adfd] 0002051f:backend.closed[000b:adfd] 00020528:backend.srvcls[000b:adfd] 00020528:backend.clicls[000b:adfd] 00020528:backend.closed[000b:adfd] 00020529:backend.srvcls[000b:adfd] 00020529:backend.clicls[000b:adfd] 00020529:backend.closed[000b:adfd] 0002052a:backend.srvcls[000b:adfd] 0002052a:backend.clicls[000b:adfd] 0002052a:backend.closed[000b:adfd] 0002052b:backend.srvcls[000b:adfd] 0002052b:backend.clicls[000b:adfd] 0002052b:backend.closed[000b:adfd] 00020535:backend.clicls[000b:adfd] 00020535:backend.closed[000b:adfd] I decided to add some quick'n'dirty debug messages into mux_h2.c to see more details. I modified send_log function to write to console and used it for digging inside h2 mux. I'm not attaching the full patch, because it's ugly and nothing to be proud of. Just an example of patched function to get the feeling. static inline void h2s_close(struct h2s *h2s) { send_log(NULL, LOG_NOTICE, "mpeh2 fd%d h2s_close - id%08x st%02x fl%08x streams:%d\n", mpeh2_h2s_fd(h2s), mpeh2_h2s_id(h2s), mpeh2_h2s_st(h2s), mpeh2_h2s_flags(h2s), h2s->h2c->nb_streams); if (h2s->st != H2_SS_CLOSED) h2s->h2c->nb_streams--; h2s->st = H2_SS_CLOSED; } This is how successful connection looks like: 20180628.1614 mpeh2 fd17 h2c_frt_init 20180628.1614 mpeh2 fd17 h2c_stream_new - id00000001 st00 fl00000000 20180628.1614 0000004c:frntend.accept(0006)=0011 from [some_ip:5059] ALPN=h2 20180628.1614 0000004c:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.1614 0000004c:backend.srvrep[0011:0013]: HTTP/1.1 200 OK 20180628.1614 0000004c:backend.srvcls[0011:adfd] 20180628.1614 mpeh2 fd17 h2s_close - id00000001 st04 fl00003001 streams:1 20180628.1614 0000004f:frntend.clicls[0011:ffffffff] 20180628.1614 0000004f:frntend.closed[0011:ffffffff] 20180628.1614 mpeh2 fd17 h2s_destroy - id00000001 st07 fl00003003 20180628.1614 mpeh2 fd17 h2s_close - id00000001 st07 fl00003003 streams:0 20180628.1614 mpeh2 fd17 h2c_stream_new - id00000003 st00 fl00000000 20180628.1614 00000056:frntend.accept(0006)=0011 from [some_ip:5059] ALPN=h2 20180628.1614 00000056:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.1614 00000056:backend.srvrep[0011:0013]: HTTP/1.1 200 OK 20180628.1614 00000056:backend.srvcls[0011:adfd] 20180628.1614 mpeh2 fd17 h2s_close - id00000003 st04 fl00003001 streams:1 20180628.1614 00000057:frntend.clicls[0011:ffffffff] 20180628.1614 00000057:frntend.closed[0011:ffffffff] 20180628.1614 mpeh2 fd17 h2s_destroy - id00000003 st07 fl00003003 20180628.1614 mpeh2 fd17 h2s_close - id00000003 st07 fl00003003 streams:0 20180628.1614 mpeh2 fd17 h2c_stream_new - id00000005 st00 fl00000000 20180628.1614 00000059:frntend.accept(0006)=0011 from [some_ip:5059] ALPN=h2 20180628.1614 00000059:frntend.clireq[0011:ffffffff]: POST /some/uri HTTP/1.1 20180628.1614 00000059:backend.srvrep[0011:0013]: HTTP/1.1 200 OK 20180628.1614 00000059:backend.srvcls[0011:adfd] 20180628.1614 mpeh2 fd17 h2s_close - id00000005 st04 fl00003101 streams:1 20180628.1614 0000005b:frntend.clicls[0011:ffffffff] 20180628.1614 0000005b:frntend.closed[0011:ffffffff] 20180628.1614 mpeh2 fd17 h2s_destroy - id00000005 st07 fl00003103 20180628.1614 mpeh2 fd17 h2s_close - id00000005 st07 fl00003103 streams:0 20180628.1614 mpeh2 fd17 h2c_stream_new - id00000007 st00 fl00000000 20180628.1614 0000005d:frntend.accept(0006)=0011 from [some_ip:5059] ALPN=h2 20180628.1614 0000005d:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.1614 0000005d:backend.srvrep[0011:0013]: HTTP/1.1 200 OK 20180628.1614 0000005d:backend.srvcls[0011:adfd] 20180628.1614 mpeh2 fd17 h2s_close - id00000007 st04 fl00003001 streams:1 20180628.1614 0000005e:frntend.clicls[0011:ffffffff] 20180628.1614 0000005e:frntend.closed[0011:ffffffff] 20180628.1614 mpeh2 fd17 h2s_destroy - id00000007 st07 fl00003003 20180628.1614 mpeh2 fd17 h2s_close - id00000007 st07 fl00003003 streams:0 20180628.1615 mpeh2 fd17 h2_timeout_task/expired - st02 fl00000000 20180628.1615 mpeh2 fd17 h2c_error - st02 fl00000000 err00 20180628.1615 mpeh2 fd17 h2_release - st06 fl00000100 And this is last h2 connection in CLOSE_WAIT hanging file descriptor: 20180628.2201 mpeh2 fd17 h2c_frt_init 20180628.2201 mpeh2 fd17 h2c_stream_new - id00000001 st00 fl00000000 20180628.2201 0002b064:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 0002b064:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b064:backend.srvrep[0011:001d]: HTTP/1.1 200 OK 20180628.2201 0002b064:backend.srvcls[0011:adfd] 20180628.2201 mpeh2 fd17 h2s_close - id00000001 st04 fl00003001 streams:1 20180628.2201 0002b065:frntend.clicls[0011:ffffffff] 20180628.2201 0002b065:frntend.closed[0011:ffffffff] 20180628.2201 mpeh2 fd17 h2s_destroy - id00000001 st07 fl00003003 20180628.2201 mpeh2 fd17 h2s_close - id00000001 st07 fl00003003 streams:0 20180628.2201 mpeh2 fd17 h2c_stream_new - id00000003 st00 fl00000000 20180628.2201 0002b066:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 mpeh2 fd17 h2c_stream_new - id00000005 st00 fl00000000 20180628.2201 0002b067:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 mpeh2 fd17 h2c_stream_new - id00000007 st00 fl00000000 20180628.2201 0002b068:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 mpeh2 fd17 h2c_stream_new - id00000009 st00 fl00000000 20180628.2201 0002b069:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 0002b066:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b067:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b068:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b069:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 mpeh2 fd17 h2c_stream_new - id0000000b st00 fl00000000 20180628.2201 0002b06a:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 mpeh2 fd17 h2c_stream_new - id0000000d st00 fl00000000 20180628.2201 0002b06b:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 0002b06a:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b06b:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b066:backend.srvrep[0011:001d]: HTTP/1.1 200 OK 20180628.2201 0002b069:backend.srvrep[0011:0020]: HTTP/1.1 200 OK 20180628.2201 0002b068:backend.srvrep[0011:001f]: HTTP/1.1 200 OK 20180628.2201 0002b06b:backend.srvrep[0011:0022]: HTTP/1.1 200 OK 20180628.2201 0002b067:backend.srvrep[0011:001e]: HTTP/1.1 200 OK 20180628.2201 0002b06a:backend.srvrep[0011:0021]: HTTP/1.1 200 OK 20180628.2201 mpeh2 fd17 h2c_stream_new - id0000000f st00 fl00000000 20180628.2201 0002b06c:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 0002b06c:frntend.clireq[0011:ffffffff]: GET /some/uri HTTP/1.1 20180628.2201 0002b06c:backend.srvrep[0011:0023]: HTTP/1.1 200 OK 20180628.2201 0002b06c:backend.srvcls[0011:adfd] 20180628.2201 mpeh2 fd17 h2c_stream_new - id00000011 st00 fl00000000 20180628.2201 0002b06d:frntend.accept(0006)=0011 from [ip:57964] ALPN=h2 20180628.2201 0002b06d:frntend.clireq[0011:ffffffff]: POST /some/uri HTTP/1.1 20180628.2201 0002b06d:backend.srvrep[0011:0023]: HTTP/1.1 200 OK 20180628.2201 0002b06d:backend.srvcls[0011:adfd] 20180628.2201 mpeh2 fd17 h2s_close - id0000000f st04 fl00003001 streams:8 20180628.2201 mpeh2 fd17 h2s_close - id00000011 st04 fl00003101 streams:7 20180628.2201 mpeh2 fd17 h2c_error - st04 fl00000000 err05 20180628.2201 0002b06e:frntend.clicls[0011:ffffffff] 20180628.2201 0002b06e:frntend.closed[0011:ffffffff] 20180628.2201 mpeh2 fd17 h2s_destroy - id0000000f st07 fl00003003 20180628.2201 mpeh2 fd17 h2s_close - id0000000f st07 fl00003003 streams:6 20180628.2201 0002b06f:frntend.clicls[0011:ffffffff] 20180628.2201 0002b06f:frntend.closed[0011:ffffffff] 20180628.2201 mpeh2 fd17 h2s_destroy - id00000011 st07 fl00003103 20180628.2201 mpeh2 fd17 h2s_close - id00000011 st07 fl00003103 streams:6 20180628.2201 0002b068:backend.srvcls[0011:adfd] 20180628.2201 0002b068:backend.clicls[0011:adfd] 20180628.2201 0002b068:backend.closed[0011:adfd] 20180628.2201 0002b066:backend.srvcls[0011:adfd] 20180628.2201 0002b066:backend.clicls[0011:adfd] 20180628.2201 0002b066:backend.closed[0011:adfd] 20180628.2201 mpeh2 fd17 h2s_destroy - id00000003 st06 fl00003081 20180628.2201 mpeh2 fd17 h2s_close - id00000003 st06 fl00003081 streams:6 20180628.2201 0002b067:backend.srvcls[0011:adfd] 20180628.2201 0002b067:backend.clicls[0011:adfd] 20180628.2201 0002b067:backend.closed[0011:adfd] 20180628.2201 0002b069:backend.srvcls[0011:adfd] 20180628.2201 0002b069:backend.clicls[0011:adfd] 20180628.2201 0002b069:backend.closed[0011:adfd] 20180628.2201 mpeh2 fd17 h2s_destroy - id00000009 st06 fl00003081 20180628.2201 mpeh2 fd17 h2s_close - id00000009 st06 fl00003081 streams:5 20180628.2201 0002b06a:backend.srvcls[0011:adfd] 20180628.2201 0002b06a:backend.clicls[0011:adfd] 20180628.2201 0002b06a:backend.closed[0011:adfd] 20180628.2201 0002b06b:backend.srvcls[0011:adfd] 20180628.2201 0002b06b:backend.clicls[0011:adfd] 20180628.2201 0002b06b:backend.closed[0011:adfd] H2 conection first handles some requests in h2 streams. After destroying some it hits h2c_error with H2_ERR_STREAM_CLOSED. After that it destroys a few more, but stops with non-zero h2c.nb_streams. It doesn't close the rest of h2 streams and doesn't reach the h2_release. I've added more debug into h2s_close to see not only h2s state and flags but also h2c state and flags. My only way to reproduce the bug is to let Haproxy run until some of its FD falls into CLOSE_WAIT. After I catch some, I'll report back. Please feel free to suggest where else in mux_h2.c should I add debugs to see why h2_release is not reached. Best Regards, Milan On Fri, 15 Jun 2018 at 11:21, Willy Tarreau <w...@1wt.eu> wrote: > On Fri, Jun 15, 2018 at 11:19:04AM +0200, Janusz Dziemidowicz wrote: > > 2018-06-14 19:49 GMT+02:00 Willy Tarreau <w...@1wt.eu>: > > > On Thu, Jun 14, 2018 at 07:22:34PM +0200, Janusz Dziemidowicz wrote: > > >> 2018-06-14 18:56 GMT+02:00 Willy Tarreau <w...@1wt.eu>: > > >> > > >> > If you'd like to run a test, I'm attaching the patch. > > >> > > >> Sure, but you forgot to attach it :) > > > > > > Ah, that's because I'm stupid :-) > > > > > > Here it comes this time. >