Hi Luke,

On Wed, Jan 23, 2019 at 10:47:33AM +0000, Luke Seelenbinder wrote:
> We were using http-reuse always and experiencing this
> issue (as well as getting 80+% connection reuse). When I scaled it back to
> http-reuse safe, the frequency of this issue seemed to be much lower.
> (Perhaps because the bulk of my testing was with one client and somewhat
> unscientific?)
It could be caused by various things. In my tests the client doesn't even
use keep-alive, so haproxy is less aggressive with connection reuse, and
that could explain some differences.

> > Thus it
> > definitely is a matter of bad interaction between two streams, or one
> > stream affecting the connection and hurting the other stream.
>
> My debugging spidery-sense points to the same thing.

So I have more info now. There are multiple issues which stack up and
cause this:

  - the GOAWAY frame indicating the last stream ID might be in flight
    while many more streams have been added. This results in batch deaths
    once the limit is met;

  - the last stream ID received in the GOAWAY frame was not considered
    when calculating the number of available streams, leading to more
    streams being created than the server would accept;

  - there is an issue with how new streams are attached to idle
    connections, making them non-retryable in case of a failure such as
    above. I managed to fix this but it still requires some testing with
    other configs;

  - another issue affects idle connections: some of them could remain in
    the idle list while they don't have room anymore, because they are
    removed only when they deliver the last stream, so the check doesn't
    support jumps in the number of available streams. I suspect it could
    be related to the client aborts that cause server aborts, just
    because it allowed some excess streams to be sent to a mux which
    doesn't have room anymore, but I could be wrong.

And a less important one: the maximum number of concurrent streams per
connection is global. In this case it's 100, so it's lower than nginx's
128 and thus doesn't cause any issue. But we could run into problems with
this, and I must address it to make it per-connection.

With all these changes, I managed to run a long test with no more errors
and only an immediate retry once in a while when nginx announced the
GOAWAY too late. When we set the limit ourselves, there's not even any
retry anymore.
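For illustration, here's a rough sketch of the second point above: how the
last stream ID from a received GOAWAY should cap the number of streams still
usable on a connection, on top of the peer's SETTINGS_MAX_CONCURRENT_STREAMS
limit. All names here are hypothetical, this is not haproxy's actual mux_h2
code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch (not haproxy's real mux code) of computing how many
 * more streams may still be opened on an H2 connection. Client-initiated
 * stream IDs are odd and increase by 2, so a GOAWAY carrying last_sid
 * leaves room for at most (last_sid - next_sid) / 2 + 1 new streams,
 * regardless of the concurrency limit advertised in SETTINGS. */
static int32_t h2_avail_streams(uint32_t max_concurrent, /* peer's SETTINGS limit */
                                uint32_t cur_streams,    /* streams currently open */
                                uint32_t next_sid,       /* next stream ID we'd allocate (odd) */
                                int64_t goaway_last_sid) /* last_sid from GOAWAY, -1 if none */
{
	int64_t avail = (int64_t)max_concurrent - cur_streams;

	if (goaway_last_sid >= 0) {
		/* streams with an ID above last_sid will be refused by the peer */
		int64_t by_goaway = (goaway_last_sid - (int64_t)next_sid) / 2 + 1;

		if (by_goaway < avail)
			avail = by_goaway;
	}
	return avail > 0 ? (int32_t)avail : 0;
}
```

Without taking goaway_last_sid into account, a connection announcing 100
concurrent streams still looks wide open even after a GOAWAY, which is
exactly what lets the excess streams get batched onto a dying connection.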
Thus I'll continue to work on this, and we'll slightly delay 1.9.3 to
collect these fixes. From there we'll be able to see if you still have
problems and iterate.

> Let me know if you want me to share our config (it's quite complex) with you
> privately or if there's anything else we can do to assist.

That's kind, but I don't need it anymore; it seems I have everything
needed to reproduce the whole issue.

Thanks,
Willy