Hi Luke,

On Wed, Jan 23, 2019 at 10:47:33AM +0000, Luke Seelenbinder wrote:
> We were using http-reuse always and experiencing this
> issue (as well as getting 80+% connection reuse). When I scaled it back to
> http-reuse safe, the frequency of this issue seemed to be much lower.
> (Perhaps because the bulk of my testing was with one client and somewhat
> unscientific?)

It could be caused by various things. In my tests the client doesn't even
use keep-alive, so haproxy is less aggressive with connection reuse, and
that could explain some of the differences.
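
For readers following along, the two reuse modes mentioned above differ
by a single backend keyword. A minimal illustrative backend (the names
and addresses are made up, not from Luke's config):

    backend app
        # "always" reuses connections most aggressively; "safe" only
        # reuses them for requests that can safely be retried
        http-reuse safe
        server s1 192.0.2.10:8080 proto h2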

> > Thus it
> > definitely is a matter of bad interaction between two streams, or one
> > stream affecting the connection and hurting the other stream.
> 
> My debugging spidery-sense points to the same thing.

So I have more info now. There are several issues which stack up and
cause this:
  - the GOAWAY frame indicating the last stream id may still be in
    flight while many more streams are being added; this results in
    batches of streams dying at once when the limit is crossed;

  - the last stream ID received in the GOAWAY frame was not taken into
    account when calculating the number of available streams, so more
    streams could be created than the server was willing to accept
    (see the sketch after this list);

  - there is an issue with how new streams are attached to idle
    connections, which makes them non-retryable in case of a failure
    such as the one above. I managed to fix this but it still requires
    some testing with other configs;

  - another issue affects idle connections: some of them could remain
    in the idle list even though they had no room left, because a
    connection is only removed when it delivers its last stream, so the
    check doesn't handle jumps in the number of available streams (also
    covered by the sketch below). I suspect this one is related to the
    client aborts that cause the server aborts, simply because it
    allowed excess streams to be sent to a mux which had no room
    anymore, but I could be wrong;
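
To make the second and fourth points concrete, here is a minimal sketch
in C of the intended accounting. This is purely illustrative and not
haproxy's actual code; all the names (h2c, next_sid, goaway_sid, etc.)
are made up. The available-stream count is clamped by the GOAWAY last
stream id, and a connection is evicted from the idle list as soon as
that count drops to zero, rather than only when it hands out its very
last stream:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative connection state; all field names are made up. */
    struct h2c {
        uint32_t max_concurrent;  /* peer's SETTINGS_MAX_CONCURRENT_STREAMS */
        uint32_t nb_streams;      /* streams currently open */
        uint32_t next_sid;        /* next client stream id (odd, +2 each) */
        uint32_t goaway_sid;      /* last stream id from GOAWAY, 0 if none */
        struct h2c *idle_next;    /* idle-list link */
    };

    /* How many new streams this connection may still carry. */
    static uint32_t h2c_avail_streams(const struct h2c *c)
    {
        uint32_t avail, ids_left;

        if (c->nb_streams >= c->max_concurrent)
            return 0;
        avail = c->max_concurrent - c->nb_streams;

        /* After a GOAWAY the server only processes ids <= goaway_sid,
         * so also clamp by the number of usable ids left below it.
         */
        if (c->goaway_sid) {
            if (c->next_sid > c->goaway_sid)
                return 0;
            ids_left = (c->goaway_sid - c->next_sid) / 2 + 1;
            if (ids_left < avail)
                avail = ids_left;
        }
        return avail;
    }

    /* Pick a connection from the idle list, evicting any that has no
     * room left, so that a sudden jump to zero (e.g. after a GOAWAY)
     * cannot leave a full connection selectable.
     */
    static struct h2c *h2c_pick_idle(struct h2c **head)
    {
        while (*head) {
            struct h2c *c = *head;

            if (h2c_avail_streams(c) == 0) {
                *head = c->idle_next;  /* unlink: no room anymore */
                c->idle_next = NULL;
                continue;
            }
            return c;
        }
        return NULL;
    }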

And a less important one: the maximum number of concurrent streams per
connection is global. In this case it's 100, which is lower than nginx's
128, so it doesn't cause any issue here. But we could run into problems
with this, so I must address it and make it per-connection.
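
A hedged sketch of that change, reusing the illustrative h2c structure
above: the value received in the peer's SETTINGS frame is stored on the
connection instead of overwriting a single process-wide variable:

    /* before (roughly): one process-wide value, overwritten by every
     * peer, so the last SETTINGS frame received wins for everyone */
    static uint32_t h2_max_concurrent;

    /* after: each connection keeps the limit its own peer advertised */
    static void h2c_apply_settings(struct h2c *c, uint32_t max_streams)
    {
        c->max_concurrent = max_streams;
    }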

With all these changes, I managed to run a long test with no more errors,
and only an occasional immediate retry when nginx announced the GOAWAY
too late. When we set the limit ourselves, there isn't even any retry
anymore. So I'll continue to work on this and we'll slightly delay 1.9.3
to collect these fixes. From there we'll be able to see whether you still
have problems, and iterate.
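
For completeness, and assuming the global tune knob is the limit in
question here (that's my reading, not something stated explicitly in
this thread), setting it ourselves below the server's announced value
would look like:

    global
        # illustrative: keep our per-connection stream limit below the
        # value the server announces (nginx defaults to 128)
        tune.h2.max-concurrent-streams 100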

> Let me know if you want me to share our config (it's quite complex) with you
> privately or if there's anything else we can do to assist.

That's kind, but I don't need it anymore; it seems I have everything
needed to reproduce the whole issue.

Thanks,
Willy
