Hi All,

I've raised this before at a meetup and on the dev Slack, but I'd
like to raise it again after a more thorough review on my part.
HTTP/2 seems to struggle with streaming large responses relative to
HTTP/1. I was hoping the problem would "go away" with the latest
versions, but I can reproduce the same slowness and occasional
stalling that we saw with Solr 9.x/Jetty 9.x when running very
recent Solr main and Jetty 12.

After observing the same issues, I decided to do a deeper dive on
HTTP/2 and Jetty's HTTP/2 API. I found a variety of levers for tuning
flow control (one of the major architectural shifts of HTTP/2 over
HTTP/1), but, TL;DR, none of them reliably improved performance. You
can read a more detailed version of the analysis here:
https://issues.apache.org/jira/browse/SOLR-18087
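
For concreteness, here is the kind of client-side lever I mean,
sketched against the Jetty 12 API (the window sizes below are
illustrative values from my experiments, not recommendations):

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.http2.client.HTTP2Client;
    import org.eclipse.jetty.http2.client.transport.HttpClientTransportOverHTTP2;
    import org.eclipse.jetty.io.ClientConnector;

    public class FlowControlKnobs {
        public static HttpClient newClient() throws Exception {
            ClientConnector connector = new ClientConnector();
            HTTP2Client http2Client = new HTTP2Client(connector);
            // Connection-wide receive window, shared by all streams.
            http2Client.setInitialSessionRecvWindow(32 * 1024 * 1024);
            // Per-stream receive window.
            http2Client.setInitialStreamRecvWindow(8 * 1024 * 1024);
            HttpClient httpClient =
                new HttpClient(new HttpClientTransportOverHTTP2(http2Client));
            httpClient.start();
            return httpClient;
        }
    }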

Hopefully the tests I ran can be reproduced by running the
benchmarks I added here:
https://github.com/apache/solr/pull/4079

The linked Jira ticket has, among other things, an attachment
detailing the stream benchmark results as well as the exact JMH
command that was run to produce each result listed.

A possible next step would be to reproduce this with a minimal Jetty
example, without Solr in the mix. At a high level, we are streaming
several large files concurrently over the same HTTP/2 connection,
using Jetty’s InputStreamResponseListener to expose each response as
an InputStream. If we can demonstrate the degradation in a small
standalone test, we could share it with the Jetty project to see
if there are optimizations we are missing, or additional
flow-control knobs that should be exposed.
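
Something along these lines (a sketch only: the URL and stream count
are placeholders, and newClient() is the helper from the snippet
above; all requests to the same origin share one multiplexed
connection by default):

    import java.io.InputStream;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.InputStreamResponseListener;
    import org.eclipse.jetty.client.Response;

    public class ConcurrentStreamRepro {
        static final String URL = "https://localhost:8443/large"; // placeholder
        static final int STREAMS = 8;

        public static void main(String[] args) throws Exception {
            HttpClient client = FlowControlKnobs.newClient();
            ExecutorService pool = Executors.newFixedThreadPool(STREAMS);
            long start = System.nanoTime();
            for (int i = 0; i < STREAMS; i++) {
                pool.submit(() -> {
                    InputStreamResponseListener listener =
                        new InputStreamResponseListener();
                    client.newRequest(URL).send(listener);
                    Response response = listener.get(30, TimeUnit.SECONDS);
                    if (response.getStatus() != 200)
                        throw new IllegalStateException("HTTP " + response.getStatus());
                    long total = 0;
                    // Drain the response; slow consumption here directly
                    // delays WINDOW_UPDATEs and throttles the sender.
                    try (InputStream in = listener.getInputStream()) {
                        byte[] buf = new byte[64 * 1024];
                        for (int n; (n = in.read(buf)) != -1; )
                            total += n;
                    }
                    return total;
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            System.out.printf("elapsed: %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
            client.stop();
        }
    }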

My current understanding is that HTTP/2 is a bigger win for smaller
request/response traffic on connections that are often idle, where
header compression and multiplexing help and flow control is less
likely to be the bottleneck. For concurrent bulk streaming, HTTP-layer
flow control seems to hurt performance relative to plain TCP, which
is famously good at this kind of workload.
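
As a back-of-the-envelope check on that claim: a stream can never
move data faster than

    flow-control window / RTT

because the sender must stall once the window is exhausted and wait
roughly a round trip for a WINDOW_UPDATE. With an 8 MiB stream window
and a 1 ms intra-cluster RTT the theoretical ceiling is ~8 GiB/s, so
the raw ceiling is rarely the problem; rather, the window only
refills as fast as the receiver reads and acknowledges, so any delay
in consuming data or scheduling WINDOW_UPDATEs shows up as a stall,
layered on top of (not instead of) TCP's own flow control.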

One thing that has puzzled me is that no one else seems to be
complaining about this :-). It's possible that our setup is unique,
i.e. the problem is exacerbated by multiple shards co-located on a
single, addressable node. We may also not be fully utilizing our
network bandwidth with a single TCP connection, and thus piling onto
the flow-control overhead (though my testing suggests flow control
is a significant contributor).
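
One cheap experiment on that last point (an assumption on my part: I
have not verified when Jetty's multiplexed connection pool decides to
open extra HTTP/2 connections) would be to allow more than one TCP
connection per destination and see whether throughput scales:

    // Assumption: with a multiplexed transport this raises the cap
    // on connections the pool may open; the value 4 is arbitrary.
    httpClient.setMaxConnectionsPerDestination(4);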

I'd appreciate any thoughts the community may have about this issue.
I'd also love to hear about your Solr topology (if you are able to
share), i.e. how many shards you have in a single process and
whether those shards share a single address from the perspective of
other nodes.

Thanks,
Luke
