Anyway, returning to the original sender: I had not noticed that distributed updates *are* using the HTTP/2 concurrent update client now.

So that makes a lot of sense of why forcing HTTP/1.1 improves things; it's not nearly as sensitive to connection management. I'd still recommend sticking with HTTP/1, but one thing that would *perhaps* improve what you are seeing with HTTP/2 is to try configuring the Jetty server to multiplex only a single request per connection. These issues can show worse behavior when multiple requests are multiplexed over a single connection, because when a connection has to be closed due to connection management, all the requests multiplexed on that connection take a hit. HTTP/3 is supposed to address this limitation. It might be worth experimenting with just to see if it's any better, but you will still likely see similar behavior.
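To make that multiplexing knob concrete, here is a rough sketch using Jetty's server API directly. This is not Solr's actual startup code (Solr wires Jetty up through the XML config under server/etc, and I haven't checked exactly where the HTTP/2 connection factory is declared there); the setting I mean is maxConcurrentStreams on the HTTP/2 connection factory:

import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class SingleStreamHttp2Server {
    public static void main(String[] args) throws Exception {
        Server server = new Server();
        HttpConfiguration config = new HttpConfiguration();

        // Cleartext HTTP/2 (h2c) offered alongside HTTP/1.1.
        HTTP2CServerConnectionFactory h2c = new HTTP2CServerConnectionFactory(config);

        // The knob in question: allow only one stream per connection, so a
        // connection that gets whacked takes down at most one in-flight request.
        h2c.setMaxConcurrentStreams(1);

        ServerConnector connector =
            new ServerConnector(server, new HttpConnectionFactory(config), h2c);
        connector.setPort(8983); // Solr's usual port, purely for illustration
        server.addConnector(connector);

        server.start();
        server.join();
    }
}

Whether that actually helps in your setup is an open question, but it does limit the blast radius of a killed connection to a single request.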

If you file a Jira issue for this, though, the more you can try things and collect behavior results, the more likely someone will be able to help out with a fix.

MRM

On Tue, Jun 8, 2021 at 11:25 PM Mark Miller <[email protected]> wrote:

> A few details on how JavaBin can be involved in this type of behavior:

> JavaBin consumes data by wrapping InputStreams with a FastInputStream. FastInputStream implements an InputStream, but it disregards the InputStream contract that says you have reached the end of the stream when -1 is returned. A -1 is just another bit of data to be shifted into a byte, where it will either gum up the works or corrupt the type or size being read.

> Jetty does not give you a raw InputStream to the network that will simply suck up data like a file from disk. You will get something different depending on the version, the client, and in some cases whether HTTP/1 or 2 is in use, at least in terms of behavior.

> Generally you are going to get something backed by ByteBuffers pretending to be a raw stream.

> Depending on the circumstances, Jetty will decide to stop that stream by jumping in with a -1 when it hits conditions that lead to killing the connection.

> The specifics of when and why vary between HTTP/1 and 2. HTTP/1 will often be using chunked encoding and have little idea about the scope of the full content; it knows the size of each chunk. HTTP/2 has no such encoding, works with sessions and frames in an entirely different protocol, and needs to stop a channel or session. Multiplexing means this can affect things across requests in a way that is different from HTTP/1. In Solr, you may also have different clients depending on the protocol, with different retry implementations. Old Jetty clients offered retry; the view for the newer client is that retry is too application-specific to offer as an option, so implement it yourself if you want it. The Apache HTTP/1 client will retry based on Solr's retry configuration class.

> Given that JavaBin has little respect for streams and deals in expected sizes or its own end markers, there is plenty of room to not clear a request. Given that we try not to close streams, a holdover from the old webapp days of Tomcat and earlier Jetty versions, there is more room for poor behavior. Not closing streams is no longer helpful at all in most cases. Generally there is an attempt to read to the end of the stream, but this is a hit-or-miss strategy, especially as currently done.

> Instead of respecting stream contracts, JavaBin is told how much to read and diligently works to read to that point or, in some cases, to its own end marker.
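
> To make the contract point concrete, here is a toy illustration (not the actual FastInputStream code) of the difference between a reader that honors the -1 end-of-stream signal and one that is driven purely by an expected length:
>
> import java.io.IOException;
> import java.io.InputStream;
>
> class StreamContractToy {
>
>     // Contract-respecting: stop as soon as read() returns -1 (end of stream).
>     static int readIntRespectingEof(InputStream in) throws IOException {
>         int value = 0;
>         for (int i = 0; i < 4; i++) {
>             int b = in.read();
>             if (b == -1) {
>                 throw new IOException("stream ended early");
>             }
>             value = (value << 8) | b;
>         }
>         return value;
>     }
>
>     // The failure mode described above: -1 is treated as just another byte,
>     // so an injected end-of-stream silently corrupts the size or type being
>     // decoded instead of stopping the read.
>     static int readIntIgnoringEof(InputStream in) throws IOException {
>         int value = 0;
>         for (int i = 0; i < 4; i++) {
>             value = (value << 8) | (in.read() & 0xFF); // a -1 becomes 0xFF here
>         }
>         return value;
>     }
> }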

> So when Jetty can't tell that a stream is actually complete, it has to wait for a timeout before it will forcibly kill it, with behavior varying between HTTP/1 and 2. It may also do this for various other reasons. That causes negative effects (delays in many cases) and reset or closed exceptions. The -1 it injects into the stream won't generally get a prompt response either. Depending on timing and data, JavaBin may not even see this as an early end and will simply spit out something that is off. Retries from the same client won't help for some time in some cases (HTTP/2 being worse than 1) either. Mileage will vary between HTTP/2 and 1.1 for most of this. HTTP/2 can also send things like GOAWAY frames or session lifecycle info that lead to poorly handled exceptions in a similar vein.

> There are other things that could be at play, but this is one that behaves worse with HTTP/2 and also appears worse with Jetty 10/11 than 9. The better the system acts, the more visible these problems tend to be.

> MRM

> On Fri, Jun 4, 2021 at 3:55 PM Mark Miller <[email protected]> wrote:

>> Given that you say you saw a lot of this and that it occurred after a specific upgrade, I would guess your root issue is JavaBin.

>> Unfortunately, I don't believe you can change internode communication to another format if that's the case.

>> But if that is the case, it will also be in the way of some work I have to do over the coming months, so I will fix it.

>> Mark

>> On Wed, May 26, 2021 at 7:46 PM Mark Miller <[email protected]> wrote:

>>> You can run into this fairly easily even if you are doing everything right and none of your operations are intrinsically slow or problematic.

>>> DeleteByQuery, regardless of its cost, tends to be a large contributor, though not always. You can mitigate it a bit with cautious, controlled use.

>>> I'm not surprised that HTTP/2 is even more prone to being involved, though I didn't think that client was using an HTTP/2 version yet, so that part is a bit surprising; these things can bleed over quite easily, even more so in new Jetty versions.

>>> Jetty client -> server communication (and HTTP/2 in general) can be much more picky about connections that are not tightly managed for reuse under HTTP/2 (which can multiplex many requests over a single connection). If you don't fully read input/output streams, for example, the server doesn't know that you don't intend to finish dealing with your stream. It will wait some amount of time, and then it will whack that connection. All sorts of things can manifest from this depending on all kinds of factors, but one of them is that your client-server communication can be hosed for a bit. Similar things can happen even if you do keep your connection pool connections in shape if, say, you set a Content-Length header that doesn't match the content. You can do a lot poorly all day and hardly notice a peep unless you turn on debug logging for Jetty or monitor TCP stats, and most of the time things won't be terrible as a result either. But every now and then you get pretty annoying consequences, and if you have something aggravating in the mix, maybe more than now and then.
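
>>> The "fully read your streams" point is just this pattern, applied anywhere a response body might be abandoned partway through; a generic sketch, not Solr's actual client code:
>>>
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>>
>>> final class StreamHygiene {
>>>
>>>     // Drain whatever is left on a response stream and close it, so the
>>>     // server sees a cleanly finished request and the connection can be
>>>     // reused instead of being idle-timed-out and whacked.
>>>     static void drainAndClose(InputStream in) throws IOException {
>>>         if (in == null) {
>>>             return;
>>>         }
>>>         try (InputStream stream = in) {
>>>             byte[] scratch = new byte[8192];
>>>             while (stream.read(scratch) != -1) {
>>>                 // keep reading until the stream reports end-of-stream (-1)
>>>             }
>>>         }
>>>     }
>>> }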

>>> As I said, though, you can run into this distributed-update stall in a variety of ways; you can march down the list of what can cause it and, lo and behold, there that stall is again for another reason.

>>> I would try to tune towards what is working for you - HTTP/1 appears better, so go with it. If you can use deleteByQuery more sparingly, I would; perhaps batch those deletes for when updates are not so common, since that's a big instigator. Be careful mixing commits in with it.

>>> You can check and make sure your server idle timeouts are higher than the clients' - that can help a bit.

>>> If you are using the cloud client, or even if you are not, you can hash on ids client side and send updates straight to the leader they belong to, which helps reduce extraneous zig-zag update traffic.
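
>>> With the stock cloud client most of that routing comes for free, since it hashes each id and talks to the shard leader itself. A rough sketch (the ZooKeeper host, collection name, and field names are made up, and I'm going from memory on the builder methods, so double-check them against your SolrJ version):
>>>
>>> import java.util.Collections;
>>> import java.util.Optional;
>>>
>>> import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>> import org.apache.solr.common.SolrInputDocument;
>>>
>>> public class LeaderRoutedUpdates {
>>>     public static void main(String[] args) throws Exception {
>>>         try (CloudSolrClient client = new CloudSolrClient.Builder(
>>>                 Collections.singletonList("zk1:2181"), Optional.empty())
>>>                 // the default, made explicit: route each update to the
>>>                 // leader of the shard its id hashes to
>>>                 .sendUpdatesOnlyToShardLeaders()
>>>                 .build()) {
>>>             client.setDefaultCollection("my_collection");
>>>
>>>             SolrInputDocument doc = new SolrInputDocument();
>>>             doc.addField("id", "doc-1");
>>>             doc.addField("title_s", "hello");
>>>             client.add(doc);
>>>             client.commit();
>>>         }
>>>     }
>>> }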

>>> In an intensive test world, this takes a large, multi-pronged effort to fully eliminate. In a production world under your control, you should be able to find a reasonable result with a bit of what you have been exploring and some acceptance of imperfection or adjusted usage.

>>> Things can also change as Jetty dependencies are updated - though I will say, in my experience, not often for the better; a positive if one is working on development, perhaps less so in production. And again, HTTP/1.1 tends to be more forgiving. DeleteByQuery and some other issues are a bit more persistent, but they are less likely to move backwards or shift unexpectedly on you than the more ethereal connection issues and spurious EOF parsing exceptions.

>>> Many users are not heavily troubled by the above, so you will likely find a workable setup. That background is essentially to say: you may find clear skies tomorrow, but it's not just some setup or usage issue on your end, and while a single fix might set you up, keep in mind that a single fix may be what you're waiting for only by chance of your situation, so be proactive.

>>> MRM

>>> On Wed, May 19, 2021 at 8:48 AM Ding, Lehong (external - Project) <[email protected]> wrote:

>>>> *Background:*

>>>> Before moving to Solr 8.8.1 from 7.7.2, we performed some performance tests on Solr 8.8.1. We saw a lot of concurrent update errors in the Solr log.

>>>> *Environment:*

>>>> SolrCloud with 3 cluster nodes and 500 collections; 5 collections have about 10M documents (1 shard, 3 replicas).

>>>> *Threads:*

>>>> 30 update/add threads + 10 deleteByQuery threads

>>>> *Results:*

>>>> While the deleteByQuery threads are running, only one node (the leader node) has update transactions; the other two nodes have none.

>>>> *Errors:*

>>>> java.io.IOException: Request processing has stalled for 20091ms with 100 remaining elements in the queue.
>>>>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449)
>>>>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>>>>   at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345)
>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338)
>>>>   at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244)
>>>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300)
>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:230)
>>>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
>>>>   at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
>>>>   at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
>>>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
>>>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
>>>>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
>>>>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
>>>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)

>>>> *Temporary Solution:*

>>>> Adding -Dsolr.http1=1 to the Solr start parameters. There are still some errors in the log, but the number is much smaller.

>>>> *My Questions:*

>>>> 1. We found that the Solr cluster will eventually get the data consistent. What does the concurrent update error mainly impact?

>>>> 2. Adding -Dsolr.http1=1 to the Solr start parameters can reduce the number of errors. Do we really need to add this parameter? And will this parameter be kept in later versions?

>>>> Many Thanks.

>>>> Thanks and Regards,

>>>> Lehong Ding

>>> --
>>> - Mark
>>> http://about.me/markrmiller

>> --
>> - Mark
>> http://about.me/markrmiller

> --
> - Mark
> http://about.me/markrmiller

--
- Mark
http://about.me/markrmiller
