Given that you say you saw a lot of this and that it occurred after a specific upgrade, I would guess your root issue is JavaBin.
Unfortunately, I don't believe you can change internode communication to another format if that's the case. But if that is the case, it will also be in the way of some work I have to do over the coming months, and so I will fix it.

Mark

On Wed, May 26, 2021 at 7:46 PM Mark Miller <markrmil...@gmail.com> wrote:

> You can run into this fairly easily even if you are doing everything right
> and none of your operations are intrinsically slow or problematic.
>
> DeleteByQuery, regardless of its cost, tends to be a large contributor,
> though not always. You can mitigate that a bit with cautious, controlled
> use of it.
>
> I'm not surprised that http2 is even more prone to being involved. I
> didn't think that client was yet using an http2 version, so that is a bit
> surprising, but these things can bleed over quite easily - even more so in
> new Jetty versions.
>
> Jetty client -> server communication (and http2 in general) can be much
> pickier about connections that are not tightly managed for reuse under
> http2 (which can multiplex many requests over a single connection). If you
> don't fully read input/output streams, for example, the server doesn't
> know that you don't intend to finish dealing with your stream. It will
> wait some amount of time, and then it will whack that connection. All
> sorts of things can manifest from this depending on all kinds of factors,
> but one of them is that your client/server communication can be hosed for
> a bit. Similar things can happen, even if you do always keep your
> connection pool connections in shape, if, say, you set a Content-Length
> header that doesn't match the content. You can do a lot poorly all day and
> hardly notice a peep unless you turn on debug logging for Jetty or monitor
> TCP stats. Most of the time things won't be terrible as a result, but
> every now and then you get pretty annoying consequences - and if you have
> something aggravating in the mix, maybe more than now and then.
>
> As I said though, you can run into this dist stall issue in a variety of
> ways; you can march down the list of what can cause it and, lo and behold,
> there that stall is again for another reason.
>
> I would tune towards what is working for you - http1 appears better, so go
> with it. If you can use deleteByQuery more sparingly, I would - perhaps
> batch the deletes for times when updates are not so common; that's a big
> instigator. Be careful mixing commits in with it.
>
> You can check and make sure your server idle timeouts are higher than the
> clients' - that can help a bit.
>
> If you are using the cloud client, or even if you are not, you can hash on
> ids client side and send updates right to the leader they belong on, to
> help reduce extraneous zig-zag update traffic.
>
> In an intensive test world, this takes a large, multi-pronged approach to
> fully eliminate. In a production world under your control, you should be
> able to find a reasonable result with a bit of what you have been
> exploring and some acceptance of imperfection or use adjustment.
>
> Things can also change as Jetty dependencies are updated - though I will
> say, in my experience, not often for the better. A positive if one is
> working on development, perhaps less so in production. And again, even
> still, http 1.1 tends to be more forgiving.
>
> And deleteByQuery and some other issues are still a bit more persistent,
> but they are less likely to move backwards or unexpectedly on you than the
> more ethereal connection issues and spurious EOF parsing exceptions.
>
> Many users are not heavily troubled by the above, so likely you will find
> a workable setup. That background is essentially to say: you may find
> clear skies tomorrow, but it's not just some setup or use issue on your
> end, and while a single fix might set you up, keep in mind that a single
> fix may be what you're waiting for only by chance of your situation, so be
> proactive.
>
> MRM
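To make the deleteByQuery and leader-routing suggestions above concrete, here is a rough, untested SolrJ 8.x sketch. The ZooKeeper addresses, collection name, and delete queries are placeholders:

    import java.util.List;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class LeaderRoutedUpdates {
      public static void main(String[] args) throws Exception {
        // The cloud client hashes each document id and routes the update to
        // the shard leader it belongs on, avoiding extra inter-node hops.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty())
            .sendDirectUpdatesToShardLeadersOnly() // fail rather than fall back to non-leaders
            .build()) {
          client.setDefaultCollection("mycollection");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");
          client.add(doc);

          // Batch deletes into one request, run when adds are quiet, instead
          // of interleaving many individual deleteByQuery calls with updates.
          UpdateRequest deletes = new UpdateRequest();
          deletes.deleteByQuery("type:stale");
          deletes.deleteByQuery("expires:[* TO NOW/DAY-30DAYS]");
          deletes.process(client);

          client.commit();
        }
      }
    }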
> On Wed, May 19, 2021 at 8:48 AM Ding, Lehong (external - Project)
> <lehong.d...@sap.com.invalid> wrote:
>
>> *Background:*
>>
>> Before moving from Solr 7.7.2 to 8.8.1, we ran some performance tests on
>> Solr 8.8.1. We hit a lot of concurrent update errors in the Solr log.
>>
>> *Environment:*
>>
>> SolrCloud with 3 cluster nodes and 500 collections; 5 collections have
>> about 10m documents. (1 shard, 3 replicas)
>>
>> *Threads:*
>>
>> 30 update/add threads + 10 deleteByQuery threads
>>
>> *Results:*
>>
>> While the deleteByQuery threads are running, only one node (the leader)
>> has update transactions; the other two nodes have none.
>>
>> *Errors:*
>>
>> java.io.IOException: Request processing has stalled for 20091ms with 100
>> remaining elements in the queue.
>>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449)
>>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>>   at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345)
>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338)
>>   at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244)
>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:230)
>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
>>   at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
>>   at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
>>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
>>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)
>>
>> *Temporary Solution:*
>>
>> Adding -Dsolr.http1=1 to the Solr start parameters.
>>
>> There are still some errors in the error log, but the number is much
>> smaller.
>>
>> *My Questions:*
>>
>> 1. We found that the Solr cluster eventually gets the data consistent.
>> What do the concurrent update errors mainly impact?
>>
>> 2. Adding -Dsolr.http1=1 to the Solr start parameters reduces the error
>> count. Do we really need to add this parameter? And will this parameter
>> be kept in later versions?
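On the idle-timeout suggestion above, a minimal sketch of keeping the SolrJ client timeouts below the server's Jetty idle timeout - the URL, collection, and values are only illustrative:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ClientTimeouts {
      public static void main(String[] args) throws Exception {
        // Keep client-side timeouts below the server's idle timeout so the
        // client gives up on a connection before the server whacks it.
        try (HttpSolrClient client =
            new HttpSolrClient.Builder("http://solr1:8983/solr/mycollection")
                .withConnectionTimeout(15000) // ms to establish a connection
                .withSocketTimeout(120000)    // ms of read inactivity tolerated
                .build()) {
          client.ping();
        }
      }
    }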
>> Many Thanks.
>>
>> Thanks and Regards,
>>
>> Lehong Ding
>
> --
> - Mark
>
> http://about.me/markrmiller

--
- Mark

http://about.me/markrmiller
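P.S. Since the "fully read your streams" point above trips people up, a tiny, untested illustration with plain java.net (the URL is a placeholder):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DrainBody {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://solr1:8983/solr/admin/info/system");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
          byte[] buf = new byte[8192];
          // Read to end-of-stream even if you no longer care about the body.
          // A half-read stream leaves the connection in a state the server
          // will eventually time out and close, which can surface later as
          // spurious EOF or connection errors on an unrelated request.
          while (in.read(buf) != -1) {
            // discard
          }
        }
      }
    }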