Given you say you saw a lot of this, and that it started after a specific
upgrade, I would guess your root issue is JavaBin.

Unfortunately, I don’t believe you can change internode communication to
another format if that’s the case.

But if that is the case, it will also be in the way of some work I have to
do over the coming months, so I will fix it.

Mark

On Wed, May 26, 2021 at 7:46 PM Mark Miller <markrmil...@gmail.com> wrote:

> You can run into this fairly easily even if you are doing everything right
> and none of your operations are intrinsically slow or problematic.
>
> DeleteByQuery, regardless of its cost, tends to be a large contributor,
> though not always. You can mitigate a bit with cautious, controlled use of
> it.
>
> I’m not surprised that http2 is even more prone to being involved. I
> didn’t think that client was using an http2 version yet, so that part is a
> bit surprising, but these things can bleed over quite easily - even more
> so in new Jetty versions.
>
> Jetty client -> server communication (and http2 in general) can be much
> pickier about connections that are not tightly managed for reuse under
> http2 (which can multiplex many requests over a single connection). If you
> don’t fully read input/output streams, for example, the server doesn’t
> know that you don’t intend to finish dealing with your stream. It will
> wait some amount of time, and then it will whack that connection. All
> sorts of things can manifest from this, depending on all kinds of factors,
> but one of them is that your client-server communication can be hosed for
> a bit. Similar things can happen, even if you do always keep your
> connection pool connections in shape, if, say, you set a content-length
> header that doesn’t match the content. You can do a lot poorly all day and
> hardly notice a peep unless you turn on debug logging for Jetty or monitor
> TCP stats. And most of the time, things won’t be terrible as a result of
> it either. But every now and then you get pretty annoying consequences.
> And if you have something aggravating in the mix, maybe more than now and
> then.
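>
> To illustrate the “fully read your streams” point, here is a rough,
> generic sketch (plain java.io, not a Solr or Jetty API; the helper name is
> made up): drain whatever is left of a response stream before closing it,
> so the server sees a cleanly finished exchange rather than one it has to
> time out and whack.
>
>   import java.io.IOException;
>   import java.io.InputStream;
>
>   public final class StreamUtil {
>     private StreamUtil() {}
>
>     /** Read the stream to EOF and close it, discarding the bytes. */
>     public static void drainAndClose(InputStream in) throws IOException {
>       try (InputStream is = in) {
>         byte[] buf = new byte[8192];
>         while (is.read(buf) != -1) {
>           // discard; we only care that the stream is fully consumed
>         }
>       }
>     }
>   }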
>
> As I said though, you can run into this dist stall issue in a variety of
> ways; you can march down the list of what can cause it and, lo and behold,
> there that stall is again for another reason.
>
> I would tune towards what is working for you - http1 appears better, so
> go with it. If you can use deleteByQuery more sparingly, I would - perhaps
> batch deletes for times when updates are not so common; that’s a big
> instigator. Be careful mixing commits in with it.
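>
> For instance, if the deletes are really “delete these known ids”, here is
> a rough sketch of batching (assumed SolrJ usage; the class name, field
> names, and batch size are placeholders) that buffers ids and sends them in
> one deleteById request instead of many small deleteByQuery calls:
>
>   import java.util.ArrayList;
>   import java.util.List;
>   import org.apache.solr.client.solrj.SolrClient;
>
>   class BatchedDeleter {
>     private static final int BATCH_SIZE = 1000; // placeholder; tune for your load
>     private final SolrClient client;
>     private final String collection;
>     private final List<String> pending = new ArrayList<>();
>
>     BatchedDeleter(SolrClient client, String collection) {
>       this.client = client;
>       this.collection = collection;
>     }
>
>     synchronized void delete(String id) throws Exception {
>       pending.add(id);
>       if (pending.size() >= BATCH_SIZE) {
>         flush();
>       }
>     }
>
>     synchronized void flush() throws Exception {
>       if (!pending.isEmpty()) {
>         // one request for the whole batch, ideally when update traffic is quiet
>         client.deleteById(collection, new ArrayList<>(pending));
>         pending.clear();
>       }
>     }
>   }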
>
> You can check and make sure your server idle timeouts are higher than the
> clients’ - that can help a bit.
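>
> A minimal sketch of lining those up on the client side (assuming SolrJ’s
> Http2SolrClient builder; the method names and values here may need
> adjusting for your version, and the server side default is the Jetty idle
> timeout set in Solr’s jetty.xml):
>
>   import org.apache.solr.client.solrj.impl.Http2SolrClient;
>
>   public class ClientTimeouts {
>     public static Http2SolrClient build(String baseUrl) {
>       return new Http2SolrClient.Builder(baseUrl)
>           .idleTimeout(60_000)        // client idle timeout (ms), kept below the server's
>           .connectionTimeout(10_000)  // connect timeout (ms)
>           .build();
>     }
>   }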
>
> If you are using the cloud client, or even if you are not, you can hash
> on ids client side and send updates straight to the leader they belong on,
> to help reduce extraneous zig-zag update traffic.
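>
> A rough sketch of that with SolrJ’s CloudSolrClient (builder option names
> vary a bit across versions, and the ZooKeeper address and collection name
> below are placeholders, so treat this as an illustration, not a recipe):
> the cloud client already hashes document ids onto the right shard, and the
> leaders-only option keeps updates from being bounced through a non-leader
> replica first.
>
>   import java.util.Collections;
>   import java.util.Optional;
>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>
>   public class LeaderRoutedClient {
>     public static CloudSolrClient build() {
>       CloudSolrClient client = new CloudSolrClient.Builder(
>               Collections.singletonList("zk1:2181"), Optional.empty())
>           .sendUpdatesOnlyToShardLeaders() // route each update straight to its shard leader
>           .build();
>       client.setDefaultCollection("my_collection");
>       return client;
>     }
>   }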
>
> In an intensive test world, this takes a large, multi-pronged effort to
> fully eliminate. In a production world under your control, you should be
> able to find a reasonable result with a bit of what you have been
> exploring and some acceptance of imperfection or adjustment of use.
>
> Things can also change as Jetty dependencies are updated - though I will
> say, in my experience, not often for the better. That can be a positive if
> one is working on development, perhaps less so in production. And again,
> even still, http 1.1 tends to be more forgiving. DeleteByQuery and some
> other issues are a bit more persistent, but less likely to move backwards
> or shift unexpectedly on you than the more ethereal connection issues and
> spurious EOF parsing exceptions.
>
> Many users are not heavily troubled by the above, so likely you will find
> a workable setup. That background is essentially to say: you may find
> clear skies tomorrow, but also, it’s not just some setup or use issue on
> your end, and while a single fix might set you up, keep in mind a single
> fix may be what you’re waiting for only by chance of your situation, so be
> proactive.
>
> MRM
>
> On Wed, May 19, 2021 at 8:48 AM Ding, Lehong (external - Project)
> <lehong.d...@sap.com.invalid> wrote:
>
>> *Background:*
>>
>> Before moving to Solr 8.8.1 from 7.7.2, we performed some performance
>> tests on Solr 8.8.1. We saw a lot of concurrent update errors in the Solr log.
>>
>> *Environment:*
>>
>> SolrCloud with 3 cluster nodes and 500 collections; 5 collections have
>> about 10m documents.
>>
>> (1 shard, 3 replicas)
>>
>> *Threads:*
>>
>> 30 update/add threads + 10 deleteByQuery threads
>>
>> *Results:*
>>
>> While the deleteByQuery threads are running, only one node (the leader
>> node) has update transactions; the other two nodes have none.
>>
>> *Errors:*
>>
>> java.io.IOException: Request processing has stalled for 20091ms with 100 remaining elements in the queue.
>>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449)
>>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>>   at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345)
>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338)
>>   at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244)
>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:230)
>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
>>   at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
>>   at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
>>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
>>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)
>>
>>
>>
>> *Temporary Solution:*
>>
>> Adding -Dsolr.http1=1 to the Solr start parameters.
>>
>> There are still some errors in the error log, but the number is much smaller.
>>
>>
>>
>> *My Questions:*
>>
>> 1. We found the Solr cluster will eventually get the data consistent.
>> What is mainly impacted by the concurrent update errors?
>>
>> 2. Adding -Dsolr.http1=1 to the Solr start parameters reduces the number
>> of errors. Do we really need to add this parameter? And will this
>> parameter be kept in later versions?
>>
>> Many Thanks.
>>
>>
>>
>>
>>
>> Thanks and Regards,
>>
>> Lehong Ding
>>
>>
>>
>>
>>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller
