You can run into this fairly easily even if you are doing everything right
and none of your operations are intrinsically slow or problematic.

DeleteByQuery, regardless of its cost, tends to be a large contributor,
though not always. You can mitigate a bit with cautious, controlled use of
it.

I’m not surprised that http2 is more prone to being involved. I didn’t
think that client was using an http2 version yet, so that part is a bit
surprising, but these things can bleed over quite easily, even more so in
newer Jetty versions.

Jetty client -> server communication (and http2 in general) can be much
more picky about connections that are not tightly managed for reuse, since
http2 can multiplex many requests over a single connection. If you don’t
fully read input/output streams, for example, the server doesn’t know that
you don’t intend to finish dealing with your stream. It will wait some
amount of time, and then it will whack that connection. All sorts of things
can manifest from this depending on all kinds of factors, but one of them
is that your client-server communication can be hosed for a bit. Similar
things can happen even if you always keep your connection pool connections
in shape if, say, you set a content length header that doesn’t match the
content. You can do a lot poorly all day and hardly notice a peep unless
you turn on debug logging for Jetty or monitor TCP stats. And most of the
time, things won’t be terrible as a result of it either. But every now and
then you get pretty annoying consequences. And if you have something
aggravating in the mix, maybe more than now and then.
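
To make the "fully read your streams" point concrete, here is a minimal
sketch using only the Java standard library. It is not Solr or Jetty code;
the class and method names (DrainExample, drainAndClose) are made up for
illustration, and the ByteArrayInputStream stands in for a real response
body. The idea is simply to consume whatever is left of a body before
closing it, so the other side is not left waiting on a half-read stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Solr, SolrJ, or Jetty.
class DrainExample {

    /** Reads and discards the rest of the stream, then closes it. Returns the bytes discarded. */
    static long drainAndClose(InputStream in) throws IOException {
        try (InputStream body = in) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            // read() returns -1 once the stream is fully consumed
            while ((n = body.read(buf)) != -1) {
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body that the caller only partially read.
        InputStream response = new ByteArrayInputStream(new byte[20000]);
        response.read(new byte[100]);
        System.out.println("discarded " + drainAndClose(response) + " bytes");
    }
}
```

Whether draining or aborting the request is the better choice depends on
the client library and how much data is left, so treat this as one option
rather than a rule.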

As I said though, you can run into this dist stall issue in a variety of
ways. You can march down the list of what can cause it and, lo and behold,
there that stall is again for another reason.

I would try and tune towards what is working for you - http1 appears better
- go with it. If you can use delete by query more sparingly, I would;
perhaps batch them for when updates are not so common. That’s a big
instigator. Be careful mixing in commits with it.
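
One possible reading of "batch them", sketched with only the Java standard
library: collect the pending delete queries and fold them into a single
boolean query, so that one delete by query goes out during a quiet period
instead of many going out alongside updates. The class and method names
(DeleteBatcher, combine) and the sample queries are made up for
illustration. It also assumes each query is valid on its own in the
standard query parser and is not purely negative, since a purely negative
clause may not behave as expected once nested in parentheses.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class DeleteBatcher {

    /** Joins individual delete queries into one query of the form (q1) OR (q2) ... */
    static String combine(List<String> queries) {
        if (queries.isEmpty()) {
            // An empty delete query is never what you want to send.
            throw new IllegalArgumentException("no delete queries to combine");
        }
        return queries.stream()
                .map(q -> "(" + q + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // Sample queries; field names are assumptions, not from the original setup.
        List<String> pending = List.of("type:tmp", "expires:[* TO NOW]");
        String combined = combine(pending);
        System.out.println(combined);
        // With SolrJ, the combined string would then go out as a single request,
        // e.g. through SolrClient.deleteByQuery(collection, combined).
    }
}
```

This reduces the number of delete by query requests, not the cost of the
deletes themselves, so scheduling them away from heavy update periods
still matters.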

You can check and make sure your server idle timeouts are higher than the
client’s - that can help a bit.

If you are using the cloud client or even if you are not, you can hash on
ids client side and send updates right to the leader they belong on to help
reduce extraneous zig zag update traffic.

In an intensive test world, this takes a large, multi-pronged path to fully
eliminate. In a production world under your control, you should be able to
find a reasonable result with a bit of what you have been exploring and
some acceptance of imperfection or adjustment of use.

Things can also change as Jetty dependencies are updated - though I will
say in my experience not often for the better. A positive if one is working
on development, perhaps less so in production. And again, even so, http
1.1 tends to be more forgiving. Delete by query and some other issues are
a bit more persistent, but they are less likely to move backwards or shift
unexpectedly on you than the more ethereal connection issues and spurious
EOF parsing exceptions.

Many users are not heavily troubled by the above, so likely you will find a
workable setup. That background is essentially to say: you may find clear
skies tomorrow, but it’s not just some setup or use issue on your end. And
while a single fix might set you up, keep in mind that a single fix may be
all you’re waiting for only by chance of your situation, so be proactive.

MRM

On Wed, May 19, 2021 at 8:48 AM Ding, Lehong (external - Project)
<lehong.d...@sap.com.invalid> wrote:

> *Background:*
>
> Before moving from Solr 7.7.2 to 8.8.1, we ran some performance tests on
> Solr 8.8.1. We saw a lot of concurrent update errors in the Solr log.
>
> *Environment:*
>
> solrCloud with 3 cluster nodes with 500 collections, 5 collections have
> about 10m documents.
>
> (1 shard, 3 replica)
>
> *Threads:*
>
> 30 update/add threads + 10 deleteByQuery threads
>
> *Results:*
>
> While the deleteByQuery threads are running, only one node (the leader)
> has update transactions; the other two nodes have none.
>
> *Errors:*
>
> java.io.IOException: Request processing has stalled for 20091ms with 100 remaining elements in the queue.
> java.io.IOException: Request processing has stalled for 20091ms with 100 remaining elements in the queue.
>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449)
>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>   at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345)
>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338)
>   at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244)
>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300)
>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:230)
>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
>   at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
>   at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)
>
>
>
> *Temporary Solution:*
>
> adding -Dsolr.http1=1 in solr start parameters
>
> There are still some errors in the error log, but the number is much
> smaller.
>
>
>
> *My Questions:*
>
> 1 We found the Solr cluster eventually becomes consistent. What does the
> concurrent update error mainly impact?
>
> 2 Adding -Dsolr.http1=1 to the Solr start parameters reduces the number of
> errors. Do we really need to add this parameter? And will this parameter
> be kept in later versions?
>
> Many Thanks.
>
>
>
>
>
> Thanks and Regards,
>
> Lehong Ding
>
>
>
>
>
-- 
- Mark

http://about.me/markrmiller
