Anyway, returning to the original sender: I had not noticed that distributed updates *are* using the HTTP/2 concurrent update client now.

So that makes a lot of sense of why forcing HTTP/1.1 improves things; it's not nearly as sensitive to connection management. I'd still recommend sticking with HTTP/1, but one thing that would *perhaps* improve what you are seeing with HTTP/2 is to try configuring the Jetty server to multiplex only a single request per connection. These issues can show worse behavior when multiple requests are multiplexed over a single connection, because when a connection has to be closed due to connection management, all the requests multiplexed on that connection take a hit. HTTP/3 is supposed to address this limitation. It might be worth experimenting with just to see if it's any better, but you will still likely see similar behavior.
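To make that multiplexing knob concrete, here is a rough sketch using Jetty's server API directly. This is not Solr's actual startup code (Solr wires Jetty up through the XML config under server/etc, and I haven't checked exactly where the HTTP/2 connection factory is declared there); the setting I mean is maxConcurrentStreams on the HTTP/2 connection factory:

import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class SingleStreamHttp2Server {
    public static void main(String[] args) throws Exception {
        Server server = new Server();
        HttpConfiguration config = new HttpConfiguration();

        // Cleartext HTTP/2 (h2c) offered alongside HTTP/1.1.
        HTTP2CServerConnectionFactory h2c = new HTTP2CServerConnectionFactory(config);

        // The knob in question: allow only one stream per connection, so a
        // connection that gets whacked takes down at most one in-flight request.
        h2c.setMaxConcurrentStreams(1);

        ServerConnector connector =
            new ServerConnector(server, new HttpConnectionFactory(config), h2c);
        connector.setPort(8983); // Solr's usual port, purely for illustration
        server.addConnector(connector);

        server.start();
        server.join();
    }
}

Whether that actually helps in your setup is an open question, but it does limit the blast radius of a killed connection to a single request.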

If you file a Jira issue for this, though, the more you can try things and collect behavior results, the more likely someone will be able to help out with a fix.

MRM

On Tue, Jun 8, 2021 at 11:25 PM Mark Miller <[email protected]> wrote:

> A few details on how JavaBin can be involved in this type of behavior:

> JavaBin consumes data by wrapping InputStreams with a FastInputStream. FastInputStream implements an InputStream, but it disregards the InputStream contract that says you have reached the end of the stream when -1 is returned. A -1 is just another bit of data to be shifted into a byte, where it will either gum up the works or corrupt the type or size being read.

> Jetty does not give you a raw InputStream to the network that will simply suck up data like a file from disk. You will get something different depending on the version, the client, and in some cases whether HTTP/1 or 2 is in use, at least in terms of behavior.

> Generally you are going to get something backed by ByteBuffers pretending to be a raw stream.

> Depending on the circumstances, Jetty will decide to stop that stream by jumping in with a -1 when it hits conditions that lead to killing the connection.

> The specifics of when and why vary between HTTP/1 and 2. HTTP/1 will often be using chunked encoding and have little idea about the scope of the full content; it knows the size of each chunk. HTTP/2 has no such encoding, works with sessions and frames in an entirely different protocol, and needs to stop a channel or session. Multiplexing means this can affect things across requests in a way that is different from HTTP/1. In Solr, you may also have different clients depending on the protocol, with different retry implementations. Old Jetty clients offered retry; the view for the newer client is that retry is too application-specific to offer as an option, so implement it yourself if you want it. The Apache HTTP/1 client will retry based on Solr's retry configuration class.

> Given that JavaBin has little respect for streams and deals in expected sizes or its own end markers, there is plenty of room to not clear a request. Given that we try not to close streams, a holdover from the old webapp days of Tomcat and earlier Jetty versions, there is more room for poor behavior. Not closing streams is no longer helpful at all in most cases. Generally there is an attempt to read to the end of the stream, but this is a hit-or-miss strategy, especially as currently done.

> Instead of respecting stream contracts, JavaBin is told how much to read and diligently works to read to that point or, in some cases, to its own end marker.
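
> To make the contract point concrete, here is a toy illustration (not the actual FastInputStream code) of the difference between a reader that honors the -1 end-of-stream signal and one that is driven purely by an expected length:
>
> import java.io.IOException;
> import java.io.InputStream;
>
> class StreamContractToy {
>
>     // Contract-respecting: stop as soon as read() returns -1 (end of stream).
>     static int readIntRespectingEof(InputStream in) throws IOException {
>         int value = 0;
>         for (int i = 0; i < 4; i++) {
>             int b = in.read();
>             if (b == -1) {
>                 throw new IOException("stream ended early");
>             }
>             value = (value << 8) | b;
>         }
>         return value;
>     }
>
>     // The failure mode described above: -1 is treated as just another byte,
>     // so an injected end-of-stream silently corrupts the size or type being
>     // decoded instead of stopping the read.
>     static int readIntIgnoringEof(InputStream in) throws IOException {
>         int value = 0;
>         for (int i = 0; i < 4; i++) {
>             value = (value << 8) | (in.read() & 0xFF); // a -1 becomes 0xFF here
>         }
>         return value;
>     }
> }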

> So when Jetty can't tell that a stream is actually complete, it has to wait for a timeout before it will forcibly kill it, with behavior varying between HTTP/1 and 2. It may also do this for various other reasons. That causes negative effects (delays in many cases) and reset or closed exceptions. The -1 it injects into the stream won't generally get a prompt response either. Depending on timing and data, JavaBin may not even see this as an early end and will simply spit out something that is off. Retries from the same client won't help for some time in some cases (HTTP/2 being worse than 1) either. Mileage will vary between HTTP/2 and 1.1 for most of this. HTTP/2 can also send things like GOAWAY frames or session lifecycle info that lead to poorly handled exceptions in a similar vein.

> There are other things that could be at play, but this is one that behaves worse with HTTP/2 and also appears worse with Jetty 10/11 than 9. The better the system acts, the more visible these problems tend to be.

> MRM

> On Fri, Jun 4, 2021 at 3:55 PM Mark Miller <[email protected]> wrote:

>> Given that you say you saw a lot of this and that it occurred after a specific upgrade, I would guess your root issue is JavaBin.

>> Unfortunately, I don't believe you can change internode communication to another format if that's the case.

>> But if that is the case, it will also be in the way of some work I have to do over the coming months, so I will fix it.

>> Mark

>> On Wed, May 26, 2021 at 7:46 PM Mark Miller <[email protected]> wrote:

>>> You can run into this fairly easily even if you are doing everything right and none of your operations are intrinsically slow or problematic.

>>> DeleteByQuery, regardless of its cost, tends to be a large contributor, though not always. You can mitigate it a bit with cautious, controlled use.

>>> I'm not surprised that HTTP/2 is even more prone to being involved, though I didn't think that client was using an HTTP/2 version yet, so that part is a bit surprising; these things can bleed over quite easily, even more so in new Jetty versions.

>>> Jetty client -> server communication (and HTTP/2 in general) can be much more picky about connections that are not tightly managed for reuse under HTTP/2 (which can multiplex many requests over a single connection). If you don't fully read input/output streams, for example, the server doesn't know that you don't intend to finish dealing with your stream. It will wait some amount of time, and then it will whack that connection. All sorts of things can manifest from this depending on all kinds of factors, but one of them is that your client-server communication can be hosed for a bit. Similar things can happen even if you do keep your connection pool connections in shape if, say, you set a Content-Length header that doesn't match the content. You can do a lot poorly all day and hardly notice a peep unless you turn on debug logging for Jetty or monitor TCP stats, and most of the time things won't be terrible as a result either. But every now and then you get pretty annoying consequences, and if you have something aggravating in the mix, maybe more than now and then.
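
>>> The "fully read your streams" point is just this pattern, applied anywhere a response body might be abandoned partway through; a generic sketch, not Solr's actual client code:
>>>
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>>
>>> final class StreamHygiene {
>>>
>>>     // Drain whatever is left on a response stream and close it, so the
>>>     // server sees a cleanly finished request and the connection can be
>>>     // reused instead of being idle-timed-out and whacked.
>>>     static void drainAndClose(InputStream in) throws IOException {
>>>         if (in == null) {
>>>             return;
>>>         }
>>>         try (InputStream stream = in) {
>>>             byte[] scratch = new byte[8192];
>>>             while (stream.read(scratch) != -1) {
>>>                 // keep reading until the stream reports end-of-stream (-1)
>>>             }
>>>         }
>>>     }
>>> }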

>>> As I said, though, you can run into this distributed-update stall in a variety of ways; you can march down the list of what can cause it and, lo and behold, there that stall is again for another reason.

>>> I would try to tune towards what is working for you - HTTP/1 appears better, so go with it. If you can use deleteByQuery more sparingly, I would; perhaps batch those deletes for when updates are not so common, since that's a big instigator. Be careful mixing commits in with it.

>>> You can check and make sure your server idle timeouts are higher than the clients' - that can help a bit.

>>> If you are using the cloud client, or even if you are not, you can hash on ids client side and send updates straight to the leader they belong to, which helps reduce extraneous zig-zag update traffic.
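
>>> With the stock cloud client most of that routing comes for free, since it hashes each id and talks to the shard leader itself. A rough sketch (the ZooKeeper host, collection name, and field names are made up, and I'm going from memory on the builder methods, so double-check them against your SolrJ version):
>>>
>>> import java.util.Collections;
>>> import java.util.Optional;
>>>
>>> import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>> import org.apache.solr.common.SolrInputDocument;
>>>
>>> public class LeaderRoutedUpdates {
>>>     public static void main(String[] args) throws Exception {
>>>         try (CloudSolrClient client = new CloudSolrClient.Builder(
>>>                 Collections.singletonList("zk1:2181"), Optional.empty())
>>>                 // the default, made explicit: route each update to the
>>>                 // leader of the shard its id hashes to
>>>                 .sendUpdatesOnlyToShardLeaders()
>>>                 .build()) {
>>>             client.setDefaultCollection("my_collection");
>>>
>>>             SolrInputDocument doc = new SolrInputDocument();
>>>             doc.addField("id", "doc-1");
>>>             doc.addField("title_s", "hello");
>>>             client.add(doc);
>>>             client.commit();
>>>         }
>>>     }
>>> }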

>>> In an intensive test world, this takes a large, multi-pronged effort to fully eliminate. In a production world under your control, you should be able to find a reasonable result with a bit of what you have been exploring and some acceptance of imperfection or adjusted usage.

>>> Things can also change as Jetty dependencies are updated - though I will say, in my experience, not often for the better; a positive if one is working on development, perhaps less so in production. And again, HTTP/1.1 tends to be more forgiving. DeleteByQuery and some other issues are a bit more persistent, but they are less likely to move backwards or shift unexpectedly on you than the more ethereal connection issues and spurious EOF parsing exceptions.

>>> Many users are not heavily troubled by the above, so you will likely find a workable setup. That background is essentially to say: you may find clear skies tomorrow, but it's not just some setup or usage issue on your end, and while a single fix might set you up, keep in mind that a single fix may be what you're waiting for only by chance of your situation, so be proactive.

>>> MRM

>>> On Wed, May 19, 2021 at 8:48 AM Ding, Lehong (external - Project) <[email protected]> wrote:

>>>> *Background:*

>>>> Before moving to Solr 8.8.1 from 7.7.2, we performed some performance tests on Solr 8.8.1. We saw a lot of concurrent update errors in the Solr log.

>>>> *Environment:*

>>>> SolrCloud with 3 cluster nodes and 500 collections; 5 collections have about 10M documents (1 shard, 3 replicas).

>>>> *Threads:*

>>>> 30 update/add threads + 10 deleteByQuery threads

>>>> *Results:*

>>>> While the deleteByQuery threads are running, only one node (the leader node) has update transactions; the other two nodes have none.

>>>> *Errors:*

>>>> java.io.IOException: Request processing has stalled for 20091ms with 100 remaining elements in the queue.
>>>>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449)
>>>>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>>>>   at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345)
>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338)
>>>>   at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244)
>>>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300)
>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:230)
>>>>   at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
>>>>   at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
>>>>   at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
>>>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
>>>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
>>>>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
>>>>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
>>>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)

>>>> *Temporary Solution:*

>>>> Adding -Dsolr.http1=1 to the Solr start parameters. There are still some errors in the log, but the number is much smaller.

>>>> *My Questions:*

>>>> 1. We found that the Solr cluster will eventually get the data consistent. What does the concurrent update error mainly impact?

>>>> 2. Adding -Dsolr.http1=1 to the Solr start parameters can reduce the number of errors. Do we really need to add this parameter? And will this parameter be kept in later versions?

>>>> Many Thanks.

>>>> Thanks and Regards,

>>>> Lehong Ding

>>> --
>>> - Mark
>>> http://about.me/markrmiller

>> --
>> - Mark
>> http://about.me/markrmiller

> --
> - Mark
> http://about.me/markrmiller

--
- Mark
http://about.me/markrmiller
