There is no line of code to point to.

I see in the stack trace now that SolrCmdDistributor was updated to the
http2 client.

Lots more stalls and exceptions are seen unless that client is changed to
use http1 instead of http2.

But a lesser problem still remains with 1.1.

That is a bit of info on why this would be possible after switching from
the Apache client to the Jetty client, and why http2 can be worse than
http1.

Poor connection management, with JavaBin as an additional aggravation.

That’s not a line of code. Probably not even a single Jira issue.

MRM

On Wed, Jun 9, 2021 at 8:47 AM David Smiley <dsmi...@apache.org> wrote:

> Mark, can you file a JIRA issue (if one doesn't already exist), ideally
> pointing to a line of code that is problematic?
>
> Do you think we should drop the "closeShield" stuff in
> SolrDispatchFilter?  Not sure if there is a test exercising the scenario
> that led to its inclusion; I doubt it.
>
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Jun 9, 2021 at 12:25 AM Mark Miller <markrmil...@gmail.com> wrote:
>
>> A few details on how JavaBin can be involved in this type of behavior:
>>
>> JavaBin consumes data by wrapping InputStreams with a FastInputStream.
>> FastInputStream implements InputStream, but it disregards the
>> InputStream contract that says you have reached the end of the stream
>> when -1 is returned. To it, a -1 is just another bit of data to be
>> shifted into a byte, where it can either gum up the works or corrupt
>> the type or size being read.
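>>
>> As a minimal sketch of that failure mode (not Solr’s actual
>> FastInputStream code): InputStream.read() returns -1 at end of stream,
>> but cast to a byte it becomes 0xFF, indistinguishable from real data,
>> so a decoder that skips the EOF check shifts it into a type tag or
>> length:
>>
>> import java.io.ByteArrayInputStream;
>> import java.io.IOException;
>> import java.io.InputStream;
>>
>> public class EofDemo {
>>   public static void main(String[] args) throws IOException {
>>     InputStream in = new ByteArrayInputStream(new byte[] {0x01, 0x02});
>>     in.read();
>>     in.read(); // consume the two real bytes
>>     int raw = in.read(); // -1: end of stream, per the contract
>>     byte b = (byte) raw; // 0xFF: looks like just another data byte
>>     // A decoder that ignores the -1 and shifts b into a size or type
>>     // field reads garbage instead of stopping here.
>>     System.out.println(raw + " as byte: " + b);
>>   }
>> }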
>>
>> Jetty does not give you a raw InputStream to the network that will
>> simply suck up data like a file from disk. You will get something
>> different depending on the version, depending on the client, and in
>> some cases depending on http1 or 2, at least in terms of behavior.
>>
>> Generally you are going to get something backed by ByteBuffers pretending
>> to be a raw stream.
>>
>> When Jetty hits circumstances that lead to killing the connection, it
>> will decide to stop that stream by jumping in with a -1.
>>
>> The specifics for when and why will vary between http1 and 2. Http1
>> will often be using chunked encoding and have little idea about the
>> scope of the full content; it knows the size of each chunk. Http2 has
>> no such encoding, works with sessions and frames and an entirely
>> different protocol, and needs to stop a channel or session.
>> Multiplexing means this can affect things across requests in a way
>> that is different from http1. In Solr, you may also have different
>> clients depending on the protocol, with different retry
>> implementations. Old Jetty clients offered retry; the view for the
>> newer client is that retry is too application specific to offer as an
>> option, so implement it yourself if you want it. The Apache http1
>> client will retry based on the Solr retry configuration class.
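>>
>> Since the newer Jetty-based client leaves retry to the application, a
>> retry wrapper ends up in user code. A hypothetical sketch (the policy
>> of three attempts with linear backoff is an assumption, not anything
>> Solr ships):
>>
>> import java.io.IOException;
>> import org.apache.solr.client.solrj.SolrClient;
>> import org.apache.solr.client.solrj.SolrRequest;
>> import org.apache.solr.client.solrj.SolrServerException;
>> import org.apache.solr.common.util.NamedList;
>>
>> public class RetryingRequest {
>>   static NamedList<Object> request(SolrClient client, SolrRequest<?> req)
>>       throws SolrServerException, IOException, InterruptedException {
>>     SolrServerException last = null;
>>     for (int attempt = 1; attempt <= 3; attempt++) {
>>       try {
>>         return client.request(req); // may die on a reset/closed connection
>>       } catch (SolrServerException e) {
>>         last = e;
>>         Thread.sleep(500L * attempt); // simple linear backoff
>>       }
>>     }
>>     throw last;
>>   }
>> }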
>>
>> Given that JavaBin has little respect for streams and deals in
>> expected sizes or its own end markers, there is plenty of room to fail
>> to clear a request. Given that we try not to close streams, a holdover
>> from the old webapp days of Tomcat and earlier Jetty versions, there
>> is even more room for poor behavior. Not closing streams is no longer
>> helpful at all in most cases. Generally there is an attempt to read to
>> the end of the stream, but this is a hit or miss strategy, especially
>> as currently done.
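>>
>> For illustration, draining to EOF is the kind of cleanup that keeps a
>> pooled connection reusable (a sketch, not Solr’s code):
>>
>> import java.io.IOException;
>> import java.io.InputStream;
>>
>> final class StreamDrain {
>>   // Reading until -1 tells the connection layer the body is fully
>>   // consumed and the connection can be reused. Stopping at an expected
>>   // length instead leaves the server guessing whether we are done.
>>   static void drain(InputStream in) throws IOException {
>>     byte[] buf = new byte[8192];
>>     while (in.read(buf) != -1) {
>>       // discard
>>     }
>>   }
>> }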
>>
>> Instead of respecting stream contracts, JavaBin is given how much to
>> read, and it diligently works to read to that point or, in some cases,
>> to its own end marker.
>>
>> So when Jetty can’t tell a stream is actually complete, it needs to
>> time out before it will forcibly kill it, with behavior varying
>> between http1 and 2. It may also try to do this for various other
>> reasons. That will cause negative effects (delays in many cases) and
>> reset or closed exceptions. The -1 it injects into the stream won’t
>> generally get a prompt response either. Depending on timing and data,
>> JavaBin may not even see this as an early end and will simply spit out
>> something off. Retries from the same client won’t help for some time
>> in some cases (http2 worse than 1) either. Mileage will vary depending
>> on http2 or 1.1 for most of this. Http2 can also send things like
>> GoAway or session lifecycle info that will lead to poorly handled
>> exceptions in a similar vein.
>>
>> There are other things that could be at play, but this is one that
>> behaves worse with http2 and also appears worse with Jetty 10 / 11 than 9.
>> The better the system acts, the more visible these problems tend to be.
>>
>> MRM
>>
>> On Fri, Jun 4, 2021 at 3:55 PM Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Given that you say you saw a lot of this and that it occurred after a
>>> specific upgrade, I would guess your root issue is JavaBin.
>>>
>>> Unfortunately, I don’t believe you can change internode communication to
>>> another format if that’s the case.
>>>
>>> But also if that is the case, it will be in the way of some work I have
>>> to do over the coming months and so I will fix it.
>>>
>>> Mark
>>>
>>> On Wed, May 26, 2021 at 7:46 PM Mark Miller <markrmil...@gmail.com>
>>> wrote:
>>>
>>>> You can run into this fairly easily even if you are doing everything
>>>> right and none of your operations are intrinsically slow or problematic.
>>>>
>>>> DeleteByQuery, regardless of its cost, tends to be a large contributor,
>>>> though not always. You can mitigate a bit with cautious, controlled use of
>>>> it.
>>>>
>>>> I’m not surprised that http2 is even more prone to being involved,
>>>> though I didn’t think that client was yet using an http2 version, so
>>>> that part is a bit surprising. These things can bleed over quite
>>>> easily, even more so in new Jetty versions.
>>>>
>>>> Jetty client -> server communication (and http2 in general) can be
>>>> much more picky about handling connections that are not tightly
>>>> managed for reuse under http2 (which can multiplex many requests over
>>>> a single connection). If you don’t fully read input/output streams,
>>>> for example, the server doesn’t know that you don’t intend to finish
>>>> dealing with your stream. It will wait some amount of time, and then
>>>> it will whack that connection. All sorts of things can manifest from
>>>> this depending on all kinds of factors, but one of them is that your
>>>> client-server communication can be hosed for a bit. Similar things
>>>> can happen, even if you do always keep your connection pool
>>>> connections in shape, if, say, you set a content length header that
>>>> doesn’t match the content (see the sketch below). You can do a lot
>>>> poorly all day and hardly notice a peep unless you turn on debug
>>>> logging for Jetty or monitor tcp stats. And most of the time, things
>>>> won’t be terrible as a result of it either. But every now and then
>>>> you get pretty annoying consequences. And if you have something
>>>> aggravating in the mix, maybe more than now and then.
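>>>>
>>>> A hypothetical sketch of that content length mismatch (the URL and
>>>> numbers are illustrative; this is plain HttpURLConnection, not Solr
>>>> code):
>>>>
>>>> import java.io.OutputStream;
>>>> import java.net.HttpURLConnection;
>>>> import java.net.URL;
>>>>
>>>> public class LengthMismatch {
>>>>   public static void main(String[] args) throws Exception {
>>>>     URL url = new URL("http://localhost:8983/solr/demo/update");
>>>>     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>>>     conn.setDoOutput(true);
>>>>     conn.setRequestMethod("POST");
>>>>     conn.setFixedLengthStreamingMode(100); // promise 100 bytes...
>>>>     try (OutputStream out = conn.getOutputStream()) {
>>>>       out.write(new byte[50]); // ...but send only 50
>>>>     }
>>>>     // HttpURLConnection happens to catch this itself: close() above
>>>>     // throws "insufficient data written". A client that streams
>>>>     // without that check leaves the server waiting for the missing
>>>>     // bytes until it times out and whacks the connection.
>>>>   }
>>>> }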
>>>>
>>>> As I said though, you can run into this dist stall issue in a variety
>>>> of ways; you can march down the list of what can cause it and, lo and
>>>> behold, there that stall is again for another reason.
>>>>
>>>> I would try to tune towards what is working for you - http1 appears
>>>> better - go with it. If you can use delete by query more sparingly, I
>>>> would - it’s a big instigator. Perhaps batch the deletes when updates
>>>> are not so common (a sketch below), and be careful mixing in commits
>>>> with it.
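>>>>
>>>> Batching might look like this with SolrJ’s UpdateRequest (the
>>>> collection name "demo" is illustrative):
>>>>
>>>> import java.util.List;
>>>> import org.apache.solr.client.solrj.SolrClient;
>>>> import org.apache.solr.client.solrj.request.UpdateRequest;
>>>>
>>>> public class BatchedDeletes {
>>>>   static void deleteBatch(SolrClient client, List<String> queries)
>>>>       throws Exception {
>>>>     UpdateRequest req = new UpdateRequest();
>>>>     for (String q : queries) {
>>>>       req.deleteByQuery(q); // accumulate instead of one request each
>>>>     }
>>>>     req.process(client, "demo"); // one round trip; commit separately
>>>>   }
>>>> }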
>>>>
>>>> You can check and make sure your server idle timeouts are higher than
>>>> the clients’ - that can help a bit.
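>>>>
>>>> For example (assuming the Solr 8.x Http2SolrClient builder; the URL
>>>> and values are illustrative), keeping the client’s idle timeout below
>>>> the server’s default of 120s (solr.jetty.http.idleTimeout in Solr’s
>>>> jetty.xml):
>>>>
>>>> import org.apache.solr.client.solrj.impl.Http2SolrClient;
>>>>
>>>> public class TimeoutSetup {
>>>>   public static Http2SolrClient client() {
>>>>     return new Http2SolrClient.Builder("http://localhost:8983/solr")
>>>>         .idleTimeout(30000) // ms; below the server’s idle timeout
>>>>         .build();
>>>>   }
>>>> }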
>>>>
>>>> If you are using the cloud client, or even if you are not, you can
>>>> hash on ids client side and send updates right to the leader they
>>>> belong on, to help reduce extraneous zig-zag update traffic.
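>>>>
>>>> SolrJ’s CloudSolrClient does this hashing for you (the compositeId
>>>> router); a sketch, with an illustrative ZooKeeper address:
>>>>
>>>> import java.util.Collections;
>>>> import java.util.Optional;
>>>> import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>>>
>>>> public class LeaderRouting {
>>>>   public static CloudSolrClient client() {
>>>>     return new CloudSolrClient.Builder(
>>>>             Collections.singletonList("zk1:2181"), Optional.empty())
>>>>         // send updates straight to each shard leader, never via a
>>>>         // non-leader, avoiding the extra forwarding hop
>>>>         .sendDirectUpdatesToShardLeadersOnly()
>>>>         .build();
>>>>   }
>>>> }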
>>>>
>>>> In an intensive test world, this takes a large, multi-pronged path to
>>>> fully eliminate. In a production world under your control, you should
>>>> be able to find a reasonable result with a bit of what you have been
>>>> exploring and some acceptance of imperfection or use adjustment.
>>>>
>>>> Things can also change as Jetty dependencies are updated - though I
>>>> will say, in my experience, not often for the better. A positive if
>>>> one is working on development, perhaps less so in production. And
>>>> again, even so, http 1.1 tends to be more forgiving. Delete by query
>>>> and some other issues are a bit more persistent, but less likely to
>>>> move backwards or unexpectedly on you than the more ethereal
>>>> connection issues and spurious EOF parsing exceptions.
>>>>
>>>> Many users are not heavily troubled by the above, so you will likely
>>>> find a workable setup. That background is essentially to say: you may
>>>> find clear skies tomorrow, but it’s not just some setup or use issue
>>>> on your end, and while a single fix might set you up, keep in mind
>>>> that a single fix may be what you’re waiting for only by chance of
>>>> your situation, so be proactive.
>>>>
>>>> MRM
>>>>
>>>> On Wed, May 19, 2021 at 8:48 AM Ding, Lehong (external - Project)
>>>> <lehong.d...@sap.com.invalid> wrote:
>>>>
>>>>> *Background:*
>>>>>
>>>>> Before moving from Solr 7.7.2 to 8.8.1, we ran some performance
>>>>> tests on Solr 8.8.1. We hit a lot of concurrent update errors in the
>>>>> Solr log.
>>>>>
>>>>> *Environment:*
>>>>>
>>>>> SolrCloud with 3 cluster nodes and 500 collections; 5 of the
>>>>> collections have about 10m documents each.
>>>>>
>>>>> (1 shard, 3 replicas)
>>>>>
>>>>> *Threads:*
>>>>>
>>>>> 30 update/add threads + 10 deleteByQuery threads
>>>>>
>>>>> *Results:*
>>>>>
>>>>> While the deleteByQuery threads are running, only one node (the
>>>>> leader) has update transactions; the other two nodes have none.
>>>>>
>>>>> *Errors:*
>>>>>
>>>>> java.io.IOException: Request processing has stalled for 20091ms
>>>>> with 100 remaining elements in the queue.
>>>>> at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449)
>>>>> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>>>>> at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345)
>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338)
>>>>> at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244)
>>>>> at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300)
>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:230)
>>>>> at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
>>>>> at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
>>>>> at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
>>>>> at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
>>>>> at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
>>>>> at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
>>>>> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
>>>>> at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)
>>>>>
>>>>>
>>>>>
>>>>> *Temporary Solution:*
>>>>>
>>>>> Adding -Dsolr.http1=1 to the Solr start parameters.
>>>>>
>>>>> There are still some errors in the error log, but the number is much
>>>>> smaller.
>>>>>
>>>>>
>>>>>
>>>>> *My Questions:*
>>>>>
>>>>> 1. We found the Solr cluster will eventually get the data
>>>>> consistent. What is mainly impacted by the concurrent update errors?
>>>>>
>>>>> 2. Adding -Dsolr.http1=1 to the Solr start parameters reduces the
>>>>> error count. Do we really need to add this parameter? And will this
>>>>> parameter be kept in later versions?
>>>>>
>>>>> Many Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Lehong Ding
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller
