When all the indexing threads are occupied with merging, incoming updates block
until at least one thread frees up, IIUC.
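
If you want to poke at this, the knobs live on the merge scheduler (Solr's
<mergeScheduler> block under <indexConfig> in solrconfig.xml). Here's a rough
Lucene-level sketch of what sits behind it; the values are purely illustrative,
not a recommendation:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

// Sketch only, illustrative values. maxMergeCount is how many merges may be
// pending before incoming indexing threads are stalled; maxThreadCount is how
// many merges actually run at once. The same scheduler also does the automatic
// IO throttling (MergeRateLimiter) that shows up in the stack traces below.
public class MergeSchedulerSketch {
    public static void main(String[] args) {
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        cms.setMaxMergesAndThreads(/* maxMergeCount */ 6, /* maxThreadCount */ 3);
        // cms.disableAutoIOThrottle(); // only if you're sure IO isn't the constraint
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergeScheduler(cms);
    }
}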

The fact that you're not opening searchers doesn't matter as far as merging is
concerned; merging happens on hard commits regardless.

Bumping your RAM buffer up to 2G is usually unnecessary; we rarely see
increased throughput above about 100M. But again, I doubt that's the root
cause.

Unless AWS is throttling your I/O......

So this is puzzling to me; I don't have any more fresh ideas.

Best,
Erick

On Thu, Apr 19, 2018 at 10:48 AM, Denis Demichev <demic...@gmail.com> wrote:
> Mikhail,
>
> I see what you're saying. Thank you for the clarification.
> Yes, there isn't a single line in the client code that issues a commit.
> The only thing I do is solr.add(collectionName, dataToSend), where solr is a
> SolrClient.
> Autocommits are set up on the server side for 2 minutes and they do not open
> a searcher.
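>
> In case it helps, the submit path is essentially just the following
> (simplified; the ZK hosts, collection name and field names are placeholders):
>
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.impl.CloudSolrClient;
> import org.apache.solr.common.SolrInputDocument;
>
> // Simplified sketch of the indexing client. There is no commit() anywhere;
> // we rely entirely on the server-side autoCommit (2 minutes, openSearcher=false).
> public class IndexerSketch {
>     public static void main(String[] args) throws Exception {
>         SolrClient solr = new CloudSolrClient.Builder()
>                 .withZkHost("zk1:2181,zk2:2181,zk3:2181")  // placeholder ZK ensemble
>                 .build();
>         List<SolrInputDocument> dataToSend = new ArrayList<>();
>         for (int i = 0; i < 10; i++) {                     // batches of 10 documents
>             SolrInputDocument doc = new SolrInputDocument();
>             doc.addField("id", "doc-" + i);
>             doc.addField("text_field_1", "value " + i);    // ~200 dynamic text fields in reality
>             dataToSend.add(doc);
>         }
>         solr.add("collectionName", dataToSend);            // add only, no explicit commit
>         solr.close();
>     }
> }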
>
> What is interesting: if I look at CPU utilization by thread, I see the picture
> attached. So at least 3 different threads are executing merges more or less
> concurrently.
>
> If I am not mistaken, based on the code I see, this stack trace tells me that
> I have 7 merge threads running and that they throttle themselves to limit CPU
> consumption...
> I still don't fully understand how this impacts the acceptor's behavior and
> its ability to handle "add" requests. I don't see the logical connection yet.
>
>
> Regards,
> Denis
>
>
> On Thu, Apr 19, 2018 at 1:37 PM Mikhail Khludnev <m...@apache.org> wrote:
>>
>> Threads are hanging on merge IO throttling:
>>
>>         at
>> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>>         at
>> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>>         at
>> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>>         at
>> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>
>> It seems odd. Please confirm that you don't commit on every update
>> request.
>> The only way to monitor IO throttling is to enable infoStream and read a
>> lot of logs.
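>>
>> (In Solr that's the <infoStream>true</infoStream> flag under <indexConfig>
>> in solrconfig.xml. At the Lucene level it corresponds roughly to the sketch
>> below, which is purely illustrative, not something you add to Solr:)
>>
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.index.IndexWriterConfig;
>> import org.apache.lucene.util.PrintStreamInfoStream;
>>
>> // Illustrative only: this is what Solr turns on internally. The infostream
>> // logs IndexWriter flush/merge activity, including rate-limiter pauses.
>> public class InfoStreamSketch {
>>     public static void main(String[] args) {
>>         IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
>>         iwc.setInfoStream(new PrintStreamInfoStream(System.out));
>>     }
>> }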
>>
>>
>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com>
>> wrote:
>>
>> > Erick,
>> >
>> > Thank you for your quick response.
>> >
>> > I/O bottleneck: Please see another screenshot attached; as you can see,
>> > disk r/w activity is low to insignificant.
>> > iostat ==========
>> > Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> > xvda       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>> >
>> > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>> >           12.52   0.00     0.00     0.00    0.00  87.48
>> >
>> > Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> > xvda       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>> >
>> > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>> >           12.51   0.00     0.00     0.00    0.00  87.49
>> >
>> > Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> > xvda       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>> > ==========================
>> >
>> > Merging threads: I don't see any modifications of the merge policy compared
>> > to the default solrconfig.
>> > Index config: <ramBufferSizeMB>2000</ramBufferSizeMB>
>> > <maxBufferedDocs>500000</maxBufferedDocs>
>> > Update handler: <updateHandler class="solr.DirectUpdateHandler2">
>> > Could you please help me understand how I can validate this theory?
>> > Another note here: even if I remove the stress from the cluster, I still see
>> > the merge threads consuming CPU for some time. It may take hours, and if I
>> > put the stress back, nothing changes.
>> > If this were just an overloaded merge process, it should take some time to
>> > work through the queue and then start accepting new indexing requests again.
>> > Maybe I am wrong, but I need some help to understand how to check this.
>> >
>> > AWS - Sorry, I don't have any physical hardware to replicate this test
>> > locally
>> >
>> > GC - I monitored GC closely. If you take a look at the CPU utilization
>> > screenshot, you will see a blue graph that shows GC consumption. In addition,
>> > I am using the Visual GC plugin from VisualVM to understand how GC performs
>> > under the stress, and I don't see any anomalies.
>> > There are GC pauses from time to time, but those are not significant. The
>> > heap utilization graph tells me that GC is not struggling much.
>> >
>> > Thank you again for your comments; I hope the information above helps you
>> > understand the problem.
>> >
>> >
>> > Regards,
>> > Denis
>> >
>> >
>> > On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson
>> > <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Have you changed any of the merge policy parameters? I doubt it but
>> >> just
>> >> asking.
>> >>
>> >> My guess: your I/O is your bottleneck. There are a limited number of
>> >> threads (tunable) that are used for background merging. When they're
>> >> all busy, incoming updates are queued up. This squares with your
>> >> statement that queries are fine and CPU activity is moderate.
>> >>
>> >> A quick test there would be to try this on a non-AWS setup if you have
>> >> some hardware you can repurpose.
>> >>
>> >> An 80G heap is a red flag; most of the time that's too large by far.
>> >> So one thing I'd do is hook up some GC monitoring: you may be spending
>> >> a horrible amount of time in GC cycles.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com>
>> >> wrote:
>> >> >
>> >> > All,
>> >> >
>> >> > I would like to request some assistance with the situation described
>> >> > below. My SolrCloud cluster accepts update requests at a very low pace,
>> >> > making it impossible to index new documents.
>> >> >
>> >> > Cluster Setup:
>> >> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>> >> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical memory,
>> >> > 80GB Java Heap space, AWS
>> >> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment (build
>> >> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed
>> >> > mode)
>> >> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>> >> >
>> >> > Symptoms:
>> >> > 1. 4 client instances, running 4 threads each, use the SolrJ client to
>> >> > submit documents to SolrCloud for indexing and do not perform any manual
>> >> > commits. Each batch is 10 documents, containing ~200 text fields per
>> >> > document.
>> >> > 2. After some time (~20-30 minutes; by then I see only ~50-60K documents
>> >> > in the collection, and node restarts do not help) I notice that clients
>> >> > can no longer submit new documents to the cluster for indexing; each
>> >> > operation takes an enormous amount of time.
>> >> > 3. The cluster is not loaded at all: CPU consumption is moderate (although
>> >> > I see merging running all the time) and memory consumption is adequate,
>> >> > but updates from external clients are still not accepted.
>> >> > 4. Search requests are handled fine.
>> >> > 5. I don't see any significant activity anywhere in the SolrCloud logs,
>> >> > just regular replication attempts. No errors.
>> >> >
>> >> >
>> >> > Additional information
>> >> > 1. Please see the thread dump attached.
>> >> > 2. Please see the SolrAdmin info with physical memory and file descriptor
>> >> > utilization.
>> >> > 3. Please see the VisualVM screenshots with CPU and memory utilization and
>> >> > CPU profiling data. Physical memory utilization is about 60-70 percent all
>> >> > the time.
>> >> > 4. The schema contains ~10 permanent fields, 5 of which are mapped,
>> >> > mandatory and persisted; the rest of the fields are optional and dynamic.
>> >> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher set to
>> >> > false.
>> >> > 6. Caches are set up with autoWarmCount = 0.
>> >> > 7. GC was fine-tuned and I don't see any significant CPU utilization by GC
>> >> > or any lengthy pauses. The majority of the garbage is collected in the
>> >> > young-gen space.
>> >> >
>> >> > My primary question: I see that the cluster is alive and performs some
>> >> > merging and commits, but it does not accept new documents for indexing.
>> >> > What is causing this slowdown, and why does it not accept new submissions?
>> >> >
>> >> >
>> >> > Regards,
>> >> > Denis
>> >>
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
