Threads are hanging on merge IO throttling:

        at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
        at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
        at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
        at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)

It seems odd. Please confirm that you don't commit on every update request.
The only way to monitor IO throttling is to enable InfoStream and read
through a lot of logs.
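
For reference, InfoStream can be enabled in solrconfig.xml; a minimal
sketch (the element goes inside <indexConfig>; double-check against your
version's reference guide):

  <indexConfig>
    <!-- dump Lucene's low-level merge/flush diagnostics into the Solr log -->
    <infoStream>true</infoStream>
  </indexConfig>

It is very verbose, but merge activity (including throttling pauses)
should show up there.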


On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:

> Erick,
>
> Thank you for your quick response.
>
> I/O bottleneck: Please see another screenshot attached; as you can see,
> disk r/w operations are very low or negligible.
> iostat ==========================
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.52    0.00    0.00    0.00    0.00   87.48
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.51    0.00    0.00    0.00    0.00   87.49
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> ==========================
>
> Merging threads: I don't see any modifications to the merge policy
> compared to the default solrconfig.
> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB>
> <maxBufferedDocs>500000</maxBufferedDocs>
> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
> Could you please help me understand how I can validate this theory?
> One more note: even if I remove the stress from the cluster, I still see
> the merging thread consuming CPU for some time. This can go on for hours,
> and if I put the stress back nothing changes.
> If the merge process were simply overloaded, it should drain its queue
> after a while and start accepting new indexing requests again.
> Maybe I am wrong, but I need some help to understand how to check this.
>
> AWS - Sorry, I don't have any physical hardware to replicate this test
> locally
>
> GC - I monitored GC closely. If you take a look at the CPU utilization
> screenshot you will see a blue graph showing GC consumption. In addition,
> I am using the Visual GC plugin for VisualVM to understand how GC
> performs under stress, and I don't see any anomalies.
> There are occasional GC pauses, but they are not significant. The heap
> utilization graph tells me that GC is not struggling.
>
> Thank you again for your comments; I hope the information above helps you
> understand the problem.
>
>
> Regards,
> Denis
>
>
> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Have you changed any of the merge policy parameters? I doubt it but just
>> asking.
>>
>> My guess: your I/O is your bottleneck. There are a limited number of
>> threads (tunable) that are used for background merging. When they're
>> all busy, incoming updates are queued up. This squares with your
>> statement that queries are fine and CPU activity is moderate.
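>>
>> If you want to check or experiment, those merge threads are controlled
>> by the merge scheduler inside <indexConfig> in solrconfig.xml. An
>> illustrative sketch (the numbers are examples, not recommendations):
>>
>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>     <!-- merges that may be queued/running before incoming updates stall -->
>>     <int name="maxMergeCount">6</int>
>>     <!-- merges actually executing concurrently -->
>>     <int name="maxThreadCount">3</int>
>>   </mergeScheduler>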
>>
>> A quick test there would be to try this on a non-AWS setup if you have
>> some hardware you can repurpose.
>>
>> An 80G heap is a red flag; most of the time that's far too large.
>> So one thing I'd do is hook up some GC monitoring, as you may be
>> spending a horrible amount of time in GC cycles.
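>>
>> For example, on Java 8 something along these lines logs GC activity
>> (the path and sizes are illustrative):
>>
>>   -Xloggc:/var/solr/logs/gc.log
>>   -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>>   -XX:+PrintGCApplicationStoppedTime
>>   -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9
>>   -XX:GCLogFileSize=20M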
>>
>> Best,
>> Erick
>>
>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com>
>> wrote:
>> >
>> > All,
>> >
>> > I would like to request some assistance with the situation described
>> > below. My SolrCloud cluster accepts update requests at a very low pace,
>> > making it impossible to index new documents.
>> >
>> > Cluster Setup:
>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>> > Cluster - SolrCloud 7.2.1, 10 r4.4xlarge instances on AWS, 120GB
>> > physical memory, 80GB Java heap space
>> > Java - openjdk version "1.8.0_161", OpenJDK Runtime Environment (build
>> > 1.8.0_161-b14), OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>> >
>> > Symptoms:
>> > 1. 4 client instances running 4 threads each use the SolrJ client to
>> > submit documents to SolrCloud for indexing and do not perform any
>> > manual commits. Each batch contains 10 documents, with ~200 text
>> > fields per document (see the sketch after this list).
>> > 2. After some time (~20-30 minutes; by then I see only ~50-60K
>> > documents in the collection, and node restarts do not help) the
>> > clients can no longer submit new documents to the cluster for
>> > indexing; each operation takes an enormous amount of time.
>> > 3. The cluster is not loaded at all: CPU consumption is moderate
>> > (although I can see that merging runs all the time) and memory
>> > consumption is adequate, yet updates from external clients are still
>> > not accepted.
>> > 4. Search requests are handled fine.
>> > 5. I don't see any significant activity in the SolrCloud logs, just
>> > regular replication attempts. No errors.
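>> >
>> > For reference, each client thread does roughly the following (a
>> > simplified, untested sketch; the collection name, ZooKeeper hosts, and
>> > field names are made up):
>> >
>> > import java.util.ArrayList;
>> > import java.util.List;
>> > import java.util.UUID;
>> > import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> > import org.apache.solr.common.SolrInputDocument;
>> >
>> > public class IndexingSketch {
>> >   public static void main(String[] args) throws Exception {
>> >     CloudSolrClient client = new CloudSolrClient.Builder()
>> >         .withZkHost("zk1:2181,zk2:2181,zk3:2181")
>> >         .build();
>> >     client.setDefaultCollection("collection1");
>> >     List<SolrInputDocument> batch = new ArrayList<>();
>> >     for (int i = 0; i < 10; i++) {           // 10 docs per batch
>> >       SolrInputDocument doc = new SolrInputDocument();
>> >       doc.addField("id", UUID.randomUUID().toString());
>> >       doc.addField("body_t", "...");         // stands in for ~200 dynamic text fields
>> >       batch.add(doc);
>> >     }
>> >     client.add(batch);  // no explicit commit; autoCommit handles it
>> >     client.close();
>> >   }
>> > }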
>> >
>> >
>> > Additional information:
>> > 1. Please see the thread dump attached.
>> > 2. Please see the SolrAdmin info with physical memory and file
>> > descriptor utilization.
>> > 3. Please see the VisualVM screenshots with CPU and memory utilization
>> > and CPU profiling data. Physical memory utilization is about 60-70
>> > percent all the time.
>> > 4. The schema contains ~10 permanent fields, 5 of which are mapped,
>> > mandatory, and persisted; the rest of the fields are optional and
>> > dynamic.
>> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher set
>> > to false (snippet after this list).
>> > 6. Caches are set up with autoWarmCount = 0.
>> > 7. GC was fine-tuned and I don't see any significant CPU utilization
>> > by GC or any lengthy pauses. The majority of garbage is collected in
>> > the young-gen space.
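>> >
>> > The autoCommit section looks roughly like this (sketch of the relevant
>> > part of solrconfig.xml; the rest is stock):
>> >
>> > <updateHandler class="solr.DirectUpdateHandler2">
>> >   <autoCommit>
>> >     <maxTime>120000</maxTime>            <!-- 2 minutes -->
>> >     <openSearcher>false</openSearcher>
>> >   </autoCommit>
>> > </updateHandler>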
>> >
>> > My primary question: I see that the cluster is alive and performs
>> > merging and commits, but it does not accept new documents for indexing.
>> > What is causing this slowdown, and why are new submissions not
>> > accepted?
>> >
>> >
>> > Regards,
>> > Denis
>>
>


-- 
Sincerely yours
Mikhail Khludnev
