Erick,

Thank you for your quick response.

I/O bottleneck: Please see another screenshot attached; as you can see, disk
read/write activity is very low, essentially negligible.
iostat==========
Device:  rrqm/s  wrqm/s    r/s    w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvda       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.52    0.00    0.00    0.00    0.00   87.48

Device:  rrqm/s  wrqm/s    r/s    w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvda       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.51    0.00    0.00    0.00    0.00   87.49

Device:  rrqm/s  wrqm/s    r/s    w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvda       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
==========================
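For context, the block above shows consecutive snapshots of iostat's extended
device statistics. They were collected with something along the lines of:

iostat -x 5

The exact interval is not important here; the point is that every sample shows
the disk essentially idle.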

Merging threads: I don't see any modifications to the merge policy compared to
the default solrconfig.
Index config:
<ramBufferSizeMB>2000</ramBufferSizeMB>
<maxBufferedDocs>500000</maxBufferedDocs>
Update handler: <updateHandler class="solr.DirectUpdateHandler2">
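For reference, I have not added anything like the following to solrconfig.xml.
This is only an illustration of the merge knobs I believe you are referring to,
with placeholder values rather than my actual settings:

<indexConfig>
  <!-- merge policy: how segments are selected for merging -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>
  <!-- merge scheduler: caps the number of concurrent background merge threads;
       when all of them are busy, incoming updates back up -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">3</int>
  </mergeScheduler>
</indexConfig>

If these are the right knobs to look at, I am happy to experiment with them.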
Could you please help me understand how I can validate this theory?
Another note here: even if I remove the load from the cluster, I still see the
merging thread consuming CPU for some time. This can go on for hours, and if I
put the load back on, nothing changes.
If this were simply an overloaded merging process, I would expect it to work
through its queue after a while and then start accepting new indexing requests
again. Maybe I am wrong, but I need some help understanding how to check this.

AWS - Sorry, I don't have any physical hardware to replicate this test
locally.

GC - I monitored GC closely. If you take a look at the CPU utilization
screenshot, you will see a blue graph, which is GC consumption. In addition,
I am using the Visual GC plugin for VisualVM to understand how GC performs
under load, and I don't see any anomalies.
There are occasional GC pauses, but they are not significant. The heap
utilization graph tells me that GC is not struggling.
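If it would help, I can also sample the collectors from the shell in parallel
with VisualVM, e.g.:

jstat -gcutil <solr pid> 5000

(one line of per-generation occupancy plus cumulative GC counts and times every
5 seconds), and attach that output as well.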

Thank you again for your comments; I hope the information above helps you
understand the problem.


Regards,
Denis


On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Have you changed any of the merge policy parameters? I doubt it but just
> asking.
>
> My guess: your I/O is your bottleneck. There are a limited number of
> threads (tunable) that are used for background merging. When they're
> all busy, incoming updates are queued up. This squares with your
> statement that queries are fine and CPU activity is moderate.
>
> A quick test there would be to try this on a non-AWS setup if you have
> some hardware you can repurpose.
>
> an 80G heap is a red flag. Most of the time that's too large by far.
> So one thing I'd do is hook up some GC monitoring, you may be spending
> a horrible amount of time in GC cycles.
>
> Best,
> Erick
>
> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com>
> wrote:
> >
> > All,
> >
> > I would like to request some assistance with a situation described
> below. My
> > SolrCloud cluster accepts the update requests at a very low pace making
> it
> > impossible to index new documents.
> >
> > Cluster Setup:
> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical
> memory,
> > 80GB Java Heap space, AWS
> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment (build
> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
> >
> > Symptoms:
> > 1. 4 instances running 4 threads each are using SolrJ client to submit
> > documents to SolrCloud for indexing, do not perform any manual commits.
> Each
> > document  batch is 10 documents big, containing ~200 text fields per
> > document.
> > 2. After some time (~20-30 minutes, by that time I see only ~50-60K of
> > documents in a collection, node restarts do not help) I notice that
> clients
> > cannot submit new documents to the cluster for indexing anymore, each
> > operation takes enormous amount of time
> > 3. Cluster is not loaded at all, CPU consumption is moderate (I am seeing
> > that merging is performed all the time though), memory consumption is
> > adequate, but still updates are not accepted from external clients
> > 4. Search requests are handled fine
> > 5. I don't see any significant activity in SolrCloud logs anywhere, just
> > regular replication attempts only. No errors.
> >
> >
> > Additional information
> > 1. Please see Thread Dump attached.
> > 2. Please see SolrAdmin info with physical memory and file descriptor
> > utilization
> > 3. Please see VisualVM screenshots with CPU and memory utilization and
> CPU
> > profiling data. Physical memory utilization is about 60-70 percent all
> the
> > time.
> > 4. Schema file contains ~10 permanent fields 5 of which are mapped and
> > mandatory and persisted, the rest of the fields are optional and dynamic
> > 5. Solr config configures autoCommit to be set to 2 minutes and
> openSearcher
> > set to false
> > 6. Caches are set up with autoWarmCount = 0
> > 7. GC was fine tuned and I don't see any significant CPU utilization by
> GC
> > or any lengthy pauses. Majority of the garbage is collected in young gen
> > space.
> >
> > My primary question: I see that the cluster is alive and performs some
> > merging and commits but does not accept new documents for indexing. What
> is
> > causing this slowdown and why it does not accept new submissions?
> >
> >
> > Regards,
> > Denis
>
