Threads are hanging on merge I/O throttling:

  at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
  at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
  at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
  at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)

It seems odd. Please confirm that you don't commit on every update request.
The only way to monitor I/O throttling is to enable infoStream and read a
lot of logs.
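Something along these lines in solrconfig.xml (under <indexConfig>) should
switch it on; a minimal sketch, not taken from this cluster's actual config:

  <indexConfig>
    <!-- Route Lucene IndexWriter diagnostics (flushes, merges, throttling)
         into the Solr log -->
    <infoStream>true</infoStream>
  </indexConfig>

Once enabled, the merge scheduler messages in solr.log should show the write
rate each merge is being throttled to.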
On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:

> Erick,
>
> Thank you for your quick response.
>
> I/O bottleneck: Please see another screenshot attached; as you can see,
> disk r/w operations are pretty low or not significant.
>
> iostat
> ==========================
> Device: rrqm/s wrqm/s  r/s  w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> xvda      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
>
> avg-cpu: %user %nice %system %iowait %steal %idle
>          12.52  0.00    0.00    0.00   0.00 87.48
>
> Device: rrqm/s wrqm/s  r/s  w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> xvda      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
>
> avg-cpu: %user %nice %system %iowait %steal %idle
>          12.51  0.00    0.00    0.00   0.00 87.49
>
> Device: rrqm/s wrqm/s  r/s  w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> xvda      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
> ==========================
>
> Merging threads: I don't see any modifications of the merge policy
> compared to the default solrconfig.
> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB><maxBufferedDocs>500000</maxBufferedDocs>
> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
> Could you please help me understand how I can validate this theory?
>
> Another note here. Even if I remove the stress from the cluster I still
> see that the merging thread is consuming CPU for some time. It may take
> hours, and if I try to put the stress back nothing changes.
> If this were an overloaded merging process, it should take some time to
> reduce the queue length and then start accepting new indexing requests
> again. Maybe I am wrong, but I need some help to understand how to check it.
>
> AWS - Sorry, I don't have any physical hardware to replicate this test
> locally.
>
> GC - I monitored GC closely. If you take a look at the CPU utilization
> screenshot you will see a blue graph, which is GC consumption. In
> addition, I am using the Visual GC plugin from VisualVM to understand how
> GC performs under the stress and don't see any anomalies.
> There are several GC pauses from time to time, but those are not
> significant. The heap utilization graph tells me that GC is not struggling
> a lot.
>
> Thank you again for your comments; I hope the information above will help
> you understand the problem.
>
> Regards,
> Denis
>
> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Have you changed any of the merge policy parameters? I doubt it, but
>> just asking.
>>
>> My guess: your I/O is your bottleneck. There are a limited number of
>> threads (tunable) that are used for background merging. When they're all
>> busy, incoming updates are queued up. This squares with your statement
>> that queries are fine and CPU activity is moderate.
>>
>> A quick test there would be to try this on a non-AWS setup if you have
>> some hardware you can repurpose.
>>
>> An 80G heap is a red flag. Most of the time that's too large by far.
>> So one thing I'd do is hook up some GC monitoring; you may be spending a
>> horrible amount of time in GC cycles.
>>
>> Best,
>> Erick
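On the "limited number of threads (tunable)" point: the merge concurrency
caps and the merge policy both live under <indexConfig> in solrconfig.xml.
A rough sketch with illustrative values, not the settings actually used on
this cluster:

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- pending merges allowed to queue up before indexing threads are stalled -->
      <int name="maxMergeCount">6</int>
      <!-- merges allowed to run concurrently -->
      <int name="maxThreadCount">3</int>
    </mergeScheduler>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicyFactory>
  </indexConfig>

If neither element is present, Lucene's defaults apply, which matches the
"no modifications of the merge policy" statement above.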
>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com>
>> wrote:
>> >
>> > All,
>> >
>> > I would like to request some assistance with the situation described
>> > below. My SolrCloud cluster accepts update requests at a very low pace,
>> > making it impossible to index new documents.
>> >
>> > Cluster Setup:
>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical
>> > memory, 80GB Java heap, AWS
>> > Java - openjdk version "1.8.0_161", OpenJDK Runtime Environment (build
>> > 1.8.0_161-b14), OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>> >
>> > Symptoms:
>> > 1. 4 instances running 4 threads each use the SolrJ client to submit
>> > documents to SolrCloud for indexing and do not perform any manual
>> > commits. Each document batch is 10 documents big, containing ~200 text
>> > fields per document.
>> > 2. After some time (~20-30 minutes; by then I see only ~50-60K documents
>> > in the collection, and node restarts do not help) the clients can no
>> > longer submit new documents to the cluster for indexing; each operation
>> > takes an enormous amount of time.
>> > 3. The cluster is not loaded at all: CPU consumption is moderate (though
>> > I see that merging is performed all the time) and memory consumption is
>> > adequate, but updates are still not accepted from external clients.
>> > 4. Search requests are handled fine.
>> > 5. I don't see any significant activity in the SolrCloud logs anywhere,
>> > just regular replication attempts. No errors.
>> >
>> > Additional information:
>> > 1. Please see the thread dump attached.
>> > 2. Please see the SolrAdmin info with physical memory and file
>> > descriptor utilization.
>> > 3. Please see the VisualVM screenshots with CPU and memory utilization
>> > and CPU profiling data. Physical memory utilization is about 60-70
>> > percent all the time.
>> > 4. The schema file contains ~10 permanent fields, 5 of which are mapped,
>> > mandatory, and persisted; the rest of the fields are optional and
>> > dynamic.
>> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher set
>> > to false.
>> > 6. Caches are set up with autoWarmCount = 0.
>> > 7. GC was fine-tuned and I don't see any significant CPU utilization by
>> > GC or any lengthy pauses. The majority of the garbage is collected in
>> > the young-gen space.
>> >
>> > My primary question: I see that the cluster is alive and performs some
>> > merging and commits, but it does not accept new documents for indexing.
>> > What is causing this slowdown, and why does it not accept new
>> > submissions?
>> >
>> > Regards,
>> > Denis

--
Sincerely yours
Mikhail Khludnev
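For reference, the autoCommit setup described in the original message (a
hard commit every two minutes without opening a new searcher) corresponds
roughly to the following in solrconfig.xml; the values are inferred from the
description above, not copied from the actual config:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit (flush and fsync of segments) at most every 2 minutes -->
      <maxTime>120000</maxTime>
      <!-- do not open a new searcher on these commits -->
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

With openSearcher=false the hard commits only make the index durable; a
separate soft commit (or an explicit client commit) controls when newly
indexed documents become visible to searches.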