Denis, can you enable infoStream (https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#IndexConfiginSolrConfig-OtherIndexingSettings) and examine the logs for throttling? And what happens if you try without auto-commit?
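For reference, a minimal sketch of turning it on in solrconfig.xml, per the page above (the flag enables Lucene's InfoStream diagnostics, which go to the Solr log):

    <indexConfig>
      <!-- emit detailed Lucene indexing/merging diagnostics to the log -->
      <infoStream>true</infoStream>
    </indexConfig>

The output is verbose; the merge scheduler and rate limiter messages are the ones to look for here.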
On Tue, Apr 24, 2018 at 12:37 AM, Denis Demichev <demic...@gmail.com> wrote:

> I conducted another experiment today with local SSD drives, but this did
> not seem to fix my problem. I don't see any extensive I/O in this case:
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read     kB_wrtn
> xvda              1.76        88.83         5.52    1256191       77996
> xvdb             13.95       111.30     56663.93    1573961   801303364
>
> xvdb is the device where SolrCloud is installed and the data files are
> kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." threads running. Some of them
>   are blocked, some of them are RUNNING.
> - The updateExecutor-N-thread-M threads are parked, and the number of
>   docs I am able to submit is still low.
> - I tried setting maxIndexingThreads to something high. This seems to
>   prolong the time the cluster accepts new indexing requests and keeps
>   CPU utilization a lot higher while the cluster is merging indexes.
>
> Could anyone please point me in the right direction (documentation or
> Java classes) to read about how data is passed from the updateExecutor
> thread pool to the merge threads? I assume there is some internal
> blocking queue or something similar.
> I still cannot wrap my head around how Solr blocks incoming connections.
> Non-merged segments are not kept in memory, so I don't clearly understand
> why Solr cannot keep writing index files to disk while other threads are
> merging (since merging is a continuous process anyway).
>
> Does anyone use the SPM monitoring tool for this type of problem? Is it
> of any use at all?
>
> Thank you in advance.
>
> [image: image.png]
>
> Regards,
> Denis
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:
>
>> Mikhail,
>>
>> Sure, I will keep everyone posted. Moving to a non-HVM instance may take
>> some time, so hopefully I will be able to share my observations in the
>> next couple of days or so.
>> Thanks again for all the help.
>>
>> Regards,
>> Denis
>>
>> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
>>
>>> Denis, please let me know what it ends up with. I'm really curious
>>> about this case and the AWS instance flavours. FWIW, since 7.4 we'll
>>> have an ioThrottle=false option.
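>>>
>>> A sketch of how that could look in solrconfig.xml once 7.4 lands, based
>>> on SOLR-11200 (untested; the knob does not exist in your 7.2.1):
>>>
>>>     <indexConfig>
>>>       <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>>         <!-- 7.4+ only (SOLR-11200): disable adaptive merge IO throttling -->
>>>         <bool name="ioThrottle">false</bool>
>>>       </mergeScheduler>
>>>     </indexConfig>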
>>>
>>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com> wrote:
>>>
>>>> Mikhail, Erick,
>>>>
>>>> Thank you.
>>>>
>>>> What just occurred to me: we don't use local SSDs, we're using EBS
>>>> volumes instead. I was looking at the wrong instance type.
>>>> I will try to set up a cluster with SSD nodes and retest.
>>>>
>>>> Regards,
>>>> Denis
>>>>
>>>> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org> wrote:
>>>>
>>>>> I'm not sure it's the right context, but here is a case where someone
>>>>> reports a really low throttle boundary:
>>>>> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>>>>>
>>>>> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org> wrote:
>>>>>
>>>>>> Threads are hanging on merge IO throttling:
>>>>>>
>>>>>>     at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>>>>>>     at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>>>>>>     at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>>>>>>     at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>>>>>
>>>>>> It seems odd. Please confirm that you don't commit on every update
>>>>>> request.
>>>>>> The only way to monitor IO throttling is to enable infoStream and
>>>>>> read a lot of logs.
>>>>>>
>>>>>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:
>>>>>>
>>>>>>> Erick,
>>>>>>>
>>>>>>> Thank you for your quick response.
>>>>>>>
>>>>>>> I/O bottleneck: please see another screenshot attached; as you can
>>>>>>> see, disk r/w operations are pretty low or not significant.
>>>>>>>
>>>>>>> iostat ==========
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>>>>>>>           12.52   0.00    0.00    0.00   0.00  87.48
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>>>>>>>           12.51   0.00    0.00    0.00   0.00  87.49
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> ==========================
>>>>>>>
>>>>>>> Merging threads: I don't see any modifications of the merge policy
>>>>>>> compared to the default solrconfig.
>>>>>>> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB>
>>>>>>> <maxBufferedDocs>500000</maxBufferedDocs>
>>>>>>> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>> Could you please help me understand how I can validate this theory?
>>>>>>>
>>>>>>> Another note here: even if I remove the stress from the cluster, I
>>>>>>> still see a merging thread consuming CPU for some time. It may take
>>>>>>> hours, and if I try to return the stress, nothing changes.
>>>>>>> If this were an overloaded merging process, it should take some time
>>>>>>> to work through the queue and then start accepting new indexing
>>>>>>> requests again.
>>>>>>> Maybe I am wrong, but I need some help to understand how to check it.
>>>>>>>
>>>>>>> AWS: sorry, I don't have any physical hardware to replicate this
>>>>>>> test locally.
>>>>>>>
>>>>>>> GC: I monitored GC closely. If you look at the CPU utilization
>>>>>>> screenshot you will see a blue graph, which is GC consumption. In
>>>>>>> addition, I am using the Visual GC plugin from VisualVM to understand
>>>>>>> how GC performs under stress, and I don't see any anomalies.
>>>>>>> There are several GC pauses from time to time, but those are not
>>>>>>> significant. The heap utilization graph tells me that GC is not
>>>>>>> struggling.
>>>>>>>
>>>>>>> Thank you again for your comments; hope the information above helps
>>>>>>> you understand the problem.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you changed any of the merge policy parameters? I doubt it,
>>>>>>>> but just asking.
>>>>>>>>
>>>>>>>> My guess: your I/O is your bottleneck. There are a limited number
>>>>>>>> of threads (tunable) that are used for background merging. When
>>>>>>>> they're all busy, incoming updates are queued up. This squares with
>>>>>>>> your statement that queries are fine and CPU activity is moderate.
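>>>>>>>>
>>>>>>>> If you want to experiment with that, the scheduler is configured in
>>>>>>>> the <indexConfig> section of solrconfig.xml, roughly like this (the
>>>>>>>> numbers are illustrative starting points, not recommendations):
>>>>>>>>
>>>>>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>>>>>>>       <!-- merges allowed to queue up before incoming writer threads stall -->
>>>>>>>>       <int name="maxMergeCount">6</int>
>>>>>>>>       <!-- merge threads actually running at once; must be <= maxMergeCount -->
>>>>>>>>       <int name="maxThreadCount">4</int>
>>>>>>>>     </mergeScheduler>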
>>>>>>>>
>>>>>>>> A quick test there would be to try this on a non-AWS setup if you
>>>>>>>> have some hardware you can repurpose.
>>>>>>>>
>>>>>>>> An 80G heap is a red flag. Most of the time that's too large by
>>>>>>>> far. So one thing I'd do is hook up some GC monitoring; you may be
>>>>>>>> spending a horrible amount of time in GC cycles.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > All,
>>>>>>>> >
>>>>>>>> > I would like to request some assistance with the situation
>>>>>>>> > described below. My SolrCloud cluster accepts update requests at
>>>>>>>> > such a low pace that it is impossible to index new documents.
>>>>>>>> >
>>>>>>>> > Cluster setup:
>>>>>>>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>>>>>>>> > Cluster - SolrCloud 7.2.1, 10 r4.4xlarge instances, 120GB physical
>>>>>>>> > memory, 80GB Java heap space, AWS
>>>>>>>> > Java - openjdk version "1.8.0_161", OpenJDK Runtime Environment
>>>>>>>> > (build 1.8.0_161-b14), OpenJDK 64-Bit Server VM (build 25.161-b14,
>>>>>>>> > mixed mode)
>>>>>>>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>>>>>>>> >
>>>>>>>> > Symptoms:
>>>>>>>> > 1. 4 instances running 4 threads each use the SolrJ client to
>>>>>>>> > submit documents to SolrCloud for indexing and do not perform any
>>>>>>>> > manual commits. Each batch is 10 documents, with ~200 text fields
>>>>>>>> > per document.
>>>>>>>> > 2. After some time (~20-30 minutes; by then I see only ~50-60K
>>>>>>>> > documents in the collection, and node restarts do not help) the
>>>>>>>> > clients cannot submit new documents to the cluster for indexing
>>>>>>>> > anymore; each operation takes an enormous amount of time.
>>>>>>>> > 3. The cluster is not loaded at all: CPU consumption is moderate
>>>>>>>> > (although I see that merging is performed all the time) and memory
>>>>>>>> > consumption is adequate, but updates are still not accepted from
>>>>>>>> > external clients.
>>>>>>>> > 4. Search requests are handled fine.
>>>>>>>> > 5. I don't see any significant activity in the SolrCloud logs
>>>>>>>> > anywhere, just regular replication attempts. No errors.
>>>>>>>> >
>>>>>>>> > Additional information:
>>>>>>>> > 1. Please see the thread dump attached.
>>>>>>>> > 2. Please see the SolrAdmin info with physical memory and file
>>>>>>>> > descriptor utilization.
>>>>>>>> > 3. Please see the VisualVM screenshots with CPU and memory
>>>>>>>> > utilization and CPU profiling data. Physical memory utilization is
>>>>>>>> > about 60-70 percent all the time.
>>>>>>>> > 4. The schema file contains ~10 permanent fields, 5 of which are
>>>>>>>> > mapped, mandatory, and persisted; the rest of the fields are
>>>>>>>> > optional and dynamic.
>>>>>>>> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher
>>>>>>>> > set to false (see the sketch after this list).
>>>>>>>> > 6. Caches are set up with autoWarmCount = 0.
>>>>>>>> > 7. GC was fine-tuned and I don't see any significant CPU
>>>>>>>> > utilization by GC or any lengthy pauses. The majority of the
>>>>>>>> > garbage is collected in the young-gen space.
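>>>>>>>> >
>>>>>>>> > For item 5, a sketch of the relevant solrconfig.xml block as just
>>>>>>>> > described (values mirror the description above):
>>>>>>>> >
>>>>>>>> >     <autoCommit>
>>>>>>>> >       <!-- hard commit every 2 minutes (120000 ms) -->
>>>>>>>> >       <maxTime>120000</maxTime>
>>>>>>>> >       <!-- don't open a new searcher on these commits -->
>>>>>>>> >       <openSearcher>false</openSearcher>
>>>>>>>> >     </autoCommit>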
>>>>>>>> >
>>>>>>>> > My primary question: I see that the cluster is alive and performs
>>>>>>>> > some merging and commits, but it does not accept new documents for
>>>>>>>> > indexing. What is causing this slowdown, and why does it not
>>>>>>>> > accept new submissions?
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Denis
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>
--
Sincerely yours
Mikhail Khludnev