Out of curiosity, do you have any suggesters configured? They tend to cause hard lock-ups if they need to rebuild. Not saying that's what's going on here, but I've definitely had issues with them on server restarts, freezing the entire thing until the rebuild was done.
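For reference, this is the kind of suggester config I mean -- a hypothetical solrconfig.xml fragment (the component name and field are made up for illustration). If buildOnStartup or buildOnCommit is true, a restart or commit can kick off a rebuild over the whole index that blocks searches until it finishes:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="field">title</str>
    <!-- these two are the usual culprits: with a large index the
         rebuild can pin the CPU and stall queries until it completes -->
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
```

With both set to false you have to trigger builds explicitly (suggest.build=true on the suggest request), so a restart can't surprise you with one.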
> On Jun 16, 2021, at 11:15 AM, Charlie Hubbard <[email protected]> wrote:
>
> Hi,
>
> I had an event recently where my Solr Server CPU spiked, and stopped
> answering incoming messages in a timely manner. I'm having a hard time
> pinpointing the cause. I'm reviewing the log files, but I can't explain
> why the CPU suddenly spiked. Once this happened all of the
> queries suddenly started to back up. There doesn't appear to be many
> queries coming in when this happened, but since queries are logged after
> they complete I think it's hard to know for sure.
>
> My setup is:
>
> - Single Node standalone server
> - OpenJDK 1.8
> - Solr 6.6.6
> - No replication configured
> - 4 collections ranging from 200mb - 20gb
> - 112 million docs in the collection in question
>
> Here is the section in the logs where the queries started to pile up. You
> see the QTimes start to go way up.
>
>> 2021-06-15 16:27:35.647 INFO (qtp1836643189-506) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=60736
>> 2021-06-15 16:27:35.647 INFO (qtp1836643189-483) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=132802
>> 2021-06-15 16:27:35.647 INFO (qtp1836643189-474) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=154802
>> 2021-06-15 16:27:35.647 INFO (qtp1836643189-467) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=121797
>> 2021-06-15 16:27:35.648 INFO (qtp1836643189-459) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=165804
>> 2021-06-15 16:27:35.648 INFO (qtp1836643189-487) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/admin/luke params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=110803
>> 2021-06-15 16:27:35.648 INFO (qtp1836643189-497) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/admin/luke params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=798
>> 2021-06-15 16:27:35.647 INFO (qtp1836643189-495) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=99798
>> 2021-06-15 16:27:35.650 INFO (qtp1836643189-448) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=176807
>> 2021-06-15 16:27:35.650 INFO (qtp1836643189-464) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/admin/luke params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=176809
>> 2021-06-15 16:27:35.650 INFO (qtp1836643189-479) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/admin/luke params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=99804
>> 2021-06-15 16:27:35.650 INFO (qtp1836643189-445) [ x:customer] o.a.s.c.S.Request [customer] webapp=/solr path=/replication params={wt=json&command=details&_=1623773588323} status=0 QTime=187795
>
> That burst of replication and luke handlers lasted for 1 second, but there
> are 155 entries of that over and over. Prior to this there isn't any
> evidence that I can see there were any queries happening that might have
> slowed things down. There was some updating happening on the smaller
> collection, but not on the collection in question.
>
> From the GC logs I did find some overlap where thread stoppages were taking
> place around and leading up to the point in time.
>
>> Heap after GC invocations=92 (full 1):
>>  par new generation total 436928K, used 86155K [0x0000000540000000, 0x0000000560000000, 0x00000005e0000000)
>>   eden space 349568K, 0% used [0x0000000540000000, 0x0000000540000000, 0x0000000555560000)
>>   from space 87360K, 98% used [0x0000000555560000, 0x000000055a982ef8, 0x000000055aab0000)
>>   to space 87360K, 0% used [0x000000055aab0000, 0x000000055aab0000, 0x0000000560000000)
>>  concurrent mark-sweep generation total 1572864K, used 681710K [0x00000005e0000000, 0x0000000640000000, 0x00000007c0000000)
>>  Metaspace used 41348K, capacity 42508K, committed 42772K, reserved 1087488K
>>   class space used 4812K, capacity 5121K, committed 5140K, reserved 1048576K
>> }
>> 2021-06-15T16:27:39.384+0000: 1462.170: Total time for which application threads were stopped: 0.0653007 seconds, Stopping threads took: 0.0019805 seconds
>> 2021-06-15T16:27:40.389+0000: 1463.174: Total time for which application threads were stopped: 0.0041873 seconds, Stopping threads took: 0.0037697 seconds
>> 2021-06-15T16:27:40.921+0000: 1463.706: Total time for which application threads were stopped: 0.0029477 seconds, Stopping threads took: 0.0024678 seconds
>> 2021-06-15T16:27:41.268+0000: 1464.053: Total time for which application threads were stopped: 0.0224393 seconds, Stopping threads took: 0.0219226 seconds
>> 2021-06-15T16:27:41.388+0000: 1464.173: [GC (CMS Initial Mark) [1 CMS-initial-mark: 788750K(1572864K)] 1052906K(2009792K), 0.0022256 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
>> 2021-06-15T16:27:41.390+0000: 1464.176: Total time for which application threads were stopped: 0.0060337 seconds, Stopping threads took: 0.0032933 seconds
>> 2021-06-15T16:27:41.390+0000: 1464.176: [CMS-concurrent-mark-start]
>> 2021-06-15T16:27:41.440+0000: 1464.225: [CMS-concurrent-mark: 0.050/0.050 secs] [Times: user=0.39 sys=0.00, real=0.05 secs]
>> 2021-06-15T16:27:41.440+0000: 1464.225: [CMS-concurrent-preclean-start]
>> 2021-06-15T16:27:41.445+0000: 1464.231: [CMS-concurrent-preclean: 0.005/0.005 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>> 2021-06-15T16:27:41.445+0000: 1464.231: [CMS-concurrent-abortable-preclean-start]
>> 2021-06-15T16:27:41.531+0000: 1464.316: Total time for which application threads were stopped: 0.0018167 seconds, Stopping threads took: 0.0012697 seconds
>> 2021-06-15T16:27:41.532+0000: 1464.317: Total time for which application threads were stopped: 0.0009634 seconds, Stopping threads took: 0.0005877 seconds
>> 2021-06-15T16:27:41.533+0000: 1464.318: Total time for which application threads were stopped: 0.0006194 seconds, Stopping threads took: 0.0003124 seconds
>> 2021-06-15T16:27:41.534+0000: 1464.319: Total time for which application threads were stopped: 0.0006736 seconds, Stopping threads took: 0.0003113 seconds
>> 2021-06-15T16:27:41.735+0000: 1464.521: Total time for which application threads were stopped: 0.0009390 seconds, Stopping threads took: 0.0005340 seconds
>> 2021-06-15T16:27:42.213+0000: 1464.998: Total time for which application threads were stopped: 0.0016779 seconds, Stopping threads took: 0.0003837 seconds
>> 2021-06-15T16:27:42.214+0000: 1464.999: Total time for which application threads were stopped: 0.0006721 seconds, Stopping threads took: 0.0002727 seconds
>> 2021-06-15T16:27:42.215+0000: 1465.000: Total time for which application threads were stopped: 0.0003734 seconds, Stopping threads took: 0.0000991 seconds
>> 2021-06-15T16:27:42.217+0000: 1465.002: Total time for which application threads were stopped: 0.0004387 seconds, Stopping threads took: 0.0001799 seconds
>> 2021-06-15T16:27:42.219+0000: 1465.004: Total time for which application threads were stopped: 0.0004952 seconds, Stopping threads took: 0.0002235 seconds
>> 2021-06-15T16:27:42.837+0000: 1465.622: Total time for which application threads were stopped: 0.0016946 seconds, Stopping threads took: 0.0013149 seconds
>> 2021-06-15T16:27:43.433+0000: 1466.218: Total time for which application threads were stopped: 0.0008748 seconds, Stopping threads took: 0.0004336 seconds
>> 2021-06-15T16:27:44.337+0000: 1467.122: Total time for which application threads were stopped: 0.0624611 seconds, Stopping threads took: 0.0610425 seconds
>> 2021-06-15T16:27:45.458+0000: 1468.243: Total time for which application threads were stopped: 0.0329742 seconds, Stopping threads took: 0.0324816 seconds
>> 2021-06-15T16:27:46.182+0000: 1468.967: Total time for which application threads were stopped: 0.0010628 seconds, Stopping threads took: 0.0006594 seconds
>> 2021-06-15T16:27:46.241+0000: 1469.026: Total time for which application threads were stopped: 0.0583459 seconds, Stopping threads took: 0.0579144 seconds
>> 2021-06-15T16:27:47.308+0000: 1470.093: Total time for which application threads were stopped: 0.0007076 seconds, Stopping threads took: 0.0003400 seconds
>> 2021-06-15T16:27:47.371+0000: 1470.157: Total time for which application threads were stopped: 0.0020045 seconds, Stopping threads took: 0.0014785 seconds
>> 2021-06-15T16:27:47.444+0000: 1470.229: Total time for which application threads were stopped: 0.0007924 seconds, Stopping threads took: 0.0004075 seconds
>> CMS: abort preclean due to time 2021-06-15T16:27:47.450+0000: 1470.235: [CMS-concurrent-abortable-preclean: 0.391/6.004 secs] [Times: user=9.74 sys=0.27, real=6.00 secs]
>> 2021-06-15T16:27:47.451+0000: 1470.236: [GC (CMS Final Remark) [YG occupancy: 412669 K (436928 K)]{Heap before GC invocations=92 (full 2):
>
> Also I had restarted the service at 14:50 which is almost 2 hours prior to
> this.
> I found a similar pattern in this one but involving a somewhat
> smaller collection:
>
>> 2021-06-15 15:44:09.698 INFO (qtp1836643189-60) [ x:archiver] o.a.s.c.S.Request [fusearchiver] webapp=/solr path=/replication params={wt=json&command=details&_=1623771837117} status=0 QTime=8397
>> 2021-06-15 15:44:09.698 INFO (qtp1836643189-23) [ x:archiver] o.a.s.c.S.Request [fusearchiver] webapp=/solr path=/admin/luke params={numTerms=0&show=index&wt=json&_=1623771837117} status=0 QTime=8399
>> 2021-06-15 15:44:10.035 INFO (qtp1836643189-24) [ x:archiver] o.a.s.c.S.Request [fusearchiver] webapp=/solr path=/select params={q=*:*&indent=on&fq=_accountId:12&wt=json&_=1623771792912} hits=1272929 status=0 QTime=50886
>> 2021-06-15 15:44:10.039 INFO (qtp1836643189-53) [ x:archiver] o.a.s.c.S.Request [fusearchiver] webapp=/solr path=/select params={q=*:*&indent=on&fq=_accountId:12&wt=json&_=1623771792912} hits=1272929 status=0 QTime=42102
>> 2021-06-15 15:44:10.077 INFO (qtp1836643189-53) [ x:archiver] o.a.s.s.HttpSolrCall Unable to write response, client closed connection or we are shutting down
>
> Again this is preceded by these calls to replication and luke handler that
> have fairly large QTime.
>
> I am not using replication, but it is enabled. From researching the
> command this is just a status call. So where could this be triggered from?
>
> Beyond that strange event what else should I be looking at or changing in
> my configuration to find these issues? Memory was around 7gb of the max
> 8gb. GC doesn't seem excessive. The memory is growing, but not spiking.
> Should I turn up the memory allocation?
>
> TIA
> Charlie
