Hi,

I had an event recently where my Solr server's CPU spiked and it stopped
answering incoming requests in a timely manner.  I'm having a hard time
pinpointing the cause.  I'm reviewing the log files, but I can't explain
why the CPU suddenly spiked.  Once this happened, all of the
queries started to back up.  There don't appear to have been many
queries coming in when this happened, but since queries are only logged
after they complete, it's hard to know for sure.

My setup is:

   - Single-node standalone server
   - OpenJDK 1.8
   - Solr 6.6.6
   - No replication configured
   - 4 collections ranging from 200 MB to 20 GB
   - 112 million docs in the collection in question

Here is the section of the logs where the queries started to pile up.  You
can see the QTimes climb sharply.

2021-06-15 16:27:35.647 INFO  (qtp1836643189-506) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=60736
> 2021-06-15 16:27:35.647 INFO  (qtp1836643189-483) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=132802
> 2021-06-15 16:27:35.647 INFO  (qtp1836643189-474) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=154802
> 2021-06-15 16:27:35.647 INFO  (qtp1836643189-467) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=121797
> 2021-06-15 16:27:35.648 INFO  (qtp1836643189-459) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=165804
> 2021-06-15 16:27:35.648 INFO  (qtp1836643189-487) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/admin/luke
> params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=110803
> 2021-06-15 16:27:35.648 INFO  (qtp1836643189-497) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/admin/luke
> params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=798
> 2021-06-15 16:27:35.647 INFO  (qtp1836643189-495) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=99798
> 2021-06-15 16:27:35.650 INFO  (qtp1836643189-448) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=176807
> 2021-06-15 16:27:35.650 INFO  (qtp1836643189-464) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/admin/luke
> params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=176809
> 2021-06-15 16:27:35.650 INFO  (qtp1836643189-479) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/admin/luke
> params={numTerms=0&show=index&wt=json&_=1623773588322} status=0 QTime=99804
> 2021-06-15 16:27:35.650 INFO  (qtp1836643189-445) [   x:customer]
> o.a.s.c.S.Request [customer]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623773588323} status=0 QTime=187795


That burst of replication and Luke handler requests lasted about one
second, but there are 155 entries like that, over and over.  Prior to this
I can't see any evidence of queries that might have slowed things down.
There was some updating happening on the smaller collection, but not on
the collection in question.
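
(The 155 figure is a rough count from grepping the request log for those
handler entries, something like:

  grep -c "path=/replication" solr.log

with the log file name from memory, so treat the number as approximate.)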

From the GC logs I did find some overlap: thread stoppages were occurring
around, and leading up to, that point in time.
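
(The stoppage lines below were pulled straight out of the GC log with
something along the lines of:

  grep "Total time for which application threads were stopped" solr_gc.log

where solr_gc.log is the default name; ours may be rotated or named
differently.)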

Heap after GC invocations=92 (full 1):
>  par new generation   total 436928K, used 86155K [0x0000000540000000,
> 0x0000000560000000, 0x00000005e0000000)
>   eden space 349568K,   0% used [0x0000000540000000, 0x0000000540000000,
> 0x0000000555560000)
>   from space 87360K,  98% used [0x0000000555560000, 0x000000055a982ef8,
> 0x000000055aab0000)
>   to   space 87360K,   0% used [0x000000055aab0000, 0x000000055aab0000,
> 0x0000000560000000)
>  concurrent mark-sweep generation total 1572864K, used 681710K
> [0x00000005e0000000, 0x0000000640000000, 0x00000007c0000000)
>  Metaspace       used 41348K, capacity 42508K, committed 42772K, reserved
> 1087488K
>   class space    used 4812K, capacity 5121K, committed 5140K, reserved
> 1048576K
> }
> 2021-06-15T16:27:39.384+0000: 1462.170: Total time for which application
> threads were stopped: 0.0653007 seconds, Stopping threads took: 0.0019805
> seconds
> 2021-06-15T16:27:40.389+0000: 1463.174: Total time for which application
> threads were stopped: 0.0041873 seconds, Stopping threads took: 0.0037697
> seconds
> 2021-06-15T16:27:40.921+0000: 1463.706: Total time for which application
> threads were stopped: 0.0029477 seconds, Stopping threads took: 0.0024678
> seconds
> 2021-06-15T16:27:41.268+0000: 1464.053: Total time for which application
> threads were stopped: 0.0224393 seconds, Stopping threads took: 0.0219226
> seconds
> 2021-06-15T16:27:41.388+0000: 1464.173: [GC (CMS Initial Mark) [1
> CMS-initial-mark: 788750K(1572864K)] 1052906K(2009792K), 0.0022256 secs]
> [Times: user=0.01 sys=0.00, real=0.00 secs]
> 2021-06-15T16:27:41.390+0000: 1464.176: Total time for which application
> threads were stopped: 0.0060337 seconds, Stopping threads took: 0.0032933
> seconds
> 2021-06-15T16:27:41.390+0000: 1464.176: [CMS-concurrent-mark-start]
> 2021-06-15T16:27:41.440+0000: 1464.225: [CMS-concurrent-mark: 0.050/0.050
> secs] [Times: user=0.39 sys=0.00, real=0.05 secs]
> 2021-06-15T16:27:41.440+0000: 1464.225: [CMS-concurrent-preclean-start]
> 2021-06-15T16:27:41.445+0000: 1464.231: [CMS-concurrent-preclean:
> 0.005/0.005 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
> 2021-06-15T16:27:41.445+0000: 1464.231:
> [CMS-concurrent-abortable-preclean-start]
> 2021-06-15T16:27:41.531+0000: 1464.316: Total time for which application
> threads were stopped: 0.0018167 seconds, Stopping threads took: 0.0012697
> seconds
> 2021-06-15T16:27:41.532+0000: 1464.317: Total time for which application
> threads were stopped: 0.0009634 seconds, Stopping threads took: 0.0005877
> seconds
> 2021-06-15T16:27:41.533+0000: 1464.318: Total time for which application
> threads were stopped: 0.0006194 seconds, Stopping threads took: 0.0003124
> seconds
> 2021-06-15T16:27:41.534+0000: 1464.319: Total time for which application
> threads were stopped: 0.0006736 seconds, Stopping threads took: 0.0003113
> seconds
> 2021-06-15T16:27:41.735+0000: 1464.521: Total time for which application
> threads were stopped: 0.0009390 seconds, Stopping threads took: 0.0005340
> seconds
> 2021-06-15T16:27:42.213+0000: 1464.998: Total time for which application
> threads were stopped: 0.0016779 seconds, Stopping threads took: 0.0003837
> seconds
> 2021-06-15T16:27:42.214+0000: 1464.999: Total time for which application
> threads were stopped: 0.0006721 seconds, Stopping threads took: 0.0002727
> seconds
> 2021-06-15T16:27:42.215+0000: 1465.000: Total time for which application
> threads were stopped: 0.0003734 seconds, Stopping threads took: 0.0000991
> seconds
> 2021-06-15T16:27:42.217+0000: 1465.002: Total time for which application
> threads were stopped: 0.0004387 seconds, Stopping threads took: 0.0001799
> seconds
> 2021-06-15T16:27:42.219+0000: 1465.004: Total time for which application
> threads were stopped: 0.0004952 seconds, Stopping threads took: 0.0002235
> seconds
> 2021-06-15T16:27:42.837+0000: 1465.622: Total time for which application
> threads were stopped: 0.0016946 seconds, Stopping threads took: 0.0013149
> seconds
> 2021-06-15T16:27:43.433+0000: 1466.218: Total time for which application
> threads were stopped: 0.0008748 seconds, Stopping threads took: 0.0004336
> seconds
> 2021-06-15T16:27:44.337+0000: 1467.122: Total time for which application
> threads were stopped: 0.0624611 seconds, Stopping threads took: 0.0610425
> seconds
> 2021-06-15T16:27:45.458+0000: 1468.243: Total time for which application
> threads were stopped: 0.0329742 seconds, Stopping threads took: 0.0324816
> seconds
> 2021-06-15T16:27:46.182+0000: 1468.967: Total time for which application
> threads were stopped: 0.0010628 seconds, Stopping threads took: 0.0006594
> seconds
> 2021-06-15T16:27:46.241+0000: 1469.026: Total time for which application
> threads were stopped: 0.0583459 seconds, Stopping threads took: 0.0579144
> seconds
> 2021-06-15T16:27:47.308+0000: 1470.093: Total time for which application
> threads were stopped: 0.0007076 seconds, Stopping threads took: 0.0003400
> seconds
> 2021-06-15T16:27:47.371+0000: 1470.157: Total time for which application
> threads were stopped: 0.0020045 seconds, Stopping threads took: 0.0014785
> seconds
> 2021-06-15T16:27:47.444+0000: 1470.229: Total time for which application
> threads were stopped: 0.0007924 seconds, Stopping threads took: 0.0004075
> seconds
>  CMS: abort preclean due to time 2021-06-15T16:27:47.450+0000: 1470.235:
> [CMS-concurrent-abortable-preclean: 0.391/6.004 secs] [Times: user=9.74
> sys=0.27, real=6.00 secs]
> 2021-06-15T16:27:47.451+0000: 1470.236: [GC (CMS Final Remark) [YG
> occupancy: 412669 K (436928 K)]{Heap before GC invocations=92 (full 2):


Also, I had restarted the service at 14:50, which is almost two hours prior
to this.  I found a similar pattern in that earlier window too, but
involving a somewhat smaller collection:

2021-06-15 15:44:09.698 INFO  (qtp1836643189-60) [   x:archiver]
> o.a.s.c.S.Request [fusearchiver]  webapp=/solr path=/replication
> params={wt=json&command=details&_=1623771837117} status=0 QTime=8397
> 2021-06-15 15:44:09.698 INFO  (qtp1836643189-23) [   x:archiver]
> o.a.s.c.S.Request [fusearchiver]  webapp=/solr path=/admin/luke
> params={numTerms=0&show=index&wt=json&_=1623771837117} status=0 QTime=8399
> 2021-06-15 15:44:10.035 INFO  (qtp1836643189-24) [   x:archiver]
> o.a.s.c.S.Request [fusearchiver]  webapp=/solr path=/select
> params={q=*:*&indent=on&fq=_accountId:12&wt=json&_=1623771792912}
> hits=1272929 status=0 QTime=50886
> 2021-06-15 15:44:10.039 INFO  (qtp1836643189-53) [   x:archiver]
> o.a.s.c.S.Request [fusearchiver]  webapp=/solr path=/select
> params={q=*:*&indent=on&fq=_accountId:12&wt=json&_=1623771792912}
> hits=1272929 status=0 QTime=42102
> 2021-06-15 15:44:10.077 INFO  (qtp1836643189-53) [   x:archiver]
> o.a.s.s.HttpSolrCall Unable to write response, client closed connection or
> we are shutting down


Again, this is preceded by calls to the replication and Luke handlers with
fairly large QTimes.

I am not using replication, but the handler is enabled.  From researching
the command, this appears to be just a status call.  So where could it be
triggered from?
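
For reference, reconstructed from the params logged above, the two requests
look like this (host, port, and core URL are placeholders for my server):

  http://localhost:8983/solr/customer/replication?command=details&wt=json
  http://localhost:8983/solr/customer/admin/luke?numTerms=0&show=index&wt=json

Both also arrive with an _=<timestamp> parameter appended, which I assume is
just a cache-buster from whatever client is issuing them.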

Beyond that strange event, what else should I be looking at or changing in
my configuration to track these issues down?  Memory usage was around 7 GB
of the 8 GB maximum.  GC doesn't seem excessive.  The memory is growing,
but not spiking.  Should I turn up the memory allocation?
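
If bumping the heap is the answer, I'm assuming it's just the usual
solr.in.sh change and a restart, something like going from

  SOLR_HEAP="8g"

to SOLR_HEAP="12g" (the exact variable and values here are my assumption
based on a stock solr.in.sh), but I'd rather understand the root cause than
just throw memory at it.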

TIA
Charlie
