Hi Everyone,

We are running Solr 8.5.2 in SolrCloud mode with an external ZooKeeper
ensemble (hosted on separate nodes). All of a sudden we are seeing a spike
in CPU, but at the same time there is neither any heavy indexing going on
nor any sudden increase in request rate.

Collection info:
The collection has 6 shards and each shard has 5 replicas (NRT type). Each
replica is hosted on a separate VM, so in total we have 30 VMs running.

Each shard has 14 million docs, avg size/doc: 909.5 bytes, and each shard is
around 12 GB on disk.

This spike first started on one VM, and almost immediately (within 1 minute)
the same CPU spike occurred on 2-3 more VMs, while the remaining VMs kept
running fine. We left these high-CPU VMs running for some time (more than
8 hours), but the CPU never came down.
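For the record, this is roughly how we inspect which threads are burning CPU
while a spike is on (a sketch: the pgrep pattern and paths are assumptions
for a standard Solr install, and jstack must come from the same JDK that
runs Solr):

# find the Solr pid (adjust the pattern to your install)
SOLR_PID=$(pgrep -f start.jar | head -1)

# per-thread CPU for the Solr process; note the hottest TIDs
top -H -b -n 1 -p "$SOLR_PID" | head -40

# thread dump to match the hot TIDs against Java stack traces
jstack "$SOLR_PID" > /tmp/solr-threads.txt

# top shows decimal thread ids; jstack shows them as hex nid=0x...
printf 'nid=0x%x\n' 12345   # replace 12345 with a hot TID from top

Matching the hex nid against the dump tells us whether the busy threads are
GC threads, query/searcher threads, or something else entirely.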



VM details:
OS: CentOS 7.7.1908
Java: OpenJDK version "1.8.0_262"
CPU/RAM: 8 vCPUs, 64 GiB memory
OS disk size: 256 GB (SSD)

JVM heap allocated on each node: 26 GB

GC parameters:
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=150 \
-XX:InitiatingHeapOccupancyPercent=70 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"


Cache configuration (solrconfig.xml):
<filterCache class="solr.CaffeineCache"
             size="8192"
             initialSize="512"
             autowarmCount="512"/>

<queryResultCache class="solr.CaffeineCache"
                  size="8192"
                  initialSize="3000"
                  autowarmCount="0"/>

<documentCache class="solr.CaffeineCache"
               size="8192"
               initialSize="3072"
               autowarmCount="0"/>



Here are details of different metrics during the CPU spike:

1. At the 21:06 timestamp you can see a sudden spike in CPU while the
request rate stays constant:
https://drive.google.com/file/d/1cJhFFIkfEdBJouw0A6PRz-HAHhIpudba/view?usp=sharing

2. Processes running at the 21:06 timestamp:
https://drive.google.com/file/d/1Qsfv-ivy664ShFihcb--EgapMcpgOij8/view?usp=sharing
https://drive.google.com/file/d/1Nak4bI7PqroNmImpsunUcMNm6pZ41TuG/view?usp=sharing
https://drive.google.com/file/d/1q3iuSZtK4rlzrM7vIIXdSTrNPOYd_XQ6/view?usp=sharing

3. 26 GB is allocated to the JVM, and used JVM memory hardly crosses 20 GB:
https://drive.google.com/file/d/1zSZFcqscXmWZbj-aMWhql28kbxbyW_Qa/view?usp=sharing

4. GC metrics also look normal:
https://drive.google.com/file/d/1zBTjL6tbzM_xQeMcqbVyCIt4qAaBDprP/view?usp=sharing


We decided to replace these VMs. Running the DELETEREPLICA command against
them throws a timeout error. We observed that the replica was deregistered
from state.json (in ZooKeeper), but its replica folder was still present on
the physical VM.
In fact, after the DELETEREPLICA command, even though no replica was hosted
on the VM and its request rate was 0 req/sec, its CPU was still high (see
the image below for reference). CPU came down to zero only after stopping
the Solr process.
https://drive.google.com/file/d/1HXe5jjs5kJCUWXBfl2FZ0Z3BtiCOaW3V/view?usp=sharing
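For reference, the delete call was along these lines (the collection, shard,
and replica names below are placeholders). The async parameter makes the
Collections API return a request id instead of blocking, which might
sidestep the HTTP-level timeout, though it would not explain the leftover
data folder:

# synchronous form (this is the one that times out for us)
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node5'

# async form: returns immediately, then poll the status
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node5&async=del-replica-1'
curl 'http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=del-replica-1'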

I'm not able to figure out what is wrong with the configuration. I have read
a few blogs, and most of them point to GC, but going through the GC metrics
I don't see anything unusual. Also, why is it happening on only a few VMs?
This issue has occurred three times in the last week.




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
