Hi everyone,

We are using Solr 8.5.2 (SolrCloud mode) with an external ZooKeeper ensemble (hosted on a separate node). We are seeing sudden spikes in CPU usage, even though no heavy indexing is running at the time and the request rate has not increased.

Collection info:
The collection has 6 shards and each shard has 5 replicas (NRT type). Each replica is hosted on a separate VM, so we have 30 VMs running in total. Each shard has 14 million docs, the average size per doc is 909.5 bytes, and the size of each shard is around 12 GB.

The spike first started on one VM, and within 1 minute similar CPU spikes occurred on 2-3 more VMs. The remaining VMs were unaffected and kept running fine. We left the high-CPU VMs running for more than 8 hours, but the CPU usage did not come down.

VM detail:
  OS: CentOS 7.7.1908
  Java: openjdk version "1.8.0_262"
  CPU/RAM: 8 vCPUs, 64 GiB memory
  OS disk size: 256 GB (SSD)
  JVM memory allocated to each machine: 26 GB

GC parameters:
  GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:+PerfDisableSharedMem \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=150 \
  -XX:InitiatingHeapOccupancyPercent=70 \
  -XX:+UseLargePages \
  -XX:+AggressiveOpts \
  "

Caching layer parameters:
  <filterCache class="solr.CaffeineCache" size="8192" initialSize="512" autowarmCount="512"/>
  <queryResultCache class="solr.CaffeineCache" size="8192" initialSize="3000" autowarmCount="0"/>
  <documentCache class="solr.CaffeineCache" size="8192" initialSize="3072" autowarmCount="0"/>

Here are the details of different metrics during the CPU spike:

1. At the 21.06 timestamp there is a sudden spike in CPU, while the request rate stays constant.
   https://drive.google.com/file/d/1cJhFFIkfEdBJouw0A6PRz-HAHhIpudba/view?usp=sharing

2. Processes running at the 21.06 timestamp:
   https://drive.google.com/file/d/1Qsfv-ivy664ShFihcb--EgapMcpgOij8/view?usp=sharing
   https://drive.google.com/file/d/1Nak4bI7PqroNmImpsunUcMNm6pZ41TuG/view?usp=sharing
   https://drive.google.com/file/d/1q3iuSZtK4rlzrM7vIIXdSTrNPOYd_XQ6/view?usp=sharing

3. 26 GB is allocated to the JVM, and used JVM memory hardly ever crosses 20 GB.
   https://drive.google.com/file/d/1zSZFcqscXmWZbj-aMWhql28kbxbyW_Qa/view?usp=sharing

4. GC metrics are also normal.
   https://drive.google.com/file/d/1zBTjL6tbzM_xQeMcqbVyCIt4qAaBDprP/view?usp=sharing

We decided to replace these VMs. Running the DELETEREPLICA command threw a timeout error. We observed that the replica was deregistered from state.json (in ZooKeeper), but its replica folder was still present on the VM. In fact, after the DELETEREPLICA command, even though no replica was hosted on the VM and its request rate was 0 req/sec, its CPU usage was still high (see the image below for reference). CPU usage came down to zero only after stopping the Solr process.
https://drive.google.com/file/d/1HXe5jjs5kJCUWXBfl2FZ0Z3BtiCOaW3V/view?usp=sharing

I'm not able to figure out what is wrong with the configuration. I have read a few blogs, and most of them point to GC. On going through the GC metrics, I don't see anything unusual. Also, why is this happening on only a few VMs? In the last week, this issue has occurred three times.

--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
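
As a quick cross-check of the sizing figures quoted in this post, here is a small Python sketch. It only redoes the arithmetic from the numbers above (6 shards, 5 replicas per shard, 14 million docs per shard, 909.5 bytes per doc, 64 GiB RAM, 26 GB heap). Treating the 26 GB heap as 26 GiB and the function name `cluster_summary` are assumptions made for illustration. It does not explain the spike; it only suggests that one replica's index should fit in the RAM left outside the JVM heap.

```python
# Figures quoted in the post. Treating the heap as 26 GiB is an assumption.
SHARDS = 6
REPLICAS_PER_SHARD = 5
DOCS_PER_SHARD = 14_000_000
AVG_DOC_BYTES = 909.5
RAM_GIB = 64
HEAP_GIB = 26

GIB = 1024 ** 3


def cluster_summary():
    """Recompute the derived figures from the numbers given in the post."""
    vms = SHARDS * REPLICAS_PER_SHARD            # one replica per VM
    total_docs = SHARDS * DOCS_PER_SHARD         # unique docs across shards
    shard_bytes = DOCS_PER_SHARD * AVG_DOC_BYTES
    shard_gib = shard_bytes / GIB
    off_heap_gib = RAM_GIB - HEAP_GIB            # RAM not reserved for the heap
    return {
        "vms": vms,
        "total_docs": total_docs,
        "shard_gib": shard_gib,
        "off_heap_gib": off_heap_gib,
        "index_fits_off_heap": shard_gib < off_heap_gib,
    }


if __name__ == "__main__":
    for key, value in cluster_summary().items():
        print(f"{key}: {value}")
```

The computed shard size is consistent with the "around 12 GB" reported above, and it is well below the memory remaining outside the heap on each VM.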