Hello all,
     We have been struggling with an issue where Solr intermittently uses all 
available CPU and becomes unresponsive, and it stays in that state until we 
restart it.  Solr then remains stable for a while, usually a few hours to a few 
days, before the problem happens again.  We've tried adjusting the caches and 
adding memory to both the VM and the JVM, but we haven't been able to solve the 
issue yet.

Here is some info about our server:
Solr:
  Solr 7.3.1, running on Java 1.8
  Running in cloud mode, but there's only one core

Host:
  CentOS 7
  8 CPUs, 56GB RAM
  The only other processes running on this VM are two ZooKeeper instances, one 
for this Solr instance and one for another Solr instance

Solr Config:
 - One Core
 - 36 million documents (maxDoc), 28 million (numDocs)
 - ~15GB
 - 10-20 Requests/second
 - The schema is fairly large (~100 fields), and we allow faceting and 
searching on many, but not all, of the fields
 - Data are imported once per minute through the DataImportHandler, with a hard 
commit at the end (see the sketch below).  We usually index ~100-500 documents 
per minute, many of which are updates to existing documents.
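
For context, the import handler is wired up in the standard way in 
solrconfig.xml, roughly like the sketch below (the data config file name here 
is just a placeholder, not our actual file); the scheduled import request then 
asks for a commit when it finishes, which is the hard commit mentioned above:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>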

Cache settings:
    <filterCache class="solr.FastLRUCache"
                 size="256"
                 initialSize="256"
                 autowarmCount="8"
                 showItems="64"/>

    <queryResultCache class="solr.LRUCache"
                      size="256"
                      initialSize="256"
                      autowarmCount="0"/>

    <documentCache class="solr.LRUCache"
                   size="1024"
                   initialSize="1024"
                   autowarmCount="0"/>

For the filterCache, we have tried sizes as low as 128, which increased our CPU 
usage and didn't solve the issue.  autowarmCount used to be much higher, but we 
have reduced it while trying to address this issue (see the sketch below).
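
For concreteness, the lower-size variant we tried looked roughly like this (an 
illustrative sketch based on the sizes above, not an exact copy of the old 
config):

    <!-- illustrative: size lowered to 128; other attributes unchanged -->
    <filterCache class="solr.FastLRUCache"
                 size="128"
                 initialSize="128"
                 autowarmCount="8"
                 showItems="64"/>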


The behavior we see:

Solr normally uses ~3-6GB of heap, and we usually have ~20GB of free memory.  
Occasionally, though, Solr is not able to free up memory and heap usage climbs. 
 Analyzing the GC logs shows a sharp rise in usage, with the GC (the default 
CMS collector) working hard but not reclaiming much.  Eventually the heap fills 
up, the CPUs max out, and Solr never recovers.  We have analyzed the logs to 
see whether particular queries are causing problems or whether there are 
network issues to ZooKeeper, but we haven't found any patterns.  After the 
issue starts, we often see session timeouts to ZooKeeper, but they do not 
appear to be the cause.



Does anyone have any recommendations on things to try, metrics to look into, 
or configuration issues I may be overlooking?

Thanks,
Jeremy
