Hello,

In our non-cloud Solr environment, we are seeing a very large number of connections stuck in CLOSE_WAIT, which causes the JVM to stop "working" within 3 minutes of Solr starting.

solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep 10.xxx.xxx.xxx | wc -l
9453

The only option then is `kill -9`, because even `jcmd <pid> Thread.print` is
unable to connect to the JVM. The problem can be reproduced at will.

Any suggestions as to what could be causing this, or how to fix it?

Details of the system are as follows; it has been set up for "bulk indexing".

-------------

Solr / server:
v6.2.2, non-SolrCloud, running in Docker on Kubernetes
java: 1.8.0_151 25.151-b12 HotSpot 64bit | Oracle
jvm: heap 30GB
os: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64
GNU/Linux
os memory: 230GB | no swap configured
os cpu: 32vCPU
jvm flags:
  -XX:+UseLargePages
  -XX:LargePageSizeInBytes=2m
  -Xms512m
  -Xmx512m
  -XX:NewRatio=3
  -XX:SurvivorRatio=4
  -XX:TargetSurvivorRatio=90
  -XX:MaxTenuringThreshold=8
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:ConcGCThreads=4
  -XX:ParallelGCThreads=4
  -XX:+CMSScavengeBeforeRemark
  -XX:PretenureSizeThreshold=64m
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=50
  -XX:CMSMaxAbortablePrecleanTime=6000
  -XX:+CMSParallelRemarkEnabled
  -XX:+ParallelRefProcEnabled

non-cloud solr.xml:
transientCacheSize = 30
shareSchema = true
Also, only 4 cores are ever POSTed to.

Client / Java 8 app:
An AsyncHTTPClient POST-ing gzip payloads.
PoolingNHttpClientConnectionManager (maxTotal=10,000, maxPerRoute=1,000)
ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 ms (4 secs)
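
For reference, a minimal sketch of how that client configuration might look with Apache HttpAsyncClient 4.x (the builder code and class name are my own; only the pool sizes and timeouts come from our actual settings):

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
import org.apache.http.nio.reactor.ConnectingIOReactor;

public class IndexingClientFactory {
    public static CloseableHttpAsyncClient build() throws Exception {
        // NIO reactor backing the pooling connection manager
        ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor();
        PoolingNHttpClientConnectionManager cm =
                new PoolingNHttpClientConnectionManager(ioReactor);
        cm.setMaxTotal(10_000);          // maxTotal from the settings above
        cm.setDefaultMaxPerRoute(1_000); // maxPerRoute from the settings above

        // All three timeouts set to 4 seconds, as above
        RequestConfig rc = RequestConfig.custom()
                .setConnectionRequestTimeout(4_000)
                .setConnectTimeout(4_000)
                .setSocketTimeout(4_000)
                .build();

        CloseableHttpAsyncClient client = HttpAsyncClients.custom()
                .setConnectionManager(cm)
                .setDefaultRequestConfig(rc)
                .build();
        client.start();
        return client;
    }
}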

Gzip payloads:
Each payload is about 800 JSON documents, like this:
[
  {"id": "abcdefxxxxx", "datetimestamp": "xxxxxx", "key1": "xxxxxx", "key2": "zzzzz", ...},
  ...
]
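
A rough sketch of how one such payload could be gzipped and POSTed with the client above; the PayloadSender class, the /update path, and the Content-Encoding header are illustrative assumptions, not our exact code:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.entity.ByteArrayEntity;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;

public class PayloadSender {
    // POST one gzip-compressed JSON array (~800 docs) to a core's update handler
    public static void send(CloseableHttpAsyncClient client, String coreUrl, String jsonArray)
            throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(jsonArray.getBytes(StandardCharsets.UTF_8));
        }
        // e.g. coreUrl = http://host:8983/solr/core1 (hypothetical)
        HttpPost post = new HttpPost(coreUrl + "/update");
        post.setHeader("Content-Type", "application/json");
        post.setHeader("Content-Encoding", "gzip");
        post.setEntity(new ByteArrayEntity(buf.toByteArray()));

        client.execute(post, new FutureCallback<HttpResponse>() {
            @Override public void completed(HttpResponse response) { /* consume response */ }
            @Override public void failed(Exception ex) { /* log the failure */ }
            @Override public void cancelled() { }
        });
    }
}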

POST rate:
Each of the 4 Solr cores receives ~32 payloads per second from the custom Java
app (the plugin handler metrics in Solr report the same).
That is roughly 102,400 docs per second in total (32 payloads x 800 docs x 4 Solr cores).

Document uniqueness:
No doc or id is ever repeated or sent concurrently.
No atomic updates are needed (overwrite=false was set on the AddUpdateCommand
in the Solr handler; see the sketch below).
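
For context, one way to force overwrite=false is a custom UpdateRequestProcessor. Below is a minimal sketch of that approach; the factory class name is hypothetical, and our actual plugin handler may set the flag differently:

import java.io.IOException;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical factory: forces overwrite=false on every add
public class NoOverwriteUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                cmd.overwrite = false;   // docs are never repeated, so skip overwrite handling
                super.processAdd(cmd);
            }
        };
    }
}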

solrconfig.xml:
To meet the bulk-indexing requirement, the update log and soft commits were
minimized / removed.
  <indexConfig>
    <lockType>none</lockType>
    <ramBufferSizeMB>200</ramBufferSizeMB>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">1</int>
      <int name="maxMergeCount">6</int>
    </mergeScheduler>
  </indexConfig>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:10000}</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
      <maxDocs>${solr.autoSoftCommit.maxDocs:-1}</maxDocs>
      <openSearcher>false</openSearcher>
    </autoSoftCommit>
  </updateHandler>

-M



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
