Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
FYI, this issue went away after solrconfig.xml was tuned. The "Hard commits blocked | non-solrcloud v6.6.2" thread has the details:
http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
> Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

Looks like it is the issue. (I've redacted IP addresses below for security reasons.)

    solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
       8425 CLOSE_WAIT -
         92 ESTABLISHED -
          1 FIN_WAIT2 -
          1 Foreign Address
          6 LISTEN -
        712 TIME_WAIT -
          1 established)

    solr [ /opt/solr ]$ echo "run -b org.eclipse.jetty.server:context=HTTP/1.1@63e2203c,id=0,type=serverconnector dump " | java -jar jmxterm-1.0.0-uber.jar -l localhost:18983 -v silent -n > jettyJmx.out

    solr [ /opt/solr ]$ netstat -anp | grep CLOSE_WAIT | head -n1
    tcp    1    0 10.xxx.x.xx:8983    10.xxx.x.xxx:53873    CLOSE_WAIT    1/java

    solr [ /opt/solr ]$ grep "10.xxx.x.xxx:53873" jettyJmx.out
    |   +- SelectionKey@5ee4ef9f{i=0}->SelectChannelEndPoint@69feb476{/10.xxx.x.xxx:53873<->8983,Open,in,out,-,-,1712/5000,HttpConnection@c93d7fa}{io=0/0,kio=0,kro=1}

    solr [ /opt/solr ]$ cat jettyJmx.out | grep 8983 | grep SelectChannelEndPoint | grep "{io=0/0,kio=0,kro=1}" | wc -l
    8220
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
On 2/5/2018 12:29 PM, mmb1234 wrote:
> Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

Could be. That was one of the Jetty issues I found, and the one that said 9.4 doesn't have the problem. Your Solr version would be using Jetty 9.3, not 9.2.

> The custom client code is:

I don't know a lot about all the code you shared. But this caught my eye:

> idleConnectionFuture = executor.scheduleAtFixedRate(() -> {
>     cmAsync.closeExpiredConnections();
>     cmAsync.closeIdleConnections(1, TimeUnit.SECONDS);
> }, 1, 1, TimeUnit.SECONDS);

You're really going to close connections that have been idle for only one second? A lot of Solr requests are going to need more than one second to complete. As I said in the earlier message, I very much recommend a socket/idle timeout of at least one minute. Make it 30 seconds if one minute seems too long.

> Also /solr/admin/cores takes a very long time to respond (QTime = 300secs+).
> So curl with a timeout also causes an additional CLOSE_WAIT (as expected)
> curl -m 5 'http://:/solr/admin/cores'

It literally takes five minutes for /solr/admin/cores to respond on your server? That means something is REALLY REALLY wrong. When I hit it on one of my servers, which has a few dozen cores, the QTime is 39 milliseconds, and I see the response in the browser very quickly.

> Note there are 100's of solr cores on each solr jvm but only a few need to
> be open at any given time in each solr jvm to avoid heap memory bloat.
> solr.xml has this setting therefore because we're not running solr cloud.
> ${transientCacheSize:30}

Loading and unloading large cores on a frequent basis is going to make operations pretty slow. If your index data is too large to have loaded all at once, perhaps you need more memory or more servers.

> Also <updateLog> was removed from solrconfig.xml because we saw 1000's of
> threads BLOCKED on VersionBucket even with
> <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:655360}</int>
>
> The side effect is lots of merges [ we'll tackle that when solr stops
> die-ing :-) ]

The presence, absence, or configuration of Solr's transaction log is not going to affect segment merging in any way. Segment merging is a function of Lucene, not Solr, and Lucene has absolutely no knowledge of Solr features like the transaction log.

I would strongly recommend that you do NOT disable the transaction log. Doing so might speed things up a very small amount, but that speedup is NOT a good tradeoff, considering that you'd be more likely to lose data in the event of a serious problem.

Lots of segment merges is a result of doing a lot of indexing.

Thanks,
Shawn
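Shawn's objection to the one-second eviction can be made concrete with a small model. This is a hedged sketch in plain Java; IdleEvictor and its evictable() method are hypothetical illustrations, not Apache HttpClient API. A pooled connection whose last activity was 1712 ms ago (like the endpoint shown idle 1712/5000 ms in the Jetty JMX dump earlier in the thread) is closed under a 1-second limit, but survives the one-minute limit Shawn recommends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical model (not Apache HttpClient API) of closeIdleConnections():
// a connection is evicted once it has been idle for at least the limit.
class IdleEvictor {
    private final long idleLimitMillis;

    IdleEvictor(long limit, TimeUnit unit) {
        this.idleLimitMillis = unit.toMillis(limit);
    }

    // Given last-activity timestamps (ms) and the current time, return the
    // timestamps of connections that would be closed.
    List<Long> evictable(List<Long> lastActivity, long now) {
        List<Long> out = new ArrayList<>();
        for (long t : lastActivity) {
            if (now - t >= idleLimitMillis) {
                out.add(t);
            }
        }
        return out;
    }
}

public class IdleEvictionDemo {
    public static void main(String[] args) {
        long now = 100_000;
        // One connection idle 1712 ms (like the endpoint in the JMX dump),
        // one idle only 200 ms.
        List<Long> pool = Arrays.asList(now - 1712, now - 200);

        // The thread's client used closeIdleConnections(1, TimeUnit.SECONDS).
        System.out.println("1s limit evicts: "
                + new IdleEvictor(1, TimeUnit.SECONDS).evictable(pool, now).size());
        // Shawn's recommendation: at least one minute.
        System.out.println("60s limit evicts: "
                + new IdleEvictor(60, TimeUnit.SECONDS).evictable(pool, now).size());
    }
}
```

The point of the model: any connection waiting more than one second on a slow request or sitting briefly between requests is torn down, so under load the pool churns connections constantly.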
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

I have noticed that when the number of http requests / sec is increased, CLOSE_WAITs increase linearly until solr stops accepting socket connections. Netstat output is:

    $ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
       9784 CLOSE_WAIT 1/java
          1 ESTABLISHED -
          3 ESTABLISHED 1/java
          1 Foreign Address
          2 LISTEN -
          4 LISTEN 1/java
         27 TIME_WAIT -
          1 established)

The custom client code is:

    RequestConfig config = RequestConfig.custom()
        .setConnectionRequestTimeout(4000)
        .setConnectTimeout(4000)
        .setSocketTimeout(4000)
        .build();

    ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor();
    PoolingNHttpClientConnectionManager cmAsync =
        new PoolingNHttpClientConnectionManager(ioReactor);
    cmAsync.setMaxTotal(1);
    cmAsync.setDefaultMaxPerRoute(1000);

    asynClient = HttpAsyncClients.custom()
        .setDefaultRequestConfig(config)
        .setConnectionManager(cmAsync).build();
    asynClient.start();

    executor = Executors.newScheduledThreadPool(1);
    idleConnectionFuture = executor.scheduleAtFixedRate(() -> {
        cmAsync.closeExpiredConnections();
        cmAsync.closeIdleConnections(1, TimeUnit.SECONDS);
    }, 1, 1, TimeUnit.SECONDS);

Also /solr/admin/cores takes a very long time to respond (QTime = 300secs+). So curl with a timeout also causes an additional CLOSE_WAIT (as expected):

    curl -m 5 'http://:/solr/admin/cores'

customhandler QTime = ~150 (ms) or lower even under max load for the active cores.

Note there are 100's of solr cores on each solr jvm but only a few need to be open at any given time in each solr jvm to avoid heap memory bloat. solr.xml therefore has this setting because we're not running solr cloud:

    ${transientCacheSize:30}

Also <updateLog> was removed from solrconfig.xml because we saw 1000's of threads BLOCKED on VersionBucket even with:

    <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:655360}</int>

The side effect is lots of merges [ we'll tackle that when solr stops die-ing :-) ]
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
On 2/2/2018 10:00 AM, mmb1234 wrote:
> Client / java8 app: An AsyncHTTPClient POST-ing gzip payloads.
> PoolingNHttpClientConnectionManager maxtotal=10,000 and maxperroute=1000)
> ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 (4 secs)

I have to concur with the person who identifies themselves as "S G", later in the thread. You should absolutely be using SolrJ, not a java http client. SolrJ handles parsing of the response for you, providing you with java objects that are REALLY easy to extract information from. It also handles all of the HTTP details for you, including URL encoding, etc.

It's easy enough to provide a custom HttpClient object (HttpClient is from another Apache project; it is used by Solr/SolrJ) to the CloudSolrClient or HttpSolrClient object. That lets you change the timeouts and the limits on max concurrent connections.

I just noticed that you said the socket timeout is 4 seconds. I didn't notice this when I read your message the first time. While Solr is known for blazing speed, 4 seconds is WAY too short for this timeout. If a client resets the TCP connection because of a socket timeout, Jetty (which Solr is probably running in) is going to report an EofException -- specifically, org.eclipse.jetty.io.EofException. Some of the bug reports I saw on Jetty specifically mentioned EofException in connection with a high CLOSE_WAIT count.

Your socket timeout should be at least one minute. Four seconds would be fine for connection timeouts. In my own software configurations for various client software, I have been known to use one second for the connection timeout -- on a LAN, if it doesn't connect in far less than one second, something is probably wrong somewhere.

I'm curious about the mention of gzip. The Jetty that Solr ships with doesn't have gzip compression enabled for HTTP, and I'm not aware of anything in Solr that handles gzip files. Maybe the Extracting Request Handler does, but if you've been following recommendations that are common on this list, you won't be using ERH except as a proof of concept.

Thanks,
Shawn
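The too-short-socket-timeout failure mode Shawn describes can be reproduced with plain java.net sockets. This is a minimal sketch (the toy server and its delays are fabricated for illustration, not Solr or Jetty code): when SO_TIMEOUT is shorter than the server's processing time, the client abandons the request, which is the kind of client-side give-up that Shawn says surfaces as EofException and CLOSE_WAIT on the Jetty side.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketTimeoutDemo {

    // Returns true if a client with the given SO_TIMEOUT gives up before a
    // toy server that takes serverDelayMillis to produce its one-byte reply.
    static boolean timedOut(int soTimeoutMillis, int serverDelayMillis) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread serverThread = new Thread(() -> {
                try (Socket s = server.accept()) {
                    Thread.sleep(serverDelayMillis);   // simulated slow request
                    s.getOutputStream().write('k');    // the eventual response
                } catch (Exception ignored) {
                    // client may already have closed the connection
                }
            });
            serverThread.start();
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
                client.setSoTimeout(soTimeoutMillis);
                try {
                    client.getInputStream().read();
                    return false;                      // response arrived in time
                } catch (SocketTimeoutException e) {
                    return true;                       // client gave up first
                }
            } finally {
                serverThread.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // A timeout shorter than the processing time aborts the request;
        // a generous timeout lets the same request complete.
        System.out.println("100ms timeout vs 400ms server: timedOut=" + timedOut(100, 400));
        System.out.println("2000ms timeout vs 100ms server: timedOut=" + timedOut(2000, 100));
    }
}
```

In the thread's terms: with a 4-second socket timeout, any Solr request (a commit, a slow /solr/admin/cores) that takes longer than 4 seconds is abandoned this way, so a one-minute timeout trades a little latency on true failures for not killing healthy slow requests.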
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
Hi Arcadius,

Most of the clients use Solrj to interact with Solr. Does it not automatically handle the connection pools?

SG

On Fri, Feb 2, 2018 at 4:47 PM, Arcadius Ahouansou wrote:

> I have seen a lot of CLOSE_WAIT in the past.
> In many cases, it was that the client application was not releasing/closing
> or pooling connections properly.
>
> I would suggest you double check the client code first.
>
> Arcadius.
>
> On 2 February 2018 at 23:52, mmb1234 wrote:
>
> > > You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> > > but the JVM argument list includes "-Xmx512m" which is a 512MB heap
> >
> > My typos. They're 6.6.2 and -Xmx30g respectively.
> >
> > > many open connections causes is a large number of open file handles,
> >
> > solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
> > vm.max_map_count = 262144
> >
> > The only thing I notice right before solr shutdown messages in solr.log is
> > that the /update QTime goes from ~500ms to ~25.
> >
> > There is an automated health check that issues a kill on the due
> > to http connection timeout.
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Applied Knowledge Is Power
> Office : +441444702101
> Mobile: +447908761999
> Menelic Ltd: menelic.com
> Visitor Management System: menelicvisitor.com
> ---
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
I have seen a lot of CLOSE_WAIT in the past. In many cases, it was that the client application was not releasing/closing or pooling connections properly.

I would suggest you double check the client code first.

Arcadius.

On 2 February 2018 at 23:52, mmb1234 wrote:

> > You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> > but the JVM argument list includes "-Xmx512m" which is a 512MB heap
>
> My typos. They're 6.6.2 and -Xmx30g respectively.
>
> > many open connections causes is a large number of open file handles,
>
> solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
> vm.max_map_count = 262144
>
> The only thing I notice right before solr shutdown messages in solr.log is
> that the /update QTime goes from ~500ms to ~25.
>
> There is an automated health check that issues a kill on the due
> to http connection timeout.
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

--
Arcadius Ahouansou
Menelic Ltd | Applied Knowledge Is Power
Office : +441444702101
Mobile: +447908761999
Menelic Ltd: menelic.com
Visitor Management System: menelicvisitor.com
---
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
> You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> but the JVM argument list includes "-Xmx512m" which is a 512MB heap

My typos. They're 6.6.2 and -Xmx30g respectively.

> many open connections causes is a large number of open file handles,

    solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
    vm.max_map_count = 262144

The only thing I notice right before the solr shutdown messages in solr.log is that the /update QTime goes from ~500ms to ~25.

There is an automated health check that issues a kill on the due to http connection timeout.
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
On 2/2/2018 10:00 AM, mmb1234 wrote:
> In our solr non-cloud env., we are seeing lots of CLOSE_WAIT, causing jvm
> to stop "working" within 3 mins of solr start.
>
> solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep 10.xxx.xxx.xxx | wc -l
> 9453

Solr isn't handling network services. That's being done by the servlet container. The servlet container included with Solr is Jetty. Since version 5.0, Jetty is the only option with official support.

There are a number of issues against Jetty where people are seeing many CLOSE_WAIT problems. At least one of them does acknowledge that there is a bug in Jetty, but in discussion with them on IRC, they have apparently never been able to actually reproduce the problem. One issue says that they don't think Jetty 9.4.x will have the problem, but Solr won't have Jetty 9.4.x until Solr 7.3.0 is released.

You said that you're running Solr 6.2.2, but there is no 6.2.2 version. There is 6.4.2 and 6.6.2. Solr 6.6.x and 6.4.x include Jetty 9.3.14; 6.2.x includes Jetty 9.3.8.

Another oddity: your info says that Solr has a 30GB heap, but the JVM argument list includes "-Xmx512m", which is a 512MB heap.

One issue that many open connections cause is a large number of open file handles, and that can apparently cause software to become unresponsive. Are you seeing any errors in Solr's logfile before the process becomes unresponsive?

Thanks,
Shawn
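The CLOSE_WAIT state itself is easy to reproduce outside Jetty; it appears whenever the remote end closes a TCP connection and the local application never calls close(). A minimal sketch with plain java.net sockets (the class and method names are mine, for illustration): read() returning -1 is the application-level signal that the peer's FIN arrived, and until the local socket is closed the kernel holds it in CLOSE_WAIT. Multiplied across thousands of requests, that is the netstat picture in this thread.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal reproduction of how a socket ends up in CLOSE_WAIT: the peer
// sends FIN, the local side sees EOF, but until the application calls
// close() the kernel keeps the local socket in CLOSE_WAIT.
public class CloseWaitDemo {

    // Connect a client to a local server, close the client end, and return
    // what the server-side read() yields: -1 means the peer's FIN arrived.
    static int readAfterPeerClose() throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();
            client.close();                                // peer sends FIN
            int result = accepted.getInputStream().read(); // -1: EOF seen
            // Between the FIN and this close(), netstat would show
            // 'accepted' in CLOSE_WAIT; forgetting this close() is how
            // thousands of such sockets accumulate.
            accepted.close();
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close: " + readAfterPeerClose());
    }
}
```

Whether the leak is in application code or in the container (the suspected Jetty bug), the mechanism is the same: something on the Solr side is holding sockets open after the client has hung up.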
9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
Hello,

In our solr non-cloud env., we are seeing lots of CLOSE_WAIT, causing the jvm to stop "working" within 3 mins of solr start.

    solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep 10.xxx.xxx.xxx | wc -l
    9453

The only option then is `kill -9` because even `jcmd Thread.print` is unable to connect to the jvm.

The problem can be reproduced at will. Any suggestions on what could be causing this, or the fix?

Details of the system are as follows; it has been set up for "bulk indexing".

Solr / server:
- v6.2.2 non-solrcloud in a docker with kubernetes
- java: 1.8.0_151 25.151-b12 HotSpot 64bit | Oracle
- jvm: heap 30GB
- os: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
- os memory: 230GB | no swap configured
- os cpu: 32vCPU
- jvm flags: "-XX:+UseLargePages", "-XX:LargePageSizeInBytes=2m", "-Xms512m", "-Xmx512m", "-XX:NewRatio=3", "-XX:SurvivorRatio=4", "-XX:TargetSurvivorRatio=90", "-XX:MaxTenuringThreshold=8", "-XX:+UseConcMarkSweepGC", "-XX:+UseParNewGC", "-XX:ConcGCThreads=4", "-XX:ParallelGCThreads=4", "-XX:+CMSScavengeBeforeRemark", "-XX:PretenureSizeThreshold=64m", "-XX:+UseCMSInitiatingOccupancyOnly", "-XX:CMSInitiatingOccupancyFraction=50", "-XX:CMSMaxAbortablePrecleanTime=6000", "-XX:+CMSParallelRemarkEnabled", "-XX:+ParallelRefProcEnabled",

non-cloud solr.xml:
- transientCacheSize = 30
- shareSchema = true
Also, only 4 cores are POSTed to.

Client / java8 app:
- An AsyncHTTPClient POST-ing gzip payloads.
- PoolingNHttpClientConnectionManager (maxtotal=10,000 and maxperroute=1000)
- ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 (4 secs)

Gzip payloads: About 800 json messages like this:

    [ {id:"abcdefx", datetimestamp:"xx", key1:"xx", key2:"z", }, ]

POST rate: Each of the 4 solr cores receives ~32 payloads per second from the custom java app (plugin handler metrics in solr report the same). Approx ~102,000 docs per sec in total (32 payloads x 800 docs x 4 solr cores).

Document uniqueness: No doc or id is ever repeated or concurrently sent. No atomic updates needed (overwrite=false in AddUpdateCommand was set in the solr handler).

Solrconfig.xml: For the bulk indexing requirement, updatelog and softcommit were minimized / removed:

    none 200 1 6 ${solr.autoCommit.maxTime:1} true ${solr.autoSoftCommit.maxTime:-1} ${solr.autoSoftCommit.maxDocs:-1} false

-M
CLOSE_WAIT connections
Hi list,

I have looked into the CLOSE_WAIT problem and created an issue with a patch to fix this. A search for CLOSE_WAIT shows that there are many Apache projects hit by this problem.

https://issues.apache.org/jira/browse/SOLR-3280

Can someone recheck the patch (it belongs to SnapPuller) and give the OK for release? The patch is against branch_3x (3.6).

Regards,
Bernd