Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-19 Thread mmb1234

FYI. This issue went away after solrconfig.xml was tuned.

"Hard commits blocked | non-solrcloud v6.6.2" thread has the details.
http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html






Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-07 Thread mmb1234
> Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

Looks like it is the issue. (I've redacted IP addresses below for security
reasons.)

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
   8425 CLOSE_WAIT -
     92 ESTABLISHED -
      1 FIN_WAIT2 -
      1 Foreign Address
      6 LISTEN -
    712 TIME_WAIT -
      1 established)

solr [ /opt/solr ]$ echo "run -b
org.eclipse.jetty.server:context=HTTP/1.1@63e2203c,id=0,type=serverconnector
dump " | java
 -jar jmxterm-1.0.0-uber.jar -l localhost:18983 -v silent -n > jettyJmx.out

solr [ /opt/solr ]$ netstat -anp | grep CLOSE_WAIT | head -n1
tcp        1      0 10.xxx.x.xx:8983        10.xxx.x.xxx:53873      CLOSE_WAIT  1/java

solr [ /opt/solr ]$ grep "10.xxx.x.xxx:53873" jettyJmx.out
 |   |   +-
SelectionKey@5ee4ef9f{i=0}->SelectChannelEndPoint@69feb476{/10.xxx.x.xxx:53873<->8983,Open,in,out,-,-,1712/5000,HttpConnection@c93d7fa}{io=0/0,kio=0,kro=1}

solr [ /opt/solr ]$ cat jettyJmx.out | grep 8983 | grep SelectChannelEndPoint | grep "{io=0/0,kio=0,kro=1}" | wc -l
8220







Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-05 Thread Shawn Heisey
On 2/5/2018 12:29 PM, mmb1234 wrote:
> Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

Could be.  That was one of the Jetty issues I found, and the one that
said 9.4 doesn't have the problem.  Your Solr version would be using
Jetty 9.3, not 9.2.

> The custom client code is:

I don't know a lot about all the code you shared.  But this caught my eye:

> idleConnectionFuture = executor.scheduleAtFixedRate(() -> {
> cmAsync.closeExpiredConnections();
> cmAsync.closeIdleConnections(1, TimeUnit.SECONDS);
> }, 1, 1, TimeUnit.SECONDS);

You're really going to close connections that have been idle for only
one second?  A lot of Solr requests are going to need more than one
second to complete.  As I said in the earlier message, I very much
recommend a socket/idle timeout of at least one minute.  Make it 30
seconds if one minute seems too long.
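
(For illustration only, a minimal sketch of that cleanup task with a more
forgiving idle threshold, assuming the same Apache HttpAsyncClient 4.x
connection manager and executor as in the quoted code; the helper class
name is made up.)

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;

// Hypothetical helper: same cleanup idea as the quoted snippet, but idle
// connections survive for 60 seconds instead of 1, and the sweep runs
// every 10 seconds instead of every second.
class IdleConnectionReaper {
    static ScheduledFuture<?> schedule(ScheduledExecutorService executor,
                                       PoolingNHttpClientConnectionManager cmAsync) {
        return executor.scheduleAtFixedRate(() -> {
            cmAsync.closeExpiredConnections();
            cmAsync.closeIdleConnections(60, TimeUnit.SECONDS); // was 1 second
        }, 10, 10, TimeUnit.SECONDS);
    }
}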

> Also /solr/admin/cores takes a very long time to respond (QTime = 300secs+). 
> So curl with a timeout also causes an additional CLOSE_WAIT (as expected)
> curl -m 5 'http://:/solr/admin/cores'

It literally takes five minutes for /solr/admin/cores to respond on your
server? That means something is REALLY REALLY wrong.  When I hit it on
one of my servers, which has a few dozen cores, the QTime is 39
milliseconds, and I see the response in the browser very quickly.

> Note there are 100's of solr cores on each solr jvm but only a few need to
> be open at any given time in each solr jvm to avoid heap memory bloat.
> solr.xml therefore has this setting because we're not running solr cloud.
>   <int name="transientCacheSize">${transientCacheSize:30}</int>

Loading and unloading large cores on a frequent basis is going to make
operations pretty slow.  If your index data is too large to have loaded
all at once, perhaps you need more memory or more servers.

> Also <updateLog> was removed from solrconfig.xml because we saw 1000's of
> threads BLOCKED on VersionBucket even with
>   <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:655360}</int>
>
> The side effect is lots of merges [ we'll tackle that when solr stops
> dying :-) ]

The presence, absence, or configuration of Solr's transaction log is not
going to affect segment merging in any way.  Segment merging is a
function of Lucene, not Solr, and Lucene has absolutely no knowledge of
Solr features like the transaction log.  I would strongly recommend that
you do NOT disable the transaction log.  Doing so might speed things up
a very small amount, but that speedup is NOT a good tradeoff,
considering that you'd be more likely to lose data in the event of a
serious problem.

Lots of segment merges is a result of doing a lot of indexing.

Thanks,
Shawn



Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-05 Thread mmb1234
Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

I have noticed that when the number of http requests/sec is increased, CLOSE_WAITs
increase linearly until solr stops accepting socket connections. Netstat output is:

$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
   9784 CLOSE_WAIT 1/java
      1 ESTABLISHED -
      3 ESTABLISHED 1/java
      1 Foreign Address
      2 LISTEN -
      4 LISTEN 1/java
     27 TIME_WAIT -
      1 established)


The custom client code is:

RequestConfig config = RequestConfig.custom()
    .setConnectionRequestTimeout(4000)
    .setConnectTimeout(4000)
    .setSocketTimeout(4000)
    .build();

ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor();
PoolingNHttpClientConnectionManager cmAsync =
    new PoolingNHttpClientConnectionManager(ioReactor);
cmAsync.setMaxTotal(10000);           // maxtotal=10,000 as described earlier in the thread
cmAsync.setDefaultMaxPerRoute(1000);

asynClient = HttpAsyncClients.custom()
    .setDefaultRequestConfig(config)
    .setConnectionManager(cmAsync)
    .build();
asynClient.start();

executor = Executors.newScheduledThreadPool(1);
idleConnectionFuture = executor.scheduleAtFixedRate(() -> {
    cmAsync.closeExpiredConnections();
    cmAsync.closeIdleConnections(1, TimeUnit.SECONDS);
}, 1, 1, TimeUnit.SECONDS);


Also /solr/admin/cores takes a very long time to respond (QTime = 300secs+). 
So curl with a timeout also causes an additional CLOSE_WAIT (as expected)
curl -m 5 'http://:/solr/admin/cores' 

The custom handler QTime is ~150ms or lower, even under max load, for the
active cores.
 
Note there are 100's of solr cores on each solr jvm but only a few need to
be open at any given time in each solr jvm to avoid heap memory bloat.
solr.xml therefore has this setting because we're not running solr cloud.
  <int name="transientCacheSize">${transientCacheSize:30}</int>

Also <updateLog> was removed from solrconfig.xml because we saw 1000's of
threads BLOCKED on VersionBucket even with
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:655360}</int>

The side effect is lots of merges [ we'll tackle that when solr stops
dying :-) ]





Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-04 Thread Shawn Heisey

On 2/2/2018 10:00 AM, mmb1234 wrote:

Client / java8 app:
An AsyncHTTPClient POST-ing gzip payloads.
PoolingNHttpClientConnectionManager (maxtotal=10,000 and maxperroute=1000)
ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 (4 secs)


I have to concur with the person who identifies themselves as "S G", 
later in the thread.


You should absolutely be using SolrJ, not a java http client.  SolrJ 
handles parsing of the response for you, providing you with java objects 
that are REALLY easy to extract information from.  It also handles all 
of the HTTP details for you, including URL encoding, etc.


It's easy enough to provide a custom HttpClient object (HttpClient is
from another Apache project; it is used by Solr/SolrJ) to the
CloudSolrClient or HttpSolrClient object.  That lets you change the
timeouts and the limits on max concurrent connections.
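
(A minimal sketch of that wiring, for illustration; it assumes SolrJ's
HttpSolrClient.Builder and Apache HttpClient 4.x. The base URL, timeouts,
and pool limits are placeholder values, not taken from this thread.)

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

class SolrClientFactory {
    // Sketch: build an HttpClient with explicit timeouts and pool limits,
    // then hand it to SolrJ so every Solr request goes through it.
    static HttpSolrClient build() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(4000)     // 4s to establish the TCP connection
                .setSocketTimeout(60000)     // 60s between bytes, per the advice below
                .build();

        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .setMaxConnTotal(10000)
                .setMaxConnPerRoute(1000)
                .build();

        return new HttpSolrClient.Builder("http://localhost:8983/solr/mycore")
                .withHttpClient(httpClient)
                .build();
    }
}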


I just noticed that you said the socket timeout is 4 seconds.  I didn't 
notice this when I read your message the first time.  While Solr is 
known for blazing speed, 4 seconds is WAY too short for this timeout. 
If a client resets the TCP connection because of a socket timeout, Jetty 
(which Solr is probably running in) is going to report an EofException 
-- specifically, org.eclipse.jetty.io.EofException.


Some of the bug reports I saw on Jetty specifically mentioned 
EofException in connection with a high CLOSE_WAIT count.


Your socket timeout should be at least one minute.  Four seconds would 
be fine for connection timeouts.  In my own software configurations for 
various client software, I have been known to use one second for the 
connection timeout -- on a LAN, if it doesn't connect in far less than 
one second, something is probably wrong somewhere.
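
(A hedged sketch of that timeout split, expressed as an Apache HttpClient 4.x
RequestConfig; the holder class is made up and the numbers simply echo the
recommendation above.)

import org.apache.http.client.config.RequestConfig;

final class Timeouts {
    // Sketch: short connect timeout (fail fast on a LAN), long socket timeout
    // (give Solr requests time to finish instead of resetting the connection).
    static final RequestConfig RECOMMENDED = RequestConfig.custom()
            .setConnectionRequestTimeout(4000) // wait up to 4s for a pooled connection
            .setConnectTimeout(1000)           // 1s is plenty on a LAN
            .setSocketTimeout(60000)           // at least one minute, per the advice above
            .build();
}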


I'm curious about the mention of gzip.  The Jetty that Solr ships with 
doesn't have gzip compression enabled for HTTP, and I'm not aware of 
anything in Solr that handles gzip files.  Maybe the Extracting Request 
Handler does, but if you've been following recommendations that are 
common on this list, you won't be using ERH except as a proof of concept.


Thanks,
Shawn


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-03 Thread S G
Hi Arcadius,

Most of the clients use Solrj to interact with Solr.
Does it not automatically handle the connection pools?

SG


On Fri, Feb 2, 2018 at 4:47 PM, Arcadius Ahouansou wrote:

> I have seen a lot of CLOSE_WAIT in the past.
> In many cases, it was that the client application was not releasing/closing
> or pooling connections properly.
>
> I would suggest you double check the client code first.
>
> Arcadius.
>
> On 2 February 2018 at 23:52, mmb1234 wrote:
>
> > > You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> > > but the JVM argument list includes "-Xmx512m" which is a 512MB heap
> >
> > My typos. They're 6.6.2 and -Xmx30g respectively.
> >
> > > many open connections causes is a large number of open file handles,
> >
> > solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
> > vm.max_map_count = 262144
> >
> > The only thing I notice right before the solr shutdown messages in solr.log is
> > that the /update QTime goes from ~500ms to ~25.
> >
> > There is an automated health check that issues a kill on the  due
> > to http connection timeout.
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Applied Knowledge Is Power
> Office : +441444702101
> Mobile: +447908761999
> Menelic Ltd: menelic.com
> Visitor Management System: menelicvisitor.com
> ---
>


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread Arcadius Ahouansou
I have seen a lot of CLOSE_WAIT in the past.
In many cases, it was that the client application was not releasing/closing
or pooling connections properly.

I would suggest you double check the client code first.

Arcadius.

On 2 February 2018 at 23:52, mmb1234 wrote:

> > You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> > but the JVM argument list includes "-Xmx512m" which is a 512MB heap
>
> My typos. They're 6.6.2 and -Xmx30g respectively.
>
> > many open connections causes is a large number of open file handles,
>
> solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
> vm.max_map_count = 262144
>
> The only thing I notice right before the solr shutdown messages in solr.log is
> that the /update QTime goes from ~500ms to ~25.
>
> There is an automated health check that issues a kill on the  due
> to http connection timeout.
>
>
>
>
>
>



-- 
Arcadius Ahouansou
Menelic Ltd | Applied Knowledge Is Power
Office : +441444702101
Mobile: +447908761999
Menelic Ltd: menelic.com
Visitor Management System: menelicvisitor.com
---


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread mmb1234
> You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> but the JVM argument list includes "-Xmx512m" which is a 512MB heap

My typos. They're 6.6.2 and -Xmx30g respectively.

> many open connections causes is a large number of open file handles,

solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
vm.max_map_count = 262144

The only thing I notice right before the solr shutdown messages in solr.log is
that the /update QTime goes from ~500ms to ~25.

There is an automated health check that issues a kill on the  due
to http connection timeout. 







Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread Shawn Heisey

On 2/2/2018 10:00 AM, mmb1234 wrote:

In our solr non-cloud env., we are seeing lots of CLOSE_WAIT, causing the jvm to
stop "working" within 3 mins of solr start.

solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep
10.xxx.xxx.xxx | wc -l
9453


Solr isn't handling network services.  That's being done by the servlet 
container.  The servlet container included with Solr is Jetty.  Since 
version 5.0, Jetty is the only option with official support.


There are a number of issues against Jetty where people are seeing many 
CLOSE_WAIT problems.  At least one of them does acknowledge that there 
is a bug in Jetty, but in discussion with them on IRC, they have 
apparently never been able to actually reproduce the problem.  One issue 
says that they don't think Jetty 9.4.x will have the problem, but Solr 
won't have Jetty 9.4.x until Solr 7.3.0 is released.


You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
There are 6.4.2 and 6.6.2.  Solr 6.6.x and 6.4.x include Jetty 9.3.14;
6.2.x includes Jetty 9.3.8.


Another oddity: Your info says that Solr has a 30GB heap, but the JVM 
argument list includes "-Xmx512m" which is a 512MB heap.


One issue that many open connections cause is a large number of open 
file handles, and that can apparently cause software to become 
unresponsive.  Are you seeing any errors in Solr's logfile before the 
process becomes unresponsive?


Thanks,
Shawn



9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread mmb1234
Hello,

In our solr non-cloud env., we are seeing lots of CLOSE_WAIT, causing the jvm to
stop "working" within 3 mins of solr start.

solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep
10.xxx.xxx.xxx | wc -l
9453

The only option then is `kill -9` because even `jcmd  Thread.print` is
unable to connect to the jvm. The problem can be reproduced at will.

Any suggestions on what could be causing this, or a fix?

Details of the system are as follows; it has been set up for "bulk indexing".

-

Solr / server:
v6.2.2 non-solrcloud in a docker with kubernetes
java: 1.8.0_151 25.151-b12 HotSpot 64bit | Oracle
jvm: heap 30GB
os: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64
GNU/Linux
os memory: 230GB | no swap configured
os cpu: 32vCPU
jvm:  "-XX:+UseLargePages",
"-XX:LargePageSizeInBytes=2m",
"-Xms512m",
"-Xmx512m",
"-XX:NewRatio=3",
"-XX:SurvivorRatio=4",
"-XX:TargetSurvivorRatio=90",
"-XX:MaxTenuringThreshold=8",
"-XX:+UseConcMarkSweepGC",
"-XX:+UseParNewGC",
"-XX:ConcGCThreads=4",
"-XX:ParallelGCThreads=4",
"-XX:+CMSScavengeBeforeRemark",
"-XX:PretenureSizeThreshold=64m",
"-XX:+UseCMSInitiatingOccupancyOnly",
"-XX:CMSInitiatingOccupancyFraction=50",
"-XX:CMSMaxAbortablePrecleanTime=6000",
"-XX:+CMSParallelRemarkEnabled",
"-XX:+ParallelRefProcEnabled",

non-cloud solr.xml:
transientCacheSize = 30
shareSchema = true
Also only 4 cores are POSTed to.

Client / java8 app:
An AsyncHTTPClient POST-ing gzip payloads.
PoolingNHttpClientConnectionManager (maxtotal=10,000 and maxperroute=1000)
ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 (4 secs)

Gzip payloads:
About 800 json messages like this:
[
  {id:"abcdefx", datetimestamp:"xx", key1:"xx", key2:"z", ...},
  ...
]

POST rate:
Each of the 4 solr cores receives ~32 payloads per second from the custom java
app (plugin handler metrics in solr report the same).
Approx ~102,000 docs per sec in total (32 payloads x 800 docs x 4 solr cores)

Document uniqueness:
No doc or id is ever repeated or concurrently sent.
No atomic updates needed (overwrite=false in AddUpdateCommand was set in
solr handler)

solrconfig.xml:
For the bulk indexing requirement, updateLog and softCommit were minimized /
removed.
  
none
200

  1
  6

  
  

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:1}</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    <maxDocs>${solr.autoSoftCommit.maxDocs:-1}</maxDocs>
    <openSearcher>false</openSearcher>
  </autoSoftCommit>

  

-M





CLOSE_WAIT connections

2012-03-27 Thread Bernd Fehling

Hi list,

I have looked into the CLOSE_WAIT problem and created an issue with a patch to 
fix this.
A search for CLOSE_WAIT shows that there are many Apache projects hit by this 
problem.

https://issues.apache.org/jira/browse/SOLR-3280

Can someone recheck the patch (it belongs to SnapPuller) and give the OK for 
release?
The patch is against branch_3x (3.6).


Regards
Bernd