Hey Serj,

So I've applied your patch, deployed the latest build of the solr-prometheus exporter, and left it running for a while. It's had some interesting results.
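For anyone else following the thread, my understanding of the patch (a paraphrase of the approach Alex describes at the bottom of this thread, not the actual diff; the class and method names here are made up) is that it swaps the fixed-size client cache for time-based expiry, roughly:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ClientCacheSketch {

      // Sketch only: "scrapeIntervalSeconds" is illustrative. Clients now
      // expire after sitting idle for two scrape intervals, instead of being
      // evicted (and closed) as soon as a fixed maximum size is exceeded.
      public static Cache<String, HttpSolrClient> buildCache(long scrapeIntervalSeconds) {
        RemovalListener<String, HttpSolrClient> closeOnRemoval = notification -> {
          try {
            // still close clients when they genuinely fall out of the cache,
            // so idle connections are not leaked
            notification.getValue().close();
          } catch (IOException e) {
            // closing a stale scrape client is best-effort
          }
        };
        return CacheBuilder.newBuilder()
            .expireAfterAccess(2 * scrapeIntervalSeconds, TimeUnit.SECONDS)
            .removalListener(closeOnRemoval)
            .build();
        // callers would then look clients up with something like
        // cache.get(baseUrl, () -> new HttpSolrClient.Builder(baseUrl).build())
      }
    }

If that's not quite what the diff does, correct me, but the key difference from the behaviour described at the bottom of the thread is that a client can now only be closed after it has gone unused for a while, rather than while hundreds of them are still being handed out.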
As I mentioned, I split my config up from one monolithic config into separate configs, with an instance running for each one *(they are split up by the groups defined in the /solr/admin/metrics endpoint)*:

- Core
- Node
- JVM
- Jetty

Previously I was getting the connection pool shut down error for every single type; that is no longer the case across the board, so here are the results per group.

*Core*

While I am no longer getting Connection Pool Shutdown, I am now getting errors like:

ERROR - 2019-11-22 16:44:17.522; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/metrics Timeout occured while waiting response from server at: http://127.0.0.1:8085/solr

Because of this, I am now not collecting any core metrics at all. I've yet to figure out where in the code this error is reported, and whether anything can help with it.

*Node*

I did get an NPE that I've never seen before; however, this appeared to be a one-off, and Node is now also reporting zero connection pool shut down issues, which I was getting previously. I'll attach a dump of the Node NPE in the appendix below.

*JVM*

The patch seems to have completely resolved the issue here; I have not seen it since.

*Jetty*

Unfortunately, here I am still getting connection pool shut down errors; again, I will include a dump of the error in the appendix.

So while this hasn't completely fixed the issue, it's definitely a step in the right direction. As I said, for Node and JVM I am no longer getting the error, and I have seen an improvement in the number of metrics being collected. But it looks like there is still some work we can look at and do.
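One thing I plan to try for the Core timeouts is building the scrape clients with more generous SolrJ timeouts, in case slow /admin/metrics responses on our big cores are what's tripping that error. A minimal, untested sketch (the class and method names are made up, and the 60-second values are a first guess, not a recommendation):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ScrapeClientSketch {

      // Hypothetical helper: a scrape client with longer timeouts than the
      // defaults, for nodes whose /admin/metrics responses are slow.
      public static HttpSolrClient lenientClient(String baseUrl) {
        return new HttpSolrClient.Builder(baseUrl)
            .withConnectionTimeout(60_000) // ms allowed to establish the connection
            .withSocketTimeout(60_000)     // ms allowed to wait for the response
            .build();
      }
    }

No idea yet whether the exporter exposes these settings without a code change, so treat this as a starting point for digging rather than a fix.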
--------

*Appendix*

*Node NPE Error*

ERROR - 2019-11-22 16:46:38.594; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/metrics Timeout occured while waiting response from server at: http://127.0.0.1:8083/solr
WARN - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async; Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.NullPointerException
    at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
    ... 4 more

*Jetty Connection Pool Shut down*

WARN - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async; Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Connection pool shut down
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.http.util.Asserts.check(Asserts.java:34) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
    ... 4 more

INFO - 2019-11-22 16:41:57.657; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
On Fri, 22 Nov 2019 at 16:42, Alex Jablonski <ajablon...@thoughtworks.com> wrote:

> Hey Richard,
>
> I'd definitely love to hear whether this improves things for you.
> According to Guava's documentation, the cache can start evicting items
> when it gets close to the limit
> (https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
> not just when it reaches it, so if this does end up helping you out, that
> could be the reason. I haven't dug into the implementation of "close to
> the maximum" that Guava's cache uses, or whether that would happen in the
> course of building up the maps to ping all of the nodes (which is where
> the issue for us seemed to be), but it's at least a possible avenue to
> explore further.
>
> In any case, thanks for trying this out!
>
> On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <richa...@brandwatch.com>
> wrote:
>
>> Hi Alex,
>>
>> It makes me really happy to see an email about this. I've been working
>> for a little while on setting up the prometheus exporter for our
>> clusters. I spent a good amount of time setting up the config, and
>> started getting some really decent graphs in grafana for metrics we've
>> never been able to collect before.
>>
>> For our stage environment, this worked like a charm, so I shortly
>> rolled it out to our live environment. This is when I started to get
>> into trouble.
>>
>> I too was getting the exact problem you were facing. I then decided to
>> split out all of my config so that I had one config dedicated to JVM
>> metric collection, one dedicated to node-level metrics, and so on, but
>> I was still getting loads of errors coming through, which confused me.
>>
>> Our clusters are typically 96 nodes, so given your report, I'm not sure
>> how I would be hitting this issue. One theory I had was timeouts
>> happening on the core admin API *(our indexes range between 5gb-20gb in
>> size each)*, and our clusters will typically be around 10s of TB in
>> size. Because of this, when we have any replica state change, we notice
>> significant delays in /solr/admin/cores, sometimes taking a few minutes
>> to return.
>>
>> That's why I think there is a strong connection to the core admin API
>> being a problem here. We have one unique cluster which typically stores
>> 30 days' worth of data within its collections: a new day comes along,
>> we create a collection for that day, and any collections older than 30
>> days get dropped. Documents within this cluster typically don't change
>> either, so there's never really any state change, which makes that
>> cluster significantly more reliable for us, whereas our other main
>> group of clusters goes through a significant amount of change every
>> day.
>>
>> I'm currently applying your patch to our build, and will deploy it and
>> keep you updated on whether it helps. At the moment, I'm also looking
>> at whether there is a way to default indexInfo=false in the core admin
>> API, which could help us here *(using that makes the response time
>> insanely fast as usual; however, it does drop some statistics)*.
>>
>> That said, it's very experimental, and I'm not sure it's the best
>> approach, but you have to start somewhere, right?
>>
>> I'd be keen to look into this issue with you, as it's been a problem
>> for us also.
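>>
>> For clarity, the kind of core admin request I mean is roughly the
>> following *(an untested SolrJ sketch; indexInfo is a documented core
>> admin parameter, but the class name, method name, and surrounding
>> plumbing are just illustrative)*:
>>
>>     import java.io.IOException;
>>
>>     import org.apache.solr.client.solrj.SolrClient;
>>     import org.apache.solr.client.solrj.SolrRequest;
>>     import org.apache.solr.client.solrj.SolrServerException;
>>     import org.apache.solr.client.solrj.request.GenericSolrRequest;
>>     import org.apache.solr.common.params.ModifiableSolrParams;
>>     import org.apache.solr.common.util.NamedList;
>>
>>     public class CoreStatusSketch {
>>
>>       // STATUS without per-core index stats: much faster on big cores,
>>       // at the cost of losing the index-level statistics in the response.
>>       public static NamedList<Object> coreStatusFast(SolrClient client)
>>           throws SolrServerException, IOException {
>>         ModifiableSolrParams params = new ModifiableSolrParams();
>>         params.set("action", "STATUS");
>>         params.set("indexInfo", "false");
>>         return client.request(
>>             new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params));
>>       }
>>     }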
>>
>> I'll reply again with any results I find from applying your patch.
>>
>> Cheers,
>>
>> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <ajablon...@thoughtworks.com>
>> wrote:
>>
>>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>>
>>> Thanks!
>>> Alex Jablonski
>>>
>>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
>>> ajablon...@thoughtworks.com> wrote:
>>>
>>>> Hi there!
>>>>
>>>> My colleague and I have run into an issue that seems to appear when
>>>> running the Solr Prometheus exporter in SolrCloud mode against a large
>>>> (> 100 node) cluster. The symptoms we're observing are "connection
>>>> pool shut down" exceptions in the logs and an inability to collect
>>>> metrics from more than 100 nodes in the cluster.
>>>>
>>>> We think we've traced the issue down to
>>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
>>>> In that class, hostClientCache is a cache of HttpSolrClients
>>>> (currently of fixed size 100) that, on evicting a client from the
>>>> cache, closes the client's connection. The hostClientCache is used in
>>>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>>>
>>>> Given, say, 300 base URLs, createHttpSolrClients will happily add
>>>> those base URLs to the cache, and the "get" method on the cache will
>>>> happily return the new additions. But on adding the 101st
>>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>>>> closed. This repeats itself until the only open clients we have are
>>>> those for base URLs 201 through 300; clients for the first 200 base
>>>> URLs will still be returned, but will already have been closed. When
>>>> we later use the result of createHttpSolrClients to collect metrics,
>>>> expecting valid and open HttpSolrClients, we fail to connect with any
>>>> of the clients that have already been closed, leading to the
>>>> "Connection pool shut down" exception and to no metrics being
>>>> collected from those nodes.
>>>>
>>>> Our idea for a fix was to change the existing cache to, instead of
>>>> having a fixed maximum size, use `expireAfterAccess` with a timeout
>>>> that's a multiple of the scrape interval (twice the scrape interval?).
>>>> We wanted to confirm a few things:
>>>>
>>>> 1. Has this issue been reported before, and if so, is there another
>>>> fix in progress already?
>>>> 2. Does this approach seem desirable?
>>>> 3. If so, are there any opinions on what the cache timeout should be,
>>>> besides just double the scrape interval?
>>>>
>>>> We'll also open a PR shortly with the changes we're proposing and link
>>>> it here. Please let me know if any of the above is unclear or
>>>> incorrect.
>>>>
>>>> Thanks!
>>>> Alex Jablonski
>>
>> --
>>
>> Richard Goodman | Data Infrastructure engineer
>>
>> richa...@brandwatch.com

--

Richard Goodman | Data Infrastructure engineer

richa...@brandwatch.com

NEW YORK | BOSTON | BRIGHTON | LONDON | BERLIN | STUTTGART | PARIS | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>