Sorry, I meant to say: Hey Alex! It's because this time around I did reply-all, and Serj's name was highlighted in my email!
On Fri, 22 Nov 2019 at 17:17, Richard Goodman <richa...@brandwatch.com> wrote:

> Hey Serj,
>
> So I've applied your patch, deployed the latest build of the
> solr-prometheus exporter, and left it running for a little while. It's had
> some interesting results.
>
> As I mentioned, I split my config up from one monolithic config into
> separate configs, and have an instance running for each one *(they are
> split up by the groups defined in the /solr/admin/metrics endpoint)*:
>
> - Core
> - Node
> - JVM
> - Jetty
>
> Previously, I was getting the connection pool shut down error for every
> single type. That is no longer the case.
>
> *Core*
> Whilst I am no longer getting Connection Pool Shutdown, I am now getting
> the likes of:
> ERROR - 2019-11-22 16:44:17.522;
> org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
> /admin/metrics Timeout occured while waiting response from server at:
> http://127.0.0.1:8085/solr
>
> Because of this, I am now not collecting any core metrics at all. I've
> yet to figure out where in the code this error is reported, and whether
> anything can help with this.
>
> *Node*
> I did get an NPE that I've never seen before; however, this appeared to
> be a one-off, and this is now also reporting zero connection pool shut
> down issues, which I was getting previously. I'll attach a dump of the
> Node NPE below as an "appendix".
>
> *JVM*
> The patch seems to resolve the issue completely; I am no longer getting
> it.
>
> *Jetty*
> Unfortunately, here I am still getting connection pool shut down errors;
> again, I will upload a dump of the error in the "appendix".
>
> Whilst this doesn't appear to have completely fixed the issue, it's
> definitely a step in the right direction. As I said, for Node and JVM I am
> no longer getting the error, and have seen an improvement in the number of
> metrics being collected.
> But it looks like there is still some work we can look at and do.
>
> --------
> *Appendix*
>
> *Node NPE Error*
> ERROR - 2019-11-22 16:46:38.594;
> org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
> /admin/metrics Timeout occured while waiting response from server at:
> http://127.0.0.1:8083/solr
> WARN - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
> Caused by: java.lang.NullPointerException
>   at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
>   ... 4 more
>
> *Jetty Connection Pool Shut down*
> WARN - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Connection pool shut down
>   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>   at org.apache.http.util.Asserts.check(Asserts.java:34) ~[httpcore-4.4.10.jar:4.4.10]
>   at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191) ~[httpcore-4.4.10.jar:4.4.10]
>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
>   ... 4 more
> INFO - 2019-11-22 16:41:57.657;
> org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed
> metrics collection
>
> On Fri, 22 Nov 2019 at 16:42, Alex Jablonski <ajablon...@thoughtworks.com>
> wrote:
>
>> Hey Richard,
>>
>> I'd definitely love to hear whether this improves things for you.
>> According to Guava's documentation, the cache can start evicting items
>> when it gets close to the limit
>> (https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
>> not just when it reaches it, so if this does end up helping you out, that
>> could be the reason. I haven't dug into the implementation of "close to
>> the maximum" that Guava's cache uses, or whether that would happen in the
>> course of building up the maps to ping all of the nodes (which is where
>> the issue for us seemed to be), but it's at least a possible avenue to
>> explore further.
>>
>> In any case, thanks for trying this out!
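>>
>> To make the behaviour we've been discussing concrete, here is a minimal,
>> self-contained sketch of the two cache configurations *(the class name
>> and scrape-interval constant are mine for illustration; this is not the
>> exporter's exact code)*:
>>
>> import java.io.IOException;
>> import java.util.concurrent.TimeUnit;
>>
>> import com.google.common.cache.Cache;
>> import com.google.common.cache.CacheBuilder;
>> import com.google.common.cache.RemovalListener;
>>
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>
>> public class CacheEvictionSketch {
>>
>>   // Illustrative value: a hypothetical 30-second scrape interval.
>>   private static final long SCRAPE_INTERVAL_SECONDS = 30;
>>
>>   public static void main(String[] args) throws Exception {
>>     // Mirrors the exporter's removal listener: evicting an entry
>>     // closes its client.
>>     RemovalListener<String, HttpSolrClient> closeOnEvict = notification -> {
>>       try {
>>         notification.getValue().close();
>>       } catch (IOException e) {
>>         // ignored in this sketch
>>       }
>>     };
>>
>>     // Current behaviour: size-bounded cache. With 300 base URLs, adding
>>     // the 101st client evicts (and closes) the 1st, even though the
>>     // caller still holds a reference to the now-closed client.
>>     Cache<String, HttpSolrClient> sizeBounded = CacheBuilder.newBuilder()
>>         .maximumSize(100)
>>         .removalListener(closeOnEvict)
>>         .build();
>>
>>     for (int i = 1; i <= 300; i++) {
>>       String baseUrl = "http://node-" + i + ":8983/solr";
>>       sizeBounded.get(baseUrl, () -> new HttpSolrClient.Builder(baseUrl).build());
>>     }
>>     // At this point, clients for roughly the first 200 URLs have
>>     // already been closed; any later request through them fails with
>>     // "Connection pool shut down".
>>
>>     // Proposed behaviour: time-bounded cache. A client is only closed
>>     // after going unused for two scrape intervals, so every client
>>     // handed out during a scrape stays open for the whole scrape.
>>     Cache<String, HttpSolrClient> timeBounded = CacheBuilder.newBuilder()
>>         .expireAfterAccess(2 * SCRAPE_INTERVAL_SECONDS, TimeUnit.SECONDS)
>>         .removalListener(closeOnEvict)
>>         .build();
>>   }
>> }
>>
>> One caveat: Guava cleans up expired entries lazily, during normal cache
>> reads and writes rather than on a background thread, so an expired client
>> is only actually closed on subsequent cache activity.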
>>
>> On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <richa...@brandwatch.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> It makes me really happy to see an email about this. I've been working
>>> for a little while on setting up the Prometheus exporter for our
>>> clusters. I spent a good amount of time setting up config, and started
>>> getting some really decent graphs in Grafana of metrics we've never
>>> been able to collect before.
>>>
>>> For our stage environment, this worked like a charm, so I shortly
>>> rolled it out to our live environment. This is when I started to get
>>> into trouble.
>>>
>>> I too was getting the exact problem you were facing. I then decided to
>>> split out all of my config so I had one config dedicated to JVM metric
>>> collection, one dedicated to node-level metrics, and so on, but I was
>>> still getting loads of errors coming through, which confused me.
>>>
>>> Our clusters are typically 96 nodes *(just under the 100-client cache
>>> limit)*, so from your report I'm not sure how I would be getting this
>>> issue. One theory I had was the timeouts happening on the core admin
>>> API *(our indexes each range between 5GB and 20GB in size, and our
>>> clusters are typically tens of TB in total)*. Because of this, when we
>>> have any replica state change, we notice significant delays in
>>> /solr/admin/cores, which sometimes takes a few minutes to return.
>>>
>>> Because of this, I think there is a strong connection between the core
>>> admin API and the problem here. The reason is that we have one unique
>>> cluster that typically stores 30 days' worth of data within its
>>> collections: when a new day comes along, we create a collection for
>>> that day, and any collections older than 30 days get dropped. Documents
>>> within this cluster typically don't change either, so there's rarely
>>> any state change, which makes the cluster significantly more reliable
>>> for us, whereas our other main group of clusters goes through a
>>> significant amount of change each day.
>>>
>>> I'm currently applying your patch to our build, and will deploy it and
>>> keep you updated on whether it helps. At the moment, I'm also looking
>>> at whether there is a way to default to indexInfo=false on the core
>>> admin API, which could help us here *(using that makes the response
>>> time insanely fast as per usual; however, it does remove some
>>> statistics)*.
>>>
>>> That is very experimental, though, and I'm not sure it's the best
>>> approach, but you have to start somewhere, right?
>>>
>>> I'd be keen to look into this issue with you, as it's been a problem
>>> for us also.
>>>
>>> I'll reply again with any results I find from applying your patch.
>>>
>>> Cheers,
>>>
>>> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <ajablon...@thoughtworks.com>
>>> wrote:
>>>
>>>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>>>
>>>> Thanks!
>>>> Alex Jablonski
>>>>
>>>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski
>>>> <ajablon...@thoughtworks.com> wrote:
>>>>
>>>>> Hi there!
>>>>>
>>>>> My colleague and I have run into an issue that seems to appear when
>>>>> running the Solr Prometheus exporter in SolrCloud mode against a
>>>>> large (> 100 node) cluster. The symptoms we're observing are
>>>>> "connection pool shut down" exceptions in the logs and the inability
>>>>> to collect metrics from more than 100 nodes in the cluster.
>>>>>
>>>>> We think we've traced down the issue to
>>>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
>>>>> In that class, hostClientCache exists as a cache of HttpSolrClients
>>>>> (currently with a fixed maximum size of 100) that, on evicting a
>>>>> client from the cache, closes the client's connection. The
>>>>> hostClientCache is used in createHttpSolrClients to return a map of
>>>>> base URLs to HttpSolrClients.
>>>>>
>>>>> Given, say, 300 base URLs, createHttpSolrClients will happily add
>>>>> those base URLs to the cache, and the "get" method on the cache will
>>>>> happily return the new additions. But on adding the 101st
>>>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted
>>>>> and closed. This repeats itself until the only open clients we have
>>>>> are for base URLs 201 through 300; clients for the first 200 base
>>>>> URLs will still be returned, but will already have been closed. When
>>>>> we later use the result of createHttpSolrClients to collect metrics,
>>>>> expecting valid and open HttpSolrClients, we fail to connect with any
>>>>> of the clients that have already been closed, leading to the
>>>>> "Connection pool shut down" exception and to not collecting metrics
>>>>> from those nodes.
>>>>>
>>>>> Our idea for a fix was to change the existing cache to, instead of
>>>>> having a fixed maximum size, use `expireAfterAccess` with a timeout
>>>>> that's a multiple of the scrape interval (twice the scrape
>>>>> interval?). We wanted to confirm a few things:
>>>>>
>>>>> 1. Has this issue been reported before, and if so, is there another
>>>>> fix in progress already?
>>>>> 2. Does this approach seem desirable?
>>>>> 3. If so, are there any opinions on what the cache timeout should be,
>>>>> besides just double the scrape interval?
>>>>>
>>>>> We'll also open a PR shortly with the changes we're proposing and
>>>>> link it here. Please let me know if any of the above is unclear or
>>>>> incorrect.
>>>>>
>>>>> Thanks!
>>>>> Alex Jablonski

--

Richard Goodman | Data Infrastructure engineer

richa...@brandwatch.com

NEW YORK | BOSTON | BRIGHTON | LONDON | BERLIN | STUTTGART | PARIS | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>
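P.S. For the indexInfo=false experiment I mentioned in my earlier email
below, this is roughly the probe I'm timing against our nodes, as a minimal
SolrJ sketch *(the base URL is illustrative, and this isn't wired into the
exporter; indexInfo is a standard parameter of the CoreAdmin STATUS
action)*:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class CoreAdminTimingProbe {
  public static void main(String[] args) throws Exception {
    // Illustrative base URL; point this at one of your Solr nodes.
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://127.0.0.1:8983/solr").build()) {

      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "STATUS");
      // Skipping the per-core index stats (size, segment info, etc.) is
      // what appears to make STATUS fast again on nodes with big cores.
      params.set("indexInfo", "false");

      long start = System.nanoTime();
      NamedList<Object> response = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params));
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;

      System.out.println("STATUS with indexInfo=false took " + elapsedMs + "ms");
      System.out.println("Cores reported: " + ((NamedList<?>) response.get("status")).size());
    }
  }
}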