Hey Serj,

So I've applied your patch, deployed the latest build of the solr-prometheus exporter, and left it running for a while. It's had some interesting results.
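For anyone else following the thread, my understanding of the patch (a paraphrase of the approach Alex describes at the bottom of this thread, not the actual diff; the class and method names here are made up) is that it swaps the fixed-size client cache for time-based expiry, roughly:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ClientCacheSketch {

      // Sketch only: "scrapeIntervalSeconds" is illustrative. Clients now
      // expire after sitting idle for two scrape intervals, instead of being
      // evicted (and closed) as soon as a fixed maximum size is exceeded.
      public static Cache<String, HttpSolrClient> buildCache(long scrapeIntervalSeconds) {
        RemovalListener<String, HttpSolrClient> closeOnRemoval = notification -> {
          try {
            // still close clients when they genuinely fall out of the cache,
            // so idle connections are not leaked
            notification.getValue().close();
          } catch (IOException e) {
            // closing a stale scrape client is best-effort
          }
        };
        return CacheBuilder.newBuilder()
            .expireAfterAccess(2 * scrapeIntervalSeconds, TimeUnit.SECONDS)
            .removalListener(closeOnRemoval)
            .build();
        // callers would then look clients up with something like
        // cache.get(baseUrl, () -> new HttpSolrClient.Builder(baseUrl).build())
      }
    }

If that's not quite what the diff does, correct me, but the key difference from the behaviour described at the bottom of the thread is that a client can now only be closed after it has gone unused for a while, rather than while hundreds of them are still being handed out.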
As I mentioned, I split my config up from one monolithic config into separate configs, with an instance running for each one *(they are split up by the groups defined in the /solr/admin/metrics endpoint)*:

- Core
- Node
- JVM
- Jetty

Previously I was getting the connection pool shut down error for every single type; that is no longer the case across the board, so here are the results per group.

*Core*

While I am no longer getting Connection Pool Shutdown, I am now getting errors like:

ERROR - 2019-11-22 16:44:17.522; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/metrics Timeout occured while waiting response from server at: http://127.0.0.1:8085/solr

Because of this, I am now not collecting any core metrics at all. I've yet to figure out where in the code this error is reported, and whether anything can help with it.

*Node*

I did get an NPE that I've never seen before; however, this appeared to be a one-off, and Node is now also reporting zero connection pool shut down issues, which I was getting previously. I'll attach a dump of the Node NPE in the appendix below.

*JVM*

The patch seems to have completely resolved the issue here; I have not seen it since.

*Jetty*

Unfortunately, here I am still getting connection pool shut down errors; again, I will include a dump of the error in the appendix.

So while this hasn't completely fixed the issue, it's definitely a step in the right direction. As I said, for Node and JVM I am no longer getting the error, and I have seen an improvement in the number of metrics being collected. But it looks like there is still some work we can look at and do.
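One thing I plan to try for the Core timeouts is building the scrape clients with more generous SolrJ timeouts, in case slow /admin/metrics responses on our big cores are what's tripping that error. A minimal, untested sketch (the class and method names are made up, and the 60-second values are a first guess, not a recommendation):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ScrapeClientSketch {

      // Hypothetical helper: a scrape client with longer timeouts than the
      // defaults, for nodes whose /admin/metrics responses are slow.
      public static HttpSolrClient lenientClient(String baseUrl) {
        return new HttpSolrClient.Builder(baseUrl)
            .withConnectionTimeout(60_000) // ms allowed to establish the connection
            .withSocketTimeout(60_000)     // ms allowed to wait for the response
            .build();
      }
    }

No idea yet whether the exporter exposes these settings without a code change, so treat this as a starting point for digging rather than a fix.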
--------

*Appendix*

*Node NPE Error*

ERROR - 2019-11-22 16:46:38.594; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/metrics Timeout occured while waiting response from server at: http://127.0.0.1:8083/solr
WARN - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async; Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.NullPointerException
    at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
    ... 4 more

*Jetty Connection Pool Shut down*

WARN - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async; Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Connection pool shut down
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
    at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.http.util.Asserts.check(Asserts.java:34) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
    at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
    ... 4 more

INFO - 2019-11-22 16:41:57.657; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
On Fri, 22 Nov 2019 at 16:42, Alex Jablonski <ajablon...@thoughtworks.com> wrote:

> Hey Richard,
>
> I'd definitely love to hear whether this improves things for you.
> According to Guava's documentation, the cache can start evicting items
> when it gets close to the limit
> (https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
> not just when it reaches it, so if this does end up helping you out, that
> could be the reason. I haven't dug into the implementation of "close to
> the maximum" that Guava's cache uses, or whether that would happen in the
> course of building up the maps to ping all of the nodes (which is where
> the issue for us seemed to be), but it's at least a possible avenue to
> explore further.
>
> In any case, thanks for trying this out!
>
> On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <richa...@brandwatch.com>
> wrote:
>
>> Hi Alex,
>>
>> It makes me really happy to see an email about this. I've been working
>> for a little while on setting up the prometheus exporter for our
>> clusters. I spent a good amount of time setting up the config, and
>> started getting some really decent graphs in grafana for metrics we've
>> never been able to collect before.
>>
>> For our stage environment, this worked like a charm, so I shortly
>> rolled it out to our live environment. This is when I started to get
>> into trouble.
>>
>> I too was getting the exact problem you were facing. I then decided to
>> split out all of my config so that I had one config dedicated to JVM
>> metric collection, one dedicated to node-level metrics, and so on, but
>> I was still getting loads of errors coming through, which confused me.
>>
>> Our clusters are typically 96 nodes, so given your report, I'm not sure
>> how I would be hitting this issue. One theory I had was timeouts
>> happening on the core admin API *(our indexes range between 5gb-20gb in
>> size each)*, and our clusters will typically be around 10s of TB in
>> size. Because of this, when we have any replica state change, we notice
>> significant delays in /solr/admin/cores, sometimes taking a few minutes
>> to return.
>>
>> That's why I think there is a strong connection to the core admin API
>> being a problem here. We have one unique cluster which typically stores
>> 30 days' worth of data within its collections: a new day comes along,
>> we create a collection for that day, and any collections older than 30
>> days get dropped. Documents within this cluster typically don't change
>> either, so there's never really any state change, which makes that
>> cluster significantly more reliable for us, whereas our other main
>> group of clusters goes through a significant amount of change every
>> day.
>>
>> I'm currently applying your patch to our build, and will deploy it and
>> keep you updated on whether it helps. At the moment, I'm also looking
>> at whether there is a way to default indexInfo=false in the core admin
>> API, which could help us here *(using that makes the response time
>> insanely fast as usual; however, it does drop some statistics)*.
>>
>> That said, it's very experimental, and I'm not sure it's the best
>> approach, but you have to start somewhere, right?
>>
>> I'd be keen to look into this issue with you, as it's been a problem
>> for us also.
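>>
>> For clarity, the kind of core admin request I mean is roughly the
>> following *(an untested SolrJ sketch; indexInfo is a documented core
>> admin parameter, but the class name, method name, and surrounding
>> plumbing are just illustrative)*:
>>
>>     import java.io.IOException;
>>
>>     import org.apache.solr.client.solrj.SolrClient;
>>     import org.apache.solr.client.solrj.SolrRequest;
>>     import org.apache.solr.client.solrj.SolrServerException;
>>     import org.apache.solr.client.solrj.request.GenericSolrRequest;
>>     import org.apache.solr.common.params.ModifiableSolrParams;
>>     import org.apache.solr.common.util.NamedList;
>>
>>     public class CoreStatusSketch {
>>
>>       // STATUS without per-core index stats: much faster on big cores,
>>       // at the cost of losing the index-level statistics in the response.
>>       public static NamedList<Object> coreStatusFast(SolrClient client)
>>           throws SolrServerException, IOException {
>>         ModifiableSolrParams params = new ModifiableSolrParams();
>>         params.set("action", "STATUS");
>>         params.set("indexInfo", "false");
>>         return client.request(
>>             new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params));
>>       }
>>     }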
>>
>> I'll reply again with any results I find from applying your patch.
>>
>> Cheers,
>>
>> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <ajablon...@thoughtworks.com>
>> wrote:
>>
>>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>>
>>> Thanks!
>>> Alex Jablonski
>>>
>>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
>>> ajablon...@thoughtworks.com> wrote:
>>>
>>>> Hi there!
>>>>
>>>> My colleague and I have run into an issue that seems to appear when
>>>> running the Solr Prometheus exporter in SolrCloud mode against a large
>>>> (> 100 node) cluster. The symptoms we're observing are "connection
>>>> pool shut down" exceptions in the logs and an inability to collect
>>>> metrics from more than 100 nodes in the cluster.
>>>>
>>>> We think we've traced the issue down to
>>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
>>>> In that class, hostClientCache is a cache of HttpSolrClients
>>>> (currently of fixed size 100) that, on evicting a client from the
>>>> cache, closes the client's connection. The hostClientCache is used in
>>>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>>>
>>>> Given, say, 300 base URLs, createHttpSolrClients will happily add
>>>> those base URLs to the cache, and the "get" method on the cache will
>>>> happily return the new additions. But on adding the 101st
>>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>>>> closed. This repeats itself until the only open clients we have are
>>>> those for base URLs 201 through 300; clients for the first 200 base
>>>> URLs will still be returned, but will already have been closed. When
>>>> we later use the result of createHttpSolrClients to collect metrics,
>>>> expecting valid and open HttpSolrClients, we fail to connect with any
>>>> of the clients that have already been closed, leading to the
>>>> "Connection pool shut down" exception and to no metrics being
>>>> collected from those nodes.
>>>>
>>>> Our idea for a fix was to change the existing cache to, instead of
>>>> having a fixed maximum size, use `expireAfterAccess` with a timeout
>>>> that's a multiple of the scrape interval (twice the scrape interval?).
>>>> We wanted to confirm a few things:
>>>>
>>>> 1. Has this issue been reported before, and if so, is there another
>>>> fix in progress already?
>>>> 2. Does this approach seem desirable?
>>>> 3. If so, are there any opinions on what the cache timeout should be,
>>>> besides just double the scrape interval?
>>>>
>>>> We'll also open a PR shortly with the changes we're proposing and link
>>>> it here. Please let me know if any of the above is unclear or
>>>> incorrect.
>>>>
>>>> Thanks!
>>>> Alex Jablonski
>>
>> --
>>
>> Richard Goodman | Data Infrastructure engineer
>>
>> richa...@brandwatch.com

--

Richard Goodman | Data Infrastructure engineer

richa...@brandwatch.com

NEW YORK | BOSTON | BRIGHTON | LONDON | BERLIN | STUTTGART | PARIS | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>