Sorry, I meant to say: Hey Alex! It's because this time around I did reply-all, and Serj's name was highlighted in my email!
On Fri, 22 Nov 2019 at 17:17, Richard Goodman <richa...@brandwatch.com> wrote:

> Hey Serj,
>
> So I've applied your patch, deployed the latest build of the
> solr-prometheus exporter, and left it running for a little while. It's had
> some interesting results.
>
> As I mentioned, I split my config up from one monolithic config into
> separate configs, and have an instance running for each one *(they are
> split up by the groups defined in the /solr/admin/metrics endpoint)*:
>
> - Core
> - Node
> - JVM
> - Jetty
>
> Previously, I was getting the connection pool shut down error for every
> single type. That is no longer the case.
>
> *Core*
> Whilst I am no longer getting Connection Pool Shutdown, I am now getting
> the likes of:
> ERROR - 2019-11-22 16:44:17.522;
> org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
> /admin/metrics Timeout occured while waiting response from server at:
> http://127.0.0.1:8085/solr
>
> Because of this, I am now not collecting any core metrics at all. I've
> yet to figure out where in the code this error is reported, and whether
> anything can help with this.
>
> *Node*
> I did get an NPE that I've never seen before; however, this appeared to
> be a one-off, and this is now also reporting zero connection pool shut
> down issues, which I was getting previously. I'll attach a dump of the
> Node NPE below as an "appendix".
>
> *JVM*
> The patch seems to resolve the issue completely; I am no longer getting
> it.
>
> *Jetty*
> Unfortunately, here I am still getting connection pool shut down errors;
> again, I will upload a dump of the error in the "appendix".
>
> Whilst this doesn't appear to have completely fixed the issue, it's
> definitely a step in the right direction. As I said, for Node and JVM I am
> no longer getting the error, and have seen an improvement in the number of
> metrics being collected.
> But it looks like there is still some work we can look at and do.
>
> --------
> *Appendix*
>
> *Node NPE Error*
> ERROR - 2019-11-22 16:46:38.594;
> org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
> /admin/metrics Timeout occured while waiting response from server at:
> http://127.0.0.1:8083/solr
> WARN - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
> Caused by: java.lang.NullPointerException
>   at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
>   ... 4 more
>
> *Jetty Connection Pool Shut down*
> WARN - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Connection pool shut down
>   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
>   at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>   at org.apache.http.util.Asserts.check(Asserts.java:34) ~[httpcore-4.4.10.jar:4.4.10]
>   at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191) ~[httpcore-4.4.10.jar:4.4.10]
>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>   at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
>   ... 4 more
> INFO - 2019-11-22 16:41:57.657;
> org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed
> metrics collection
>
> On Fri, 22 Nov 2019 at 16:42, Alex Jablonski <ajablon...@thoughtworks.com>
> wrote:
>
>> Hey Richard,
>>
>> I'd definitely love to hear whether this improves things for you.
>> According to Guava's documentation, the cache can start evicting items
>> when it gets close to the limit
>> (https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
>> not just when it reaches it, so if this does end up helping you out, that
>> could be the reason. I haven't dug into the implementation of "close to
>> the maximum" that Guava's cache uses, or whether that would happen in the
>> course of building up the maps to ping all of the nodes (which is where
>> the issue for us seemed to be), but it's at least a possible avenue to
>> explore further.
>>
>> In any case, thanks for trying this out!
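>>
>> To make the behaviour we've been discussing concrete, here is a minimal,
>> self-contained sketch of the two cache configurations *(the class name
>> and scrape-interval constant are mine for illustration; this is not the
>> exporter's exact code)*:
>>
>> import java.io.IOException;
>> import java.util.concurrent.TimeUnit;
>>
>> import com.google.common.cache.Cache;
>> import com.google.common.cache.CacheBuilder;
>> import com.google.common.cache.RemovalListener;
>>
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>
>> public class CacheEvictionSketch {
>>
>>   // Illustrative value: a hypothetical 30-second scrape interval.
>>   private static final long SCRAPE_INTERVAL_SECONDS = 30;
>>
>>   public static void main(String[] args) throws Exception {
>>     // Mirrors the exporter's removal listener: evicting an entry
>>     // closes its client.
>>     RemovalListener<String, HttpSolrClient> closeOnEvict = notification -> {
>>       try {
>>         notification.getValue().close();
>>       } catch (IOException e) {
>>         // ignored in this sketch
>>       }
>>     };
>>
>>     // Current behaviour: size-bounded cache. With 300 base URLs, adding
>>     // the 101st client evicts (and closes) the 1st, even though the
>>     // caller still holds a reference to the now-closed client.
>>     Cache<String, HttpSolrClient> sizeBounded = CacheBuilder.newBuilder()
>>         .maximumSize(100)
>>         .removalListener(closeOnEvict)
>>         .build();
>>
>>     for (int i = 1; i <= 300; i++) {
>>       String baseUrl = "http://node-" + i + ":8983/solr";
>>       sizeBounded.get(baseUrl, () -> new HttpSolrClient.Builder(baseUrl).build());
>>     }
>>     // At this point, clients for roughly the first 200 URLs have
>>     // already been closed; any later request through them fails with
>>     // "Connection pool shut down".
>>
>>     // Proposed behaviour: time-bounded cache. A client is only closed
>>     // after going unused for two scrape intervals, so every client
>>     // handed out during a scrape stays open for the whole scrape.
>>     Cache<String, HttpSolrClient> timeBounded = CacheBuilder.newBuilder()
>>         .expireAfterAccess(2 * SCRAPE_INTERVAL_SECONDS, TimeUnit.SECONDS)
>>         .removalListener(closeOnEvict)
>>         .build();
>>   }
>> }
>>
>> One caveat: Guava cleans up expired entries lazily, during normal cache
>> reads and writes rather than on a background thread, so an expired client
>> is only actually closed on subsequent cache activity.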
>>
>> On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <richa...@brandwatch.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> It makes me really happy to see an email about this. I've been working
>>> for a little while on setting up the Prometheus exporter for our
>>> clusters. I spent a good amount of time setting up config, and started
>>> getting some really decent graphs in Grafana of metrics we've never
>>> been able to collect before.
>>>
>>> For our stage environment, this worked like a charm, so I shortly
>>> rolled it out to our live environment. This is when I started to get
>>> into trouble.
>>>
>>> I too was getting the exact problem you were facing. I then decided to
>>> split out all of my config so I had one config dedicated to JVM metric
>>> collection, one dedicated to node-level metrics, and so on, but I was
>>> still getting loads of errors coming through, which confused me.
>>>
>>> Our clusters are typically 96 nodes *(just under the 100-client cache
>>> limit)*, so from your report I'm not sure how I would be getting this
>>> issue. One theory I had was the timeouts happening on the core admin
>>> API *(our indexes each range between 5GB and 20GB in size, and our
>>> clusters are typically tens of TB in total)*. Because of this, when we
>>> have any replica state change, we notice significant delays in
>>> /solr/admin/cores, which sometimes takes a few minutes to return.
>>>
>>> Because of this, I think there is a strong connection between the core
>>> admin API and the problem here. The reason is that we have one unique
>>> cluster that typically stores 30 days' worth of data within its
>>> collections: when a new day comes along, we create a collection for
>>> that day, and any collections older than 30 days get dropped. Documents
>>> within this cluster typically don't change either, so there's rarely
>>> any state change, which makes the cluster significantly more reliable
>>> for us, whereas our other main group of clusters goes through a
>>> significant amount of change each day.
>>>
>>> I'm currently applying your patch to our build, and will deploy it and
>>> keep you updated on whether it helps. At the moment, I'm also looking
>>> at whether there is a way to default to indexInfo=false on the core
>>> admin API, which could help us here *(using that makes the response
>>> time insanely fast as per usual; however, it does remove some
>>> statistics)*.
>>>
>>> That is very experimental, though, and I'm not sure it's the best
>>> approach, but you have to start somewhere, right?
>>>
>>> I'd be keen to look into this issue with you, as it's been a problem
>>> for us also.
>>>
>>> I'll reply again with any results I find from applying your patch.
>>>
>>> Cheers,
>>>
>>> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <ajablon...@thoughtworks.com>
>>> wrote:
>>>
>>>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>>>
>>>> Thanks!
>>>> Alex Jablonski
>>>>
>>>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski
>>>> <ajablon...@thoughtworks.com> wrote:
>>>>
>>>>> Hi there!
>>>>>
>>>>> My colleague and I have run into an issue that seems to appear when
>>>>> running the Solr Prometheus exporter in SolrCloud mode against a
>>>>> large (> 100 node) cluster. The symptoms we're observing are
>>>>> "connection pool shut down" exceptions in the logs and the inability
>>>>> to collect metrics from more than 100 nodes in the cluster.
>>>>>
>>>>> We think we've traced down the issue to
>>>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
>>>>> In that class, hostClientCache exists as a cache of HttpSolrClients
>>>>> (currently with a fixed maximum size of 100) that, on evicting a
>>>>> client from the cache, closes the client's connection. The
>>>>> hostClientCache is used in createHttpSolrClients to return a map of
>>>>> base URLs to HttpSolrClients.
>>>>>
>>>>> Given, say, 300 base URLs, createHttpSolrClients will happily add
>>>>> those base URLs to the cache, and the "get" method on the cache will
>>>>> happily return the new additions. But on adding the 101st
>>>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted
>>>>> and closed. This repeats itself until the only open clients we have
>>>>> are for base URLs 201 through 300; clients for the first 200 base
>>>>> URLs will still be returned, but will already have been closed. When
>>>>> we later use the result of createHttpSolrClients to collect metrics,
>>>>> expecting valid and open HttpSolrClients, we fail to connect with any
>>>>> of the clients that have already been closed, leading to the
>>>>> "Connection pool shut down" exception and to not collecting metrics
>>>>> from those nodes.
>>>>>
>>>>> Our idea for a fix was to change the existing cache to, instead of
>>>>> having a fixed maximum size, use `expireAfterAccess` with a timeout
>>>>> that's a multiple of the scrape interval (twice the scrape
>>>>> interval?). We wanted to confirm a few things:
>>>>>
>>>>> 1. Has this issue been reported before, and if so, is there another
>>>>> fix in progress already?
>>>>> 2. Does this approach seem desirable?
>>>>> 3. If so, are there any opinions on what the cache timeout should be,
>>>>> besides just double the scrape interval?
>>>>>
>>>>> We'll also open a PR shortly with the changes we're proposing and
>>>>> link it here. Please let me know if any of the above is unclear or
>>>>> incorrect.
>>>>>
>>>>> Thanks!
>>>>> Alex Jablonski

--

Richard Goodman | Data Infrastructure engineer

richa...@brandwatch.com

NEW YORK | BOSTON | BRIGHTON | LONDON | BERLIN | STUTTGART | PARIS | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>
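P.S. For the indexInfo=false experiment I mentioned in my earlier email
below, this is roughly the probe I'm timing against our nodes, as a minimal
SolrJ sketch *(the base URL is illustrative, and this isn't wired into the
exporter; indexInfo is a standard parameter of the CoreAdmin STATUS
action)*:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class CoreAdminTimingProbe {
  public static void main(String[] args) throws Exception {
    // Illustrative base URL; point this at one of your Solr nodes.
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://127.0.0.1:8983/solr").build()) {

      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "STATUS");
      // Skipping the per-core index stats (size, segment info, etc.) is
      // what appears to make STATUS fast again on nodes with big cores.
      params.set("indexInfo", "false");

      long start = System.nanoTime();
      NamedList<Object> response = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params));
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;

      System.out.println("STATUS with indexInfo=false took " + elapsedMs + "ms");
      System.out.println("Cores reported: " + ((NamedList<?>) response.get("status")).size());
    }
  }
}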