Hey Richard,

I'd definitely love to hear whether this improves things for you. According
to Guava's documentation, the cache can start evicting items when it gets
close to the limit (
https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
not just when it reaches it, so if this does end up helping you out, that
could be the reason. I haven't dug into the implementation of "close to the
maximum" that Guava's cache uses, or whether that would happen in the
course of building up the maps to ping all of the nodes (which is where the
issue for us seemed to be), but it's at least a possible avenue to explore
further.
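
To make that concrete, here is a tiny standalone sketch (plain Guava, nothing
from the exporter) of the behaviour I mean; as I understand it, the size bound
is divided across the cache's internal segments, so entries can be evicted
before the overall size ever reaches the nominal maximum:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

public class EvictionBeforeLimitDemo {
  public static void main(String[] args) {
    // A cache nominally bounded at 100 entries, with a listener so that
    // evictions are visible as they happen.
    Cache<String, String> cache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .removalListener((RemovalListener<String, String>) notification ->
            System.out.println("evicted " + notification.getKey()
                + " (" + notification.getCause() + ")"))
        .build();

    // Insert exactly 100 entries -- never more than the nominal maximum.
    for (int i = 0; i < 100; i++) {
      cache.put("solr-node-" + i, "client-" + i);
    }

    // Depending on how the keys hash across segments, this can print less
    // than 100, i.e. some entries were already evicted.
    System.out.println("still cached: " + cache.size());
  }
}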

In any case, thanks for trying this out!

On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <richa...@brandwatch.com>
wrote:

> Hi Alex,
>
> This makes me really happy to see an email about this. I've been working
> for a little while on setting up the Prometheus exporter for our clusters. I
> spent a good amount of time setting up the config, and started getting some
> really decent graphs in Grafana for metrics we've never been able to collect
> before.
>
> In our stage environment this worked like a charm, so I shortly rolled it
> out to our live environment. That's when I started to run into trouble.
>
> I too was getting the exact problem you were facing. I then decided to
> split out all of my config so I had one config dedicated to JVM metric
> collection, one dedicated to node-level metrics, and so on, but I was still
> getting loads of errors coming through, which confused me.
>
> Our clusters are typically 96 nodes, so based on your report I'm not sure
> why I would be hitting this issue. One theory I had was timeouts on the core
> admin API *(our indexes each range between 5GB and 20GB in size)*, with our
> clusters typically around tens of TB in size in total. Because of this,
> whenever we have any replica state change, we notice significant delays in
> /solr/admin/cores, sometimes taking a few minutes to return.
>
> Because of this, I think there is a strong connection to the core admin API
> being the problem here. The reason is that we have one unique cluster that
> typically stores 30 days' worth of data across its collections: when a new
> day comes along we create a collection for that day, and any collections
> older than 30 days get dropped. Documents within this cluster typically
> don't change either, so there's never really any state change, which makes
> the cluster significantly more reliable for us, whereas our other main group
> of clusters goes through a significant amount of change every day.
>
> I'm currently applying your patch to our build, and will deploy it and keep
> you updated on whether it helps. At the moment I'm also looking at whether
> there is a way to make indexInfo=false the default for the core admin API,
> which could help us here *(using that makes the response time insanely fast
> again, although it does remove some statistics)*.
>
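> To illustrate what I mean: the STATUS action already accepts indexInfo as a
> request parameter, so the kind of request I'm talking about looks roughly
> like this *(the hostname is just a placeholder)*:
>
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
>
> public class CoreStatusWithoutIndexInfo {
>   public static void main(String[] args) throws Exception {
>     // indexInfo=false asks the STATUS action to skip the per-core index
>     // statistics, which is what makes the response come back quickly.
>     String url = "http://solr-host:8983/solr/admin/cores"
>         + "?action=STATUS&indexInfo=false";
>     HttpResponse<String> response = HttpClient.newHttpClient().send(
>         HttpRequest.newBuilder(URI.create(url)).build(),
>         HttpResponse.BodyHandlers.ofString());
>     System.out.println(response.statusCode());
>   }
> }
>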
> That said, it's very experimental, and I'm not sure it's the best approach,
> but you have to start somewhere, right?
>
> I'd be keen to look into this issue with you, as it's been a problem for
> us also.
>
> I'll reply again with any results I find from applying your patch.
>
> Cheers,
>
> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <ajablon...@thoughtworks.com>
> wrote:
>
>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>
>> Thanks!
>> Alex Jablonski
>>
>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
>> ajablon...@thoughtworks.com> wrote:
>>
>>> Hi there!
>>>
>>> My colleague and I have run into an issue that seems to appear when
>>> running the Solr Prometheus exporter in SolrCloud mode against a large (>
>>> 100 node) cluster. The symptoms we're observing are "connection pool shut
>>> down" exceptions in the logs and the inability to collect metrics from more
>>> than 100 nodes in the cluster.
>>>
>>> We think we've traced the issue down to
>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
>>> In that class, hostClientCache is a cache of HttpSolrClients (currently
>>> with a fixed size of 100) that, on evicting a client from the cache,
>>> closes the client's connection. The hostClientCache is used in
>>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>>
>>> Given, say, 300 base URLs, createHttpSolrClients will happily add those
>>> base URLs to the cache, and the "get" method on the cache will happily
>>> return the new additions to the cache. But on adding the 101st
>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>>> closed. This repeats itself until the only open clients we have are to base
>>> URLs 201 through 300; clients for the first 200 base URLs will be returned,
>>> but will already have been closed. When we later use the result of
>>> createHttpSolrClients to collect metrics, expecting valid and open
>>> HttpSolrClients, we fail to connect when using any of those clients that
>>> have already been closed, leading to the "Connection pool shut down"
>>> exception and not collecting metrics from those nodes.
>>>
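>>> In code terms, the shape of the problem looks roughly like this (a
>>> simplified reconstruction rather than the exporter's exact source; the
>>> client construction in particular is abbreviated):
>>>
>>> import java.io.IOException;
>>> import java.util.HashMap;
>>> import java.util.List;
>>> import java.util.Map;
>>> import java.util.concurrent.ExecutionException;
>>>
>>> import com.google.common.cache.Cache;
>>> import com.google.common.cache.CacheBuilder;
>>> import com.google.common.cache.RemovalListener;
>>>
>>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>>
>>> public class ScraperCacheSketch {
>>>
>>>   // Fixed-size cache of clients; the removal listener closes whatever
>>>   // gets evicted.
>>>   private final Cache<String, HttpSolrClient> hostClientCache =
>>>       CacheBuilder.newBuilder()
>>>           .maximumSize(100)
>>>           .removalListener(
>>>               (RemovalListener<String, HttpSolrClient>) notification -> {
>>>                 try {
>>>                   notification.getValue().close();
>>>                 } catch (IOException e) {
>>>                   // ignored for the sketch
>>>                 }
>>>               })
>>>           .build();
>>>
>>>   Map<String, HttpSolrClient> createHttpSolrClients(List<String> baseUrls)
>>>       throws ExecutionException {
>>>     Map<String, HttpSolrClient> clients = new HashMap<>();
>>>     for (String baseUrl : baseUrls) {
>>>       // get() returns a client for every URL, but once we pass the 101st
>>>       // URL the cache starts evicting (and closing) the earliest clients,
>>>       // even though they are already in the map we are about to return.
>>>       clients.put(baseUrl, hostClientCache.get(baseUrl,
>>>           () -> new HttpSolrClient.Builder(baseUrl).build()));
>>>     }
>>>     return clients;
>>>   }
>>> }
>>>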
>>> Our idea for a fix was to change the existing cache to, instead of
>>> having a fixed maximum size, use `expireAfterAccess` with a timeout that's
>>> a multiple of the scrape interval (twice the scrape interval?). We wanted
>>> to confirm a few things:
>>>
>>> 1. Has this issue been reported before, and if so, is there another fix
>>> in progress already?
>>> 2. Does this approach seem desirable?
>>> 3. If so, are there any opinions on what the cache timeout should be
>>> besides just double the scrape interval?
>>>
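>>> To make the proposal (and question 3) concrete, what we have in mind is
>>> roughly to swap the cache construction in the sketch above for something
>>> like the following (the two-times-scrape-interval value is only a
>>> placeholder, and scrapeIntervalSeconds would come from the exporter's
>>> configuration):
>>>
>>> // Same removal listener as before; only the eviction policy changes.
>>> // (Also needs java.util.concurrent.TimeUnit on top of the imports above.)
>>> Cache<String, HttpSolrClient> hostClientCache = CacheBuilder.newBuilder()
>>>     .expireAfterAccess(2 * scrapeIntervalSeconds, TimeUnit.SECONDS)
>>>     .removalListener(
>>>         (RemovalListener<String, HttpSolrClient>) notification -> {
>>>           try {
>>>             notification.getValue().close();
>>>           } catch (IOException e) {
>>>             // ignored for the sketch
>>>           }
>>>         })
>>>     .build();
>>>
>>> That way a client is only closed once it hasn't been touched for two full
>>> scrape cycles, so everything handed out by createHttpSolrClients should
>>> still be open when the metrics are actually collected.
>>>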
>>> We'll also open a PR shortly with the changes we're proposing and link it
>>> here. Please let me know if any of the above is unclear or incorrect.
>>>
>>> Thanks!
>>> Alex Jablonski
>>>
>>>
>
> --
>
> Richard Goodman    |    Data Infrastructure engineer
>
> richa...@brandwatch.com
>
