Hi Alex,

I'm really happy to see an email about this. I've been working for a little
while on setting up the Prometheus exporter for our clusters. I spent a good
amount of time setting up the config, and started getting some really decent
graphs in Grafana for metrics we've never been able to collect before.

In our stage environment this worked like a charm, so shortly afterwards I
rolled it out to our live environment. That's when I started to run into
trouble.

I was getting the exact same problem you were facing, so I decided to split
out my config so that I had one config dedicated to JVM metric collection,
one dedicated to node-level metrics, and so on. I was still getting loads of
errors coming through, which confused me.

Our clusters are typically 96 nodes, so based on your report I'm not sure why
I would be hitting this issue. One theory I had was that timeouts are
happening on the core admin API *(our indexes range between 5GB and 20GB in
size each)*, and our clusters are typically tens of TB in size overall.
Because of this, whenever we have a replica state change we notice
significant delays on /solr/admin/cores, which sometimes takes a few minutes
to return.

That's why I think there is a strong connection to the core admin API being
the problem here. The reason is that we have one unique cluster that
typically stores 30 days' worth of data across its collections: when a new
day comes along we create a collection for that day, and any collections
older than 30 days get dropped. Documents within this cluster typically don't
change either, so there's rarely any state change, which makes that cluster
significantly more reliable for us, whereas our other main group of clusters
goes through a significant amount of change every day.

I'm currently applying your patch to our build, and will deploy it and keep
you updated on whether it helps. I'm also looking at whether there is a way
to make indexInfo=false the default on the core admin API, which could help
us here *(using that makes the response time insanely fast again, although it
does remove some statistics)*.
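For reference, the difference I'm talking about is just the indexInfo flag on
the core admin STATUS action, i.e. /solr/admin/cores?action=STATUS&indexInfo=false.
Something like the following via SolrJ is roughly how the comparison can be
made (a sketch only; the node URL and class name are made up):

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class CoreStatusTimingCheck {
  public static void main(String[] args) throws Exception {
    // Made-up node URL, purely for illustration.
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://solr-node-1:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "STATUS");
      params.set("indexInfo", "false"); // skip per-core index stats, the expensive part for us
      GenericSolrRequest status =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params);
      long start = System.nanoTime();
      NamedList<Object> response = client.request(status);
      System.out.println("took " + ((System.nanoTime() - start) / 1_000_000)
          + " ms, status: " + response.get("status"));
    }
  }
}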

That said, it's very experimental and I'm not sure it's the best approach,
but you have to start somewhere, right?

I'd be keen to look into this issue with you, as it's been a problem for us
also.
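
For anyone else following the thread, my understanding of the proposed change
is roughly along these lines. This is only a sketch using Guava's CacheBuilder
for illustration, with made-up names (scrapeIntervalSeconds etc.); I'm not
claiming it's exactly what the patch does, the real change is in the linked PR:

import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class HostClientCacheSketch {

  // Expire clients on inactivity instead of capping the cache at a fixed size,
  // so clients for nodes that are still being scraped are never evicted
  // (and closed) while a scrape is using them.
  public static Cache<String, HttpSolrClient> buildCache(long scrapeIntervalSeconds) {
    RemovalListener<String, HttpSolrClient> closeOnExpiry = removal -> {
      try {
        removal.getValue().close(); // only closed once the entry has actually expired
      } catch (IOException e) {
        // nothing useful to do if close fails; log in real code
      }
    };
    return CacheBuilder.newBuilder()
        .expireAfterAccess(2 * scrapeIntervalSeconds, TimeUnit.SECONDS)
        .removalListener(closeOnExpiry)
        .build();
  }

  // Clients are created lazily per base URL and reused while the node keeps
  // getting scraped, keeping the cached entry fresh.
  public static HttpSolrClient clientFor(Cache<String, HttpSolrClient> cache, String baseUrl)
      throws ExecutionException {
    return cache.get(baseUrl, () -> new HttpSolrClient.Builder(baseUrl).build());
  }
}

The key point, as I read it, is that a client only gets closed once its node
hasn't been scraped for a while, rather than whenever the cache goes over a
fixed size.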

I'll reply again with any results I find from applying your patch.

Cheers,

On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <ajablon...@thoughtworks.com>
wrote:

> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>
> Thanks!
> Alex Jablonski
>
> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
> ajablon...@thoughtworks.com> wrote:
>
>> Hi there!
>>
>> My colleague and I have run into an issue that seems to appear when
>> running the Solr Prometheus exporter in SolrCloud mode against a large (>
>> 100 node) cluster. The symptoms we're observing are "connection pool shut
>> down" exceptions in the logs and the inability to collect metrics from more
>> than 100 nodes in the cluster.
>>
>> We think we've traced down the issue to
>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
>> . In that class, hostClientCache exists as a cache of HttpSolrClients
>> (currently having fixed size 100) that, on evicting a client from the
>> cache, closes the client's connection. The hostClientCache is used in
>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>
>> Given, say, 300 base URLs, createHttpSolrClients will happily add those
>> base URLs to the cache, and the "get" method on the cache will happily
>> return the new additions to the cache. But on adding the 101st
>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>> closed. This repeats itself until the only open clients we have are to base
>> URLs 201 through 300; clients for the first 200 base URLs will be returned,
>> but will already have been closed. When we later use the result of
>> createHttpSolrClients to collect metrics, expecting valid and open
>> HttpSolrClients, we fail to connect when using any of those clients that
>> have already been closed, leading to the "Connection pool shut down"
>> exception and not collecting metrics from those nodes.
>>
>> Our idea for a fix was to change the existing cache to, instead of having
>> a fixed maximum size, use `expireAfterAccess` with a timeout that's a
>> multiple of the scrape interval (twice the scrape interval?). We wanted to
>> confirm a few things:
>>
>> 1. Has this issue been reported before, and if so, is there another fix
>> in progress already?
>> 2. Does this approach seem desirable?
>> 3. If so, are there any opinions on what the cache timeout should be
>> besides just double the scrape interval?
>>
>> We'll also open a PR shortly with the changes we're proposing and link
>> here. Please let me know if any of the above is unclear or incorrect.
>>
>> Thanks!
>> Alex Jablonski
>>
>>

-- 

Richard Goodman    |    Data Infrastructure engineer

richa...@brandwatch.com


