Hi there!

My colleague and I have run into an issue that seems to appear when running
the Solr Prometheus exporter in SolrCloud mode against a large (> 100 node)
cluster. The symptoms we're observing are "Connection pool shut down"
exceptions in the logs and the inability to collect metrics from more than
100 nodes in the cluster.

We think we've traced the issue to
lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
In that class, hostClientCache is a cache of HttpSolrClients
(currently having fixed size 100) that, on evicting a client from the
cache, closes the client's connection. The hostClientCache is used in
createHttpSolrClients to return a map of base URLs to HttpSolrClients.

Given, say, 300 base URLs, createHttpSolrClients will happily add those
base URLs to the cache, and the "get" method on the cache will happily
return the new additions to the cache. But on adding the 101st
HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
closed. This repeats itself until the only open clients we have are to base
URLs 201 through 300; clients for the first 200 base URLs will be returned,
but will already have been closed. When we later use the result of
createHttpSolrClients to collect metrics, expecting valid and open
HttpSolrClients, we fail to connect when using any of those clients that
have already been closed, leading to the "Connection pool shut down"
exception and not collecting metrics from those nodes.
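
To make the sequence above concrete, here is a minimal stdlib-only sketch of
the behavior as we understand it. It is not the exporter's code: FakeClient
is a hypothetical stand-in for HttpSolrClient, the LinkedHashMap stands in
for hostClientCache, and the host URLs are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HttpSolrClient: only records whether close() was called.
class FakeClient {
    final String baseUrl;
    boolean closed = false;

    FakeClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    void close() {
        closed = true;
    }
}

class EvictionDemo {
    // Creates one client per base URL through a size-bounded cache that closes
    // clients on eviction, and returns how many of the returned clients are closed.
    static int countClosedClients(int baseUrlCount, final int maxCacheSize) {
        Map<String, FakeClient> cache =
                new LinkedHashMap<String, FakeClient>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, FakeClient> eldest) {
                        if (size() > maxCacheSize) {
                            // Assumed to mirror the cache closing a client on eviction.
                            eldest.getValue().close();
                            return true;
                        }
                        return false;
                    }
                };

        // Mirrors the shape of createHttpSolrClients: add to the cache, then
        // hand the cached client back to the caller.
        List<FakeClient> returned = new ArrayList<>();
        for (int i = 1; i <= baseUrlCount; i++) {
            String baseUrl = "http://host" + i + ":8983/solr"; // made-up URL
            cache.put(baseUrl, new FakeClient(baseUrl));
            returned.add(cache.get(baseUrl));
        }

        int closed = 0;
        for (FakeClient client : returned) {
            if (client.closed) {
                closed++;
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        // 300 base URLs with a cache capped at 100 entries.
        System.out.println(countClosedClients(300, 100)); // prints 200
    }
}
```

With 300 base URLs and a cap of 100, the caller receives 300 clients, 200 of
which are already closed, which matches what we see in the cluster.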

Our idea for a fix was to change the existing cache to, instead of having a
fixed maximum size, use `expireAfterAccess` with a timeout that's a
multiple of the scrape interval (twice the scrape interval?). We wanted to
confirm a few things:

1. Has this issue been reported before, and if so, is there another fix in
progress already?
2. Does this approach seem desirable?
3. If so, are there any opinions on what the cache timeout should be
besides just double the scrape interval?
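
To illustrate the semantics we have in mind, here is a rough stdlib-only
sketch of an access-expiring cache. It is not the proposed patch (that would
use `expireAfterAccess` on the existing cache); AccessExpiringCache, its
injected clock, and the onExpire callback are hypothetical names used only to
show that clients touched on every scrape are never closed, regardless of how
many there are.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical illustration of expire-after-access semantics; not the exporter's code.
class AccessExpiringCache<K, V> {
    private static final class Slot<T> {
        final T value;
        long lastAccess;

        Slot(T value) {
            this.value = value;
        }
    }

    private final Map<K, Slot<V>> slots = new HashMap<>();
    private final long ttl;              // e.g. twice the scrape interval
    private final LongSupplier clock;    // injected so the behavior is easy to test
    private final Consumer<V> onExpire;  // e.g. close the client

    AccessExpiringCache(long ttl, LongSupplier clock, Consumer<V> onExpire) {
        this.ttl = ttl;
        this.clock = clock;
        this.onExpire = onExpire;
    }

    // Returns the cached value, creating it if absent, and refreshes its access time.
    synchronized V get(K key, Function<K, V> loader) {
        Slot<V> slot = slots.get(key);
        if (slot == null) {
            slot = new Slot<>(loader.apply(key));
            slots.put(key, slot);
        }
        slot.lastAccess = clock.getAsLong();
        return slot.value;
    }

    // Removes only entries that have been idle for longer than ttl; returns how many.
    synchronized int expireIdle() {
        long now = clock.getAsLong();
        int removed = 0;
        Iterator<Slot<V>> it = slots.values().iterator();
        while (it.hasNext()) {
            Slot<V> slot = it.next();
            if (now - slot.lastAccess > ttl) {
                it.remove();
                onExpire.accept(slot.value);
                removed++;
            }
        }
        return removed;
    }

    synchronized int size() {
        return slots.size();
    }
}
```

With a timeout of twice the scrape interval, a client would only be closed
after its node has been missing from two consecutive scrapes, so the cache
size tracks the cluster size rather than a fixed constant.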

We'll also open a PR shortly with the changes we're proposing and link it
here. Please let me know if any of the above is unclear or incorrect.

Thanks!
Alex Jablonski
