Pull request is here: https://github.com/apache/lucene-solr/pull/1022/

Thanks!
Alex Jablonski

On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <ajablon...@thoughtworks.com>
wrote:

> Hi there!
>
> My colleague and I have run into an issue that appears when running the
> Solr Prometheus exporter in SolrCloud mode against a large (> 100 node)
> cluster. The symptoms we're observing are "Connection pool shut down"
> exceptions in the logs and the inability to collect metrics from more
> than 100 nodes in the cluster.
>
> We think we've traced the issue to
> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
> . In that class, hostClientCache is a cache of HttpSolrClients (currently
> with a fixed maximum size of 100) that closes a client's connection when
> that client is evicted from the cache. The hostClientCache is used in
> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>
> Given, say, 300 base URLs, createHttpSolrClients will happily add those
> base URLs to the cache, and the "get" method on the cache will happily
> return the new additions to the cache. But on adding the 101st
> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
> closed. This repeats itself until the only open clients we have are to base
> URLs 201 through 300; clients for the first 200 base URLs will be returned,
> but will already have been closed. When we later use the result of
> createHttpSolrClients to collect metrics, expecting valid and open
> HttpSolrClients, we fail to connect when using any of those clients that
> have already been closed, leading to the "Connection pool shut down"
> exception and not collecting metrics from those nodes.
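> To make the failure mode concrete, here's a minimal, self-contained sketch
> of the eviction behavior. It uses a plain access-ordered LinkedHashMap in
> place of the actual Guava cache, and a hypothetical FakeClient in place of
> HttpSolrClient (the names, URLs, and counts are ours for illustration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EvictionDemo {

    // Hypothetical stand-in for HttpSolrClient: only tracks open/closed state.
    static class FakeClient {
        final String baseUrl;
        boolean open = true;
        FakeClient(String baseUrl) { this.baseUrl = baseUrl; }
        void close() { open = false; }
    }

    // Returns how many of the clients handed out by the cache end up closed.
    static long countClosed(int numUrls, int maxSize) {
        // An access-ordered LinkedHashMap acts as an LRU cache; the eviction
        // hook closes the evicted client, mirroring the removal listener on
        // hostClientCache.
        Map<String, FakeClient> cache =
            new LinkedHashMap<String, FakeClient>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, FakeClient> eldest) {
                    if (size() > maxSize) {
                        eldest.getValue().close();
                        return true;
                    }
                    return false;
                }
            };

        // Simulate createHttpSolrClients collecting one client per base URL.
        List<FakeClient> handedOut = new ArrayList<>();
        for (int i = 1; i <= numUrls; i++) {
            String url = "http://node" + i + ":8983/solr";
            FakeClient client = cache.get(url);
            if (client == null) {
                client = new FakeClient(url);
                cache.put(url, client);
            }
            handedOut.add(client);
        }
        // How many of the clients we still hold references to are closed?
        return handedOut.stream().filter(c -> !c.open).count();
    }

    public static void main(String[] args) {
        // With 300 URLs and a cache capped at 100 entries, the first 200
        // clients we handed out have already been evicted and closed.
        System.out.println(countClosed(300, 100));
    }
}
```

> Using any of those first 200 clients afterwards is what produces the
> "Connection pool shut down" exception.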
>
> Our idea for a fix is to change the existing cache to use
> `expireAfterAccess` with a timeout that's a multiple of the scrape
> interval (twice the scrape interval?), instead of a fixed maximum size.
> We wanted to confirm a few things:
>
> 1. Has this issue been reported before, and if so, is there another fix in
> progress already?
> 2. Does this approach seem desirable?
> 3. If so, are there any opinions on what the cache timeout should be
> besides just double the scrape interval?
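> For discussion's sake, the expire-after-access semantics we have in mind
> look roughly like this stdlib-only sketch. The class name, the fake clock,
> and the tick() pruning are ours for illustration only; the real change
> would just reconfigure the cache builder in SolrCloudScraper:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Function;

// Toy model of a cache whose entries expire a fixed interval after their
// last access, rather than being evicted by count. A manual clock replaces
// wall time so the behavior is easy to follow.
public class ExpireAfterAccessCache<K, V> {
    private final long ttl;                       // e.g. 2 * scrape interval
    private final Function<K, V> loader;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> lastAccess = new HashMap<>();
    private long now = 0;                         // fake clock

    public ExpireAfterAccessCache(long ttl, Function<K, V> loader) {
        this.ttl = ttl;
        this.loader = loader;
    }

    // Advance the clock and drop (in real code: close) anything idle too long.
    public void tick(long elapsed) {
        now += elapsed;
        Iterator<Map.Entry<K, Long>> it = lastAccess.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<K, Long> e = it.next();
            if (now - e.getValue() >= ttl) {
                values.remove(e.getKey());        // here the client would be closed
                it.remove();
            }
        }
    }

    // Every access refreshes the entry's expiry, so clients that are scraped
    // each interval stay open regardless of cluster size.
    public V get(K key) {
        lastAccess.put(key, now);
        return values.computeIfAbsent(key, loader);
    }

    public int size() { return values.size(); }
}
```

> The point of these semantics: a node that is scraped every interval keeps
> refreshing its entry and its client is never closed mid-scrape, however
> many nodes the cluster has; a client is only closed once its node has gone
> unscraped for the full timeout, which is why a timeout of at least twice
> the scrape interval seems like a safe starting point.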
>
> We'll also open a PR shortly with the changes we're proposing and link
> here. Please let me know if any of the above is unclear or incorrect.
>
> Thanks!
> Alex Jablonski
>
>
