Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
Thanks!
Alex Jablonski

On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <ajablon...@thoughtworks.com> wrote:

> Hi there!
>
> My colleague and I have run into an issue that seems to appear when
> running the Solr Prometheus exporter in SolrCloud mode against a large
> (> 100 node) cluster. The symptoms we're observing are "Connection pool
> shut down" exceptions in the logs and the inability to collect metrics
> from more than 100 nodes in the cluster.
>
> We think we've traced the issue to
> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java.
> In that class, hostClientCache is a cache of HttpSolrClients (currently
> with a fixed size of 100) that, on evicting a client from the cache,
> closes the client's connection. hostClientCache is used in
> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>
> Given, say, 300 base URLs, createHttpSolrClients will happily add those
> base URLs to the cache, and the cache's "get" method will happily return
> the new additions. But on adding the 101st HttpSolrClient to the cache,
> the first HttpSolrClient gets evicted and closed. This repeats until the
> only open clients are those for base URLs 201 through 300; clients for
> the first 200 base URLs will still be returned, but will already have
> been closed. When we later use the result of createHttpSolrClients to
> collect metrics, expecting valid and open HttpSolrClients, we fail to
> connect with any of the clients that have already been closed, leading
> to the "Connection pool shut down" exception and missing metrics from
> those nodes.
>
> Our idea for a fix is to change the existing cache to use
> `expireAfterAccess` with a timeout that's a multiple of the scrape
> interval (twice the scrape interval?), instead of a fixed maximum size.
> We wanted to confirm a few things:
>
> 1. Has this issue been reported before, and if so, is there another fix
>    in progress already?
> 2. Does this approach seem desirable?
> 3. If so, are there any opinions on what the cache timeout should be,
>    besides just double the scrape interval?
>
> We'll also open a PR shortly with the changes we're proposing and link
> it here. Please let me know if any of the above is unclear or incorrect.
>
> Thanks!
> Alex Jablonski
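To illustrate the failure mode described above, here is a minimal, self-contained sketch, not the exporter's actual code: a hypothetical FakeClient stands in for HttpSolrClient, and a size-bounded LinkedHashMap with a close-on-evict hook stands in for the fixed-size hostClientCache. After 300 inserts into a cache capped at 100 entries, only the last 100 clients remain open:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EvictionDemo {

    // Hypothetical stand-in for HttpSolrClient; only tracks open/closed state.
    static class FakeClient {
        boolean open = true;
        void close() { open = false; }
    }

    // A size-bounded LRU map that closes the evicted entry, mimicking a
    // fixed-size cache whose removal hook closes the client's connections.
    static Map<String, FakeClient> buildCache(int maxSize) {
        return new LinkedHashMap<String, FakeClient>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, FakeClient> eldest) {
                if (size() > maxSize) {
                    eldest.getValue().close(); // eviction closes the client
                    return true;
                }
                return false;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, FakeClient> cache = buildCache(100);
        FakeClient first = new FakeClient();
        cache.put("node-0", first);
        for (int i = 1; i < 300; i++) {
            cache.put("node-" + i, new FakeClient());
        }
        // The first client was evicted and closed; using it for a scrape now
        // would fail with "Connection pool shut down".
        System.out.println("first client open: " + first.open); // false
        System.out.println("cache size: " + cache.size());      // 100
    }
}
```

The proposed fix keeps the cache but replaces the fixed maximum size with `expireAfterAccess` (with a timeout of, say, twice the scrape interval), so a client is only closed after it has gone unused for a full scrape cycle, rather than being closed while other nodes still hold a reference to it.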