Thanks for sharing this with us and debugging this reasonably far. If this happens when a replica is being added, then this makes sense to me because the replica isn't ready yet to respond with a searcher. Conditionally if the replica is in recovery, then maybe don't ask for the indexInfo?
I can see room for improvement in Solr here. The core admin stats don't actually need an IndexSearcher, it needs a DirectoryReader which is much cheaper to get. Even SolrCore.getIndexSize creates/closes one on the fly. I filed an issue: https://issues.apache.org/jira/browse/SOLR-14325 ~ David Smiley Apache Lucene/Solr Search Developer http://www.linkedin.com/in/davidwsmiley On Fri, Mar 13, 2020 at 8:50 AM Richard Goodman <[email protected]> wrote: > Hi there, > > I went to a Solr meetup in London a month or two ago, and asked this > question then, however, people weren't fully sure what the underlying issue > was and were curious to know. > > - Background of our tech stack > > Where I work, we run just under 30 solr clusters in our live environment: > > - Most of our clusters have the upper limit capacity of around 100TB, > but typically our used capacity per cluster is around 55% on average. > - Most of our clusters are running 96 instances, if not more. > - Each instance has a TB of storage, > - Most of our clusters have 322 collections, where each collection is > split into 16 shards and 3-way replication > - Each core is on average 20GB, some going up to 40GB. > - We are currently running solr v7.7.2 > > > We've been dealing with an issue with running Solr Cloud to this size for > quite some time, whilst it's never caused a catastrophic failure, when it > comes to us wanting to improve our monitoring, it can become quite a > blocker for us. > > - The issue > > We use the core admin API a lot, to get information on core level metrics. > However, if there are cores recovering, this will lock and slow down the > API massively, sometimes minutes to get a response from said API. > > I worked on implementing the solr prometheus exporter to provide us vision > on our clusters, however, because of the core admin timeout, I am met with > tons of errors within the application and unable to collect core level > metrics. Similarly, if we use the metrics api for the node level metrics > and core level metrics, it will take a very long time to return results. > > - Any findings so far > > We discovered that if you supply "&indexInfo=false" to the core api call, > it will return instantly, but of course you miss some information that is > quite useful. We have potentially identified where in a jstack the issue is > occuring, as quoted from a colleague: > > "Had a look at the code from my stack trace, as we expected it's waiting > for the index to be opened to populate the info about numDocs etc, for the > Core admin API call" > > However, not entirely sure what the best approach to fixing this is? We > were also curious if other people are getting a similar issue? > > - Evidence > > I created a small dev cluster in our tech stack, and loaded a 30GB backup, > where each core is around 2GB big on average. I then just ran a time on a > curl of the core admin api; > > *Call before a replica being added:* > > root @ rich-solr0.dev-cluster.net ~ # time curl -s > "http://localhost:8080/solr/admin/cores" > core_admin_pre.outreal 0m0.070s > user 0m0.008s > sys 0m0.008s > > > *Call during a replica being added:* > > root @ rich-solr1.dev-cluster.net ~ # time curl -s > "http://localhost:8080/solr/admin/cores" > core_admin_during.outreal > 0m8.130s > user 0m0.016s > sys 0m0.008s > > > > *Call after a replica being added:* > > root @ rich-solr1.dev-cluster.net ~ # time curl -s > "http://localhost:8080/solr/admin/cores" > core_admin_done.out > > real 0m0.037s > > user 0m0.008s > > sys 0m0.012s > > > I know here, that the difference is only 8 seconds, however, this is only > with 2GB cores, in our live environment, it's significantly larger. > > I have also collected a bunch of jstacks that are before, during and after > this occurring, where a previous colleague quoted the following being a > possible root cause: > > "qtp511717113-1006" #1006 prio=5 os_prio=0 tid=0x00007f465800a000 nid=0xc709 > in Object.wait() [0x00007f6b2f954000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2206) > - locked <0x00000004d1d75af0> (a java.lang.Object) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1994) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1868) > at > org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:339) > at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:46) > at > org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360) > at > org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396) > at > org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180) > > > The stacktraces can be found here > <https://github.com/richardbadman/core-admin-timeout>. > > If anyone has any advice, that would be greatly appreciated. > > Cheers, > -- > > Richard Goodman >
