Hi all,

A couple of months ago, I migrated my Solr deployment off of some legacy hardware (old spinning disks) and onto much newer hardware (SSDs, newer processors). While search performance has improved considerably since the move, I am also seeing intermittent indexing timeouts for 10-15 minute periods about once a day (both from my indexing code and between replicas), which were not happening before. I have been scratching my head trying to figure out why, so far without success, and was hoping someone here could offer some thoughts on how to debug this further.
Some information about my setup:

- Solr Cloud 8.3, running on Linux
- 2 nodes, 1 shard (2 replicas) per collection
- A handful of collections, maxing out in the tens of millions of docs per collection; less than 100 million docs total
- Nodes have 8 CPU cores and SSD storage; 64 GB of RAM per server, with a 26 GB heap
- Relatively aggressive NRT tuning (hard commit 60 sec, soft commit 15 sec)
- Multi-threaded indexing process using the SolrJ CloudSolrClient, sending updates in batches of ~1000 docs
- Indexing and querying happen constantly throughout the day

The indexing process, heap sizes, and soft/hard commit intervals were carefully tuned for my original setup and were working flawlessly until the hardware change. It's only since the move to faster hardware/SSDs that I am seeing timeouts during indexing, which seems counter-intuitive.

My first thought was that stop-the-world GC pauses were causing the timeouts, but when I captured GC logs during one of the timeout windows and ran them through a log analyzer, no issues were detected; the largest GC pause was under 1 second. I also monitor the heap continuously, and it always sits between 15 and 20 GB used out of the 26 GB heap, so I don't think the heap is too small.

My next thought was that background segment merges might be blocking indexing. I am using the dynamic defaults for the merge scheduler, whose behavior almost certainly changed when I moved hardware, since it now detects a non-spinning disk, and my understanding is that the maximum number of concurrent merges is set based on that. I have been unable to confirm this, though: I do not see any merge warnings or errors in the logs, and I have thus far been unable to catch a merge in action to confirm via a thread dump.

Interestingly, when I did take a thread dump during normal operation, I noticed that one of my nodes has a huge number of running threads (~1700) compared to the other node (~150). Most of them are updateExecutor threads that appear to be permanently in a waiting state. I'm not sure what causes the node to get into this state, or whether it is related to the timeouts at all.

I have so far been unable to replicate the issue in a test environment, so it's hard to trial-and-error possible solutions. Does anyone have suggestions on what could suddenly be causing these timeouts, or tips on how to debug further?

Thanks!
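P.S. In case it helps, this is roughly the shape of the indexing client -- a simplified, single-threaded sketch rather than my exact code. The ZK host, collection name, field names, batch size, and timeout values below are placeholders, not my real settings:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {

    private static final int BATCH_SIZE = 1000;               // roughly what I use
    private static final String COLLECTION = "my_collection"; // placeholder name

    public static void main(String[] args) throws Exception {
        // Client-side timeouts live here; these are the values that trip when an
        // update call takes too long on the Solr side. Host and values are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty())
                .withConnectionTimeout(15_000)   // ms
                .withSocketTimeout(120_000)      // ms
                .build()) {

            client.setDefaultCollection(COLLECTION);

            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = 0; i < 10_000; i++) {   // stand-in for the real document source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("body_t", "example document " + i);
                batch.add(doc);

                if (batch.size() >= BATCH_SIZE) {
                    sendBatch(client, batch);
                }
            }
            if (!batch.isEmpty()) {
                sendBatch(client, batch);
            }
        }
    }

    private static void sendBatch(CloudSolrClient client, List<SolrInputDocument> batch)
            throws SolrServerException, IOException {
        // This add() call is what intermittently times out. No explicit commits here;
        // commits are left to the autoCommit/autoSoftCommit settings on the server.
        client.add(batch);
        batch.clear();
    }
}

The real process runs several loops like this in parallel threads against a shared client, and the timeouts show up on the add() calls.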