Thank you David. Mystery, for sure. Fun ... depends on your definition I guess ;)
TL;DR: We were able to get the end-to-end duration of our test suite back down to baseline. The root cause of the majority of the regression turned out to be in our application's ORM layer (Hibernate 7 upgrade), not Solr itself. However, the Solr-related investigation uncovered real performance observations around MMapDirectory, commit overhead, and soft vs hard commits on Solr 10 that we wanted to share in case they're useful to others.

*ROOT CAUSE: NOT SOLR (FOR THE MOST PART)*

Our migration upgraded both Solr (8 -> 10) and Hibernate (6 -> 7) simultaneously, which made attribution difficult. After detailed profiling, we found that a Hibernate 7 behavioral change around auto-flushing with JOINED inheritance tables was responsible for the majority of the regression (116 min -> ~27 min after the fix). The Solr-related findings below are independent observations that contribute to the remaining difference between our Solr 8 and Solr 10 test suite timings, though we have not isolated exactly how much of the remaining ~8 minute gap they explain versus other framework changes in our upgrade (Hibernate 7, JUnit 6, Jackson 3, Spring 7).

*FINDING 1: MMAPDIRECTORY PANAMA MEMORYSEGMENT OVERHEAD*

After ruling out Jetty, we used JFR analysis to find that the SB4 branch generates 3,906 HandshakeAllThreads VM operations in 61 seconds vs 65 on main. Using -Xlog:handshake*=trace we identified the handshake type: CloseScopedMemory, with 163,637 operations in 74 seconds (2,211/sec). This is caused by Lucene 10's MMapDirectory using Panama MemorySegment instead of the old MappedByteBuffer approach. Each Arena.close() triggers a CloseScopedMemory handshake that deoptimizes the top frame of every JVM thread.
This is documented in:
https://github.com/apache/lucene/issues/13325
https://bugs.openjdk.org/browse/JDK-8335480 (fix in JDK 24)

This was also reported on the Solr side:
https://issues.apache.org/jira/browse/SOLR-17375

The suggested workaround there is to set -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false, which works on Lucene 9. However, this option was removed in Lucene 10: the Panama MemorySegment implementation is now the only MMapDirectory code path, with no fallback and no disable flag.

The key difference across Solr versions:

- Solr 8 / Lucene 8: MappedByteBuffer + Unsafe.invokeCleaner(), no handshakes at all
- Solr 9 / Lucene 9: Panama MemorySegment, but -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false can disable it (as suggested in SOLR-17375)
- Solr 10 / Lucene 10: Panama is mandatory, the enableMemorySegments flag no longer exists

Our workaround for Solr 10: set -Dsolr.directoryFactory=solr.NIOFSDirectoryFactory in the test JVM args. This avoids MMap entirely and reduced CloseScopedMemory from 163,637 to 12,379 operations (13x fewer). In CI, this brought our IT suite from 116 minutes to 109 minutes, a modest 6% improvement. Locally (macOS, 10 cores) the difference was similarly small (74s to 71s for a single test class), because the handshake coordination is cheap with many cores. The remaining 12k CloseScopedMemory events likely come from Lucene internals that still use MemorySegment regardless of the directory factory.
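For anyone who cannot easily change JVM args, the same workaround can be applied programmatically from a test bootstrap. A minimal sketch follows; only the solr.directoryFactory property name and the solr.NIOFSDirectoryFactory value come from Solr, the class and method names here are ours:

```java
// Sketch: force NIOFSDirectory for embedded Solr test cores.
// Must run before the first CoreContainer / EmbeddedSolrServer is created,
// because the directory factory is resolved at core load time.
public class SolrTestDirectorySetup {

    public static void useNioFsDirectory() {
        System.setProperty("solr.directoryFactory", "solr.NIOFSDirectoryFactory");
    }

    public static void main(String[] args) {
        useNioFsDirectory();
        System.out.println(System.getProperty("solr.directoryFactory"));
    }
}
```

In our setup this lives in a static initializer of the test base class, which guarantees it runs before any core is loaded.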
*FINDING 2: HARD COMMITS ARE 25% SLOWER ON LUCENE 10*

We instrumented SolrServiceImpl.commit() with System.nanoTime() timing around the SolrClient.commit() call and ran the same test class (DiscoveryRestControllerIT, 83 tests, ~2000 Solr commits) on both branches:

Main (Solr 8 / Lucene 8.11.4): 1,972 commits, total 24,600 ms, avg 12 ms/commit
SB4 (Solr 10 / Lucene 10.3.2): 1,966 commits, total 30,180 ms, avg 15 ms/commit

That is a 25% increase per commit, and commits account for 43% of the total test class runtime on both branches. With ~2000 commits per test class and hundreds of test classes in the full IT suite, this adds up to tens of minutes.

We believe this is at least partly related to the lucene103 codec TrieReader regression documented in:
https://github.com/apache/lucene/issues/15820
which reports a 4x slower seekExact for _id lookups compared to the lucene90 FST approach. Every solr.add() does an ID lookup for deduplication, and the hard commit path triggers segment reopening which exercises this code path.

*FINDING 3: SOFT COMMITS ELIMINATE THE OVERHEAD*

Our DSpace codebase was calling solr.commit() (hard commit) everywhere, despite having autoCommit configured in solrconfig.xml with maxTime=10000 and openSearcher=true. This meant every explicit commit was doing an fsync + opening a new IndexSearcher, which is redundant when autoCommit handles durability on a 10-second schedule. We switched all solr.commit() calls to solr.commit(true, true, true) (soft commit: waitFlush=true, waitSearcher=true, softCommit=true).

Results for DiscoveryRestControllerIT (83 tests):

Main, hard commit (Solr 8): 57.48s
SB4, hard commit (Solr 10): 70.05s (+22%)
SB4, soft commit (Solr 10): 42.94s (-25% vs main!)

Soft commits on Solr 10 are faster than hard commits on Solr 8.
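To make the change concrete, here is a self-contained sketch of the commit switch. The CommitFn interface stands in for SolrJ's SolrClient.commit(waitFlush, waitSearcher, softCommit) three-argument overload; everything else (names, the logging in main) is ours for illustration:

```java
// Sketch of the hard-commit -> soft-commit change described above.
public class SoftCommitSketch {

    // Stand-in for SolrJ's SolrClient.commit(waitFlush, waitSearcher, softCommit).
    interface CommitFn {
        void commit(boolean waitFlush, boolean waitSearcher, boolean softCommit);
    }

    // Before: solr.commit() -- a hard commit (fsync + new IndexSearcher) on every call.
    // After: a soft commit for search visibility only; the solrconfig.xml autoCommit
    // (maxTime=10000, openSearcher=true) still takes care of durability.
    static void commitForVisibility(CommitFn solr) {
        solr.commit(true, true, true); // waitFlush, waitSearcher, softCommit
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        commitForVisibility((waitFlush, waitSearcher, soft) ->
                log.append("softCommit=").append(soft));
        System.out.println(log);
    }
}
```

The key design point is that soft commits only make changes visible to new searchers; they do not fsync, so they are only safe when something else (autoCommit in our case) guarantees durability.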
We filed this as a separate improvement for DSpace since it benefits all versions:
https://github.com/DSpace/DSpace/issues/12153
https://github.com/DSpace/DSpace/pull/12154

Note: when switching to soft commits in tests, we did encounter a subtle behavioral difference with EmbeddedSolrServer on Solr 10: deleteByQuery("*:*") followed by a soft commit does not reliably make the deletes visible to NRT readers. We had to add an explicit hard commit after deleteByQuery in our test cleanup to ensure proper test isolation. This was not an issue with Solr 8.

*SUMMARY*

Our embedded Solr 10 test suite (6 cores, ~2000 commits per test class, JDK 21) went from 19 minutes (Solr 8 baseline) to 116 minutes after upgrading both Solr and Hibernate simultaneously. After investigation, the regression broke down as:

1. Hibernate 7 auto-flush behavioral change: 116 -> ~27 min (the dominant factor, not Solr-related)
2. NIOFSDirectoryFactory to avoid Panama MMap overhead: ~6% gain
3. Soft commits instead of hard commits: eliminated the per-class Solr 10 vs Solr 8 gap entirely

The final CI result after all fixes: 27 minutes for the full 3,400+ integration test suite, which is close to the Solr 8 baseline of 19 minutes. The remaining gap (~8 min) is likely a combination of the Lucene 10 codec overhead and the general overhead of EmbeddedSolrServer on Solr 10.

If any of this analysis seems off or if there are better approaches we should consider, we would welcome the feedback.

Best regards,
Bram Luyten

On Wed, Mar 25, 2026 at 2:27 AM David Smiley <[email protected]> wrote:

> No problem. Good luck to you Bram! Looks like a fun mystery.
>
> On Tue, Mar 24, 2026 at 11:34 AM Bram Luyten <[email protected]> wrote:
> >
> > Hi David,
> >
> > Thank you for the detailed response. I owe you an apology: after re-examining our data based on your feedback, the wall-clock profiling led us to an incorrect attribution. Sorry for the noise.
> >
> > You were right to question the wall-clock numbers for background threads. When we re-checked with CPU profiling (async-profiler -e cpu), AdaptiveExecutionStrategy.produce() shows exactly 0 CPU samples on the Solr 10 branch. The selector thread is idle, not busy-polling. Wall-clock profiling inflated it because it samples all threads regardless of state. Total CPU samples are nearly identical between branches (17,519 vs 17,119), same distribution.
> >
> > To answer your question: we create exactly one CoreContainer for the entire test suite, held as a static singleton with 6 cores. Between tests we clear data via deleteByQuery + commit, but the container stays alive for the full JVM lifetime. So the "lots of CoreContainers" scenario does not apply here.
> >
> > Given identical CPU profiles and zero Jetty CPU samples, the Solr path is almost certainly not our bottleneck. We will look elsewhere. I don't think the SolrStartup benchmark would be productive at this point.
> >
> > Again, apologies for the false alarm, and thank you for steering us in the right direction.
> >
> > Best regards,
> > Bram Luyten
> >
> > On Tue, Mar 24, 2026 at 2:43 PM David Smiley <[email protected]> wrote:
> >
> > > Hello Bram,
> > >
> > > Some of what you are sharing confuses me. I don't think sharing the wall-clock-time is pertinent for background threads -- and I assume those Jetty HttpClients are in the background doing nothing. Yes, CoreContainer creates a Jetty HttpClient that is unused in an embedded mode. Curious; are you creating lots of CoreContainers (perhaps indirectly via creating EmbeddedSolrServer)? Maybe we have a regression there. I suspect a test environment would be doing this, creating a CoreContainer for each test, basically. Solr's tests do this too! And a slowdown as big as you show sounds like something we'd notice... most likely.
> > > On the other hand, if your CI/tests creates very few CoreContainers and there's all this slowdown you report, then CoreContainer startup is mostly irrelevant.
> > >
> > > We do have a benchmark that should capture a slowdown in this area --
> > > https://github.com/apache/solr/blob/9c911e7337cd1026accc1a825e26906039982328/solr/benchmark/src/java/org/apache/solr/bench/lifecycle/SolrStartup.java
> > > (scope is a bit larger but good enough) but we don't have continuous benchmarking over releases to make relative comparisons. We've been talking about that, but the recent discussions are unlikely to support a way to do this for embedded Solr. I've been working on this benchmark code lately as well. *Anyway*, I recommend that you try this benchmark, starting with its great README, mostly documenting JMH itself. If you do that and find some curious/suspicious things, I'd love to hear more!
> > >
> > > On Tue, Mar 24, 2026 at 3:51 AM Bram Luyten <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Disclaimer: I am a DSpace developer, not a Solr/Jetty internals expert. Much of the profiling and analysis below was done with heavy assistance from Claude. I'm sharing this because the data seems significant, but I may be misinterpreting some of it. Corrections and guidance are very welcome.
> > > >
> > > > CONTEXT
> > > > ---------------
> > > >
> > > > We are upgrading DSpace (open-source repository software) from Spring Boot 3 / Solr 8 to Spring Boot 4 / Solr 10. Our integration test suite uses embedded Solr via solr-core as a test dependency (EmbeddedSolrServer style, no HTTP traffic -- everything is in-process in a single JVM).
> > > >
> > > > After the upgrade, our IT suite went from ~31 minutes to ~2 hours in CI.
> > > > We spent considerable time profiling and eliminating other causes (Hibernate 7, Spring 7, H2 database, GC, lock contention, caching). Wall-clock profiling with async-profiler ultimately pointed to embedded Solr as the primary bottleneck.
> > > >
> > > > Note: we previously reported the Solr 10 POM issue with missing Jackson 2 dependency versions (solr-core, solr-solrj, solr-api). We have the workaround in place (explicit dependency declarations), so the embedded Solr 10 has a complete classpath.
> > > >
> > > > THE PROBLEM
> > > > ----------------------
> > > >
> > > > Wall-clock profiling (async-profiler -e wall) of the same test class (DiscoveryRestControllerIT, 83 tests) on both branches shows:
> > > >
> > > > Component       Main (Solr 8)   SB4 (Solr 10)   Difference
> > > > ----------------------------------------------------------------
> > > > Solr total      3.6s            11.5s           +7.9s
> > > > Hibernate       0.2s            0.2s            0.0s
> > > > H2 Database     0.1s            0.1s            0.0s
> > > > Spring          0.1s            0.1s            0.0s
> > > > Test total      68.4s           84.3s           +15.9s
> > > >
> > > > Solr accounts for 50% of the total wall-clock difference (7.9s out of 15.9s). Hibernate, H2, and Spring are essentially unchanged.
> > > >
> > > > THE ROOT CAUSE
> > > > ---------------------------
> > > >
> > > > Breaking down the Solr wall-clock time by operation:
> > > >
> > > > Operation                                   Main         SB4
> > > > ---------------------------------------------------------------
> > > > Jetty EatWhatYouKill.produce()              2558 (58%)   --
> > > > Jetty AdaptiveExecutionStrategy.produce()   --           12786 (91%)
> > > > DirectUpdateHandler2.commit()               522 (12%)    707 (5%)
> > > > SpellChecker.newSearcher()                  119 (3%)     261 (2%)
> > > >
> > > > (Numbers are async-profiler wall-clock samples)
> > > >
> > > > The dominant operation is Jetty's NIO selector execution strategy:
> > > >
> > > > - Solr 8 / Jetty 9: EatWhatYouKill.produce(): 2558 samples (58%)
> > > > - Solr 10 / Jetty 12: AdaptiveExecutionStrategy.produce(): 12786 samples (91%)
> > > > - That is a 5x increase in wall-clock time
> > > >
> > > > The full stack trace shows:
> > > >
> > > > ThreadPoolExecutor
> > > >   -> MDCAwareThreadPoolExecutor
> > > >   -> ManagedSelector (Jetty NIO selector)
> > > >   -> AdaptiveExecutionStrategy.produce()
> > > >   -> AdaptiveExecutionStrategy.tryProduce()
> > > >   -> AdaptiveExecutionStrategy.produceTask()
> > > >   -> ... -> KQueue.poll (macOS NIO)
> > > >
> > > > This is the Jetty HTTP client's NIO event loop. Even though we use EmbeddedSolrServer (no HTTP traffic), Solr 10's CoreContainer appears to create an internal Jetty HTTP client (likely for inter-shard communication via HttpJettySolrClient). In embedded single-node mode, this client has no work to do, but its NIO selector thread still runs, and AdaptiveExecutionStrategy.produce() idles much less efficiently than Jetty 9's EatWhatYouKill did.
> > > >
> > > > On macOS this manifests as busy-polling in KQueue.poll. The impact may differ on Linux (epoll).
> > > >
> > > > PROFILING METHODOLOGY
> > > > -----------------------------------------
> > > >
> > > > - Tool: async-profiler 4.3 (wall-clock mode, safepoint-free)
> > > > - JDK: OpenJDK 21.0.9
> > > > - Both branches use the same H2 2.4.240 test database
> > > > - Both branches use the same test code and Solr schema/config
> > > > - The only Solr-related difference is the Solr version (8.11.4 vs 10.0.0)
> > > > - Profiling was done on macOS (Apple Silicon), but the CI slowdown (GitHub Actions, Ubuntu) shows the same pattern at larger scale
> > > >
> > > > WHAT WE RULED OUT
> > > > ---------------------------------
> > > >
> > > > Before identifying the Solr/Jetty issue, we investigated and ruled out many other causes:
> > > >
> > > > - Hibernate 7 overhead: SQL query count is similar (fewer on SB4), query execution time is <40ms total for 1400+ queries
> > > > - H2 database: same version (2.4.240) on both branches, negligible wall-clock difference
> > > > - GC pauses: only +0.7s extra on SB4 (1.4% of total difference)
> > > > - Lock contention: main actually has MORE lock contention than SB4
> > > > - Hibernate session.clear(): tested with/without, no effect
> > > > - JaCoCo coverage: tested with/without, no effect
> > > > - Hibernate caching (L2, query cache): disabled both, no effect
> > > > - Hibernate batch fetch size: tested, no effect
> > > >
> > > > QUESTIONS FOR THE SOLR TEAM
> > > > --------------------------------------------------
> > > >
> > > > 1. Does embedded mode (EmbeddedSolrServer / CoreContainer without an HTTP listener) need to create a Jetty HTTP client at all? If the client is only for shard-to-shard communication, it seems unnecessary in single-node embedded testing.
> > > >
> > > > 2. If the HTTP client is required, can its NIO selector / thread pool be configured with minimal resources for embedded mode?
> > > > (e.g., fewer selector threads, smaller thread pool, or an idle-friendly execution strategy)
> > > >
> > > > 3. Is there a Solr configuration (solr.xml property, system property, or CoreContainer API) that we can use from the consuming application to reduce this overhead?
> > > >
> > > > 4. Is this specific to macOS (KQueue) or does it also affect Linux (epoll)? Our CI runs on Ubuntu and shows a larger slowdown (3.8x) than local macOS (1.28x), which could be related.
> > > >
> > > > ENVIRONMENT
> > > > -----------------------
> > > >
> > > > Solr: 10.0.0 (solr-core as test dependency for embedded server)
> > > > Jetty: 12.0.x (pulled in transitively by Solr 10)
> > > > JDK: 21
> > > > OS: macOS (profiled), Ubuntu (CI where the 4x slowdown manifests)
> > > > Project: DSpace (https://github.com/DSpace/DSpace)
> > > > PR: https://github.com/DSpace/DSpace/pull/11810
> > > >
> > > > Happy to provide the full async-profiler flame graph files or additional profiling data if useful.
> > > >
> > > > Thanks,
> > > > Bram Luyten, Atmire
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
