[PR] Reduce regular CI test suite runtime [solr]

via GitHub Tue, 02 Jun 2026 04:06:08 -0700


janhoy opened a new pull request, #4495:
URL: https://github.com/apache/solr/pull/4495


   This PR reduces regular CI test suite runtime through three strategies:
   1. **Fix one severely under-optimized test** (10x speedup)
   2. **Calibrate `@Repeat` iteration counts** for regular vs nightly CI
   3. **Move 10 slow integration/stress tests to `@Nightly`**
   
   Tests annotated `@Nightly` continue to run in the dedicated nightly CI job 
(`-Ptests.nightly=true`) with no loss of coverage. Regular PR/branch CI becomes 
significantly faster.
   
   ## Strategy 1 — Fix inefficient test structure
   
   ### `DistributedCombinedQueryComponentTest` (~80s → ~10s, **10x speedup**)
   
   **Root cause:** The test had 6 separate `@Test` methods, all operating on an 
identical document set. `BaseDistributedSearchTestCase` uses a method-level 
`@Rule` (`ShardsRepeatRule`), not a `@ClassRule`, so it creates and destroys 
the full distributed cluster for *each test method*. With 8 test methods in 
total (these 6 plus 2 others), the setup/teardown overhead (≈9s per method) 
dominated the 80s runtime.
   
   **Fix:** Merged the 6 same-dataset methods into a single 
`testCombinedQueries()` method, reducing cluster lifecycles from 8 to 3. All 
assertions for single-lexical matching, multi-lexical matching, sorting, 
pagination, faceting, and facet+highlighting are preserved — just executed 
within one cluster lifecycle.
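
The effect of the merge can be illustrated with a minimal, self-contained sketch. It does not use the Solr test framework: `ClusterLifecycleSketch`, `run`, `separateMethods`, and `mergedMethods` are hypothetical names, and the "cluster" is only a counter. The sketch assumes, as described above, that a method-level rule wraps every test method in its own cluster start and stop.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ClusterLifecycleSketch {
    // Number of simulated cluster start/stop cycles in the current run.
    static int lifecycles;
    // Number of simulated assertion groups executed in the current run.
    static int checksRun;

    // Stand-in for the body of one original test method.
    static final Runnable CHECK = () -> checksRun++;

    // Before: 6 same-dataset methods plus 2 others, each its own test method.
    static List<Runnable> separateMethods() {
        return Collections.nCopies(8, CHECK);
    }

    // After: the 6 same-dataset checks run inside one merged method.
    static List<Runnable> mergedMethods() {
        Runnable merged = () -> {
            for (int i = 0; i < 6; i++) {
                CHECK.run();
            }
        };
        return Arrays.asList(merged, CHECK, CHECK);
    }

    // Mimics a method-level rule: one cluster lifecycle around every method.
    // Returns {lifecycles, checksRun} for the given list of test methods.
    static int[] run(List<Runnable> testMethods) {
        lifecycles = 0;
        checksRun = 0;
        for (Runnable testMethod : testMethods) {
            lifecycles++;      // cluster start-up (the expensive part)
            testMethod.run();  // the assertions themselves
            // cluster shutdown would happen here
        }
        return new int[] {lifecycles, checksRun};
    }

    public static void main(String[] args) {
        int[] before = run(separateMethods());
        int[] after = run(mergedMethods());
        System.out.println("before: lifecycles=" + before[0] + ", checks=" + before[1]);
        System.out.println("after:  lifecycles=" + after[0] + ", checks=" + after[1]);
    }
}
```

In this model the number of checks stays at 8 while the number of lifecycles drops from 8 to 3. The trade-off is that a failure in the merged method is reported once rather than per scenario.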
   
   ## Strategy 2 — Calibrate `@Repeat` counts (regular CI vs nightly)
   
   `@Repeat` requires a compile-time constant, and `TEST_NIGHTLY` is only 
resolved at runtime, so `TEST_NIGHTLY ? N : M` cannot be used directly in the 
annotation. The solution uses the **subclass pattern**: reduce the count in the 
base class for regular CI, then create a one-liner `*NightlyTest` subclass 
annotated `@Nightly @Repeat(originalCount)` that inherits all tests and runs 
with the full count nightly.
   
   This preserves all framework semantics: each iteration gets a distinct 
random seed, independent setup/teardown, separate failure reporting, and unique 
test naming — benefits that would be lost by converting to a plain loop.
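
The subclass pattern described above can be sketched as follows. To stay compilable without the test framework on the classpath, this sketch declares its own stand-in `Repeat` and `Nightly` annotations; they are hypothetical simplifications, not the real framework annotations, and `SampleTest`, `SampleNightlyTest`, `effectiveIterations`, and `nightlyOnly` are illustrative names. The stand-in `Repeat` is assumed to be `@Inherited`, so an annotation declared on the subclass overrides the inherited one.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class RepeatSubclassSketch {

    // Stand-in for a class-level repeat annotation (hypothetical).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @Inherited
    @interface Repeat {
        int iterations();
    }

    // Stand-in marker for "run only in the nightly job" (hypothetical).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Nightly {}

    // Base class: reduced count for regular CI. The value must be a
    // compile-time constant; a runtime flag cannot be used here.
    @Repeat(iterations = 2)
    static class SampleTest {
        // test methods would live here
    }

    // One-liner subclass: inherits all tests, restores the full count,
    // and is marked nightly-only.
    @Nightly
    @Repeat(iterations = 10)
    static class SampleNightlyTest extends SampleTest {}

    // Reads the repeat count a runner would see for the given class.
    static int effectiveIterations(Class<?> testClass) {
        Repeat repeat = testClass.getAnnotation(Repeat.class);
        return repeat == null ? 1 : repeat.iterations();
    }

    // True if the class itself is marked nightly-only.
    static boolean nightlyOnly(Class<?> testClass) {
        return testClass.isAnnotationPresent(Nightly.class);
    }

    public static void main(String[] args) {
        System.out.println("regular: " + effectiveIterations(SampleTest.class));
        System.out.println("nightly: " + effectiveIterations(SampleNightlyTest.class));
    }
}
```

Because the subclass declares its own `@Repeat`, reflection returns the subclass value (10) rather than the inherited one (2), while the base class keeps the reduced count.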
   
   | Test class | Regular CI | Nightly | Nightly subclass |
   |---|---|---|---|
   | `RandomizedTaggerTest` | 2 iterations | 10 iterations | `RandomizedTaggerNightlyTest` |
   | `TestSolr4Spatial2` (`testLLPDecodeIsStableAndPrecise`) | 2 iterations | 10 iterations | `TestSolr4Spatial2Nightly` |
   | `SpatialHeatmapFacetsTest` (`testPng`) | 1 iteration | 3 iterations | `SpatialHeatmapFacetsNightlyTest` |
   | `CloudExitableDirectoryReaderTest` (`testCreepThenBite`) | 2 iterations | 5 iterations | `CloudExitableDirectoryReaderNightlyTest` |
   
   ## Strategy 3 — Move 10 tests to `@Nightly`
   
   These tests are slow not because of a fixable design issue, but because they 
are **inherently integration/stress tests** that exercise complex distributed 
behavior, external infrastructure, or require many repetitions to catch race 
conditions. They belong in nightly CI.
   
   | Test | Module | Why it's slow |
   |---|---|---|
   | `RollingRestartTest` | `solr:core` | Repeatedly stops/starts Jetty nodes and waits for overseer leader election across up to 16 nodes. Even at the minimum 2 restarts, the ZooKeeper coordination overhead makes this a stress test, not a unit test. |
   | `SyncSliceTest` | `solr:core` | Exercises leader election and peer-sync after deliberate shard inconsistency. Uses 4–7 shard nodes; deliberately indexes to skip servers and waits for recovery. |
   | `RecoveryZkTest` | `solr:core` | Indexes up to 3000 docs across two concurrent threads, stops/restarts a replica mid-index, then waits for full replication. The `if (!TEST_NIGHTLY)` branch also reveals it was written with nightly in mind. |
   | `UnloadDistributedZkTest` | `solr:core` | Exercises core unloading, ZK state transitions, and replica removal across a distributed cluster. Heavy ZooKeeper interaction throughout. |
   | `SolrAndKafkaIntegrationTest` | `solr:cross-dc-manager` | Requires starting an embedded Kafka cluster (`EmbeddedKafkaCluster`) alongside a full SolrCloud cluster. The external broker startup/shutdown alone makes this integration-only. |
   | `GCSIncrementalBackupTest` | `solr:modules:gcs-repository` | Full GCS backup-and-restore integration test: creates a collection, indexes docs, backs up to GCS, restores, verifies. Inherently I/O and cluster-heavy. |
   | `S3IncrementalBackupTest` | `solr:modules:s3-repository` | Same as above for S3, using an embedded `S3MockRule`. Full backup lifecycle per test method. |
   | `BadClusterTest` | `solr:solrj-streaming` | Progressively degrades a live cluster across ordered test scenarios (stopping replicas, killing leaders) to verify streaming behavior under failure. The cluster worsens through the test by design. |
   | `PerReplicaStatesIntegrationTest` | `solr:solrj` | Creates multiple full MiniSolrCloudClusters within a single test class. Even the class Javadoc notes: *"This test would be faster if we simulated the ZK state instead."* |
   | `TestPullReplica` | `solr:core` | Has `@Repeat(30)` on `testCreateDelete`, i.e. 30 full collection create/delete cycles. Multiple other methods exercise pull replica replication, which requires waiting for index replication to complete. One of the heaviest cloud tests in the suite. |
   



