janhoy opened a new pull request, #4495: URL: https://github.com/apache/solr/pull/4495
This PR reduces regular CI test suite runtime through three strategies:

1. **Fix one severely under-optimized test** (10x speedup)
2. **Calibrate `@Repeat` iteration counts** for regular vs nightly CI
3. **Move 10 slow integration/stress tests to `@Nightly`**

Tests annotated `@Nightly` continue to run in the dedicated nightly CI job (`-Ptests.nightly=true`) with no loss of coverage. Regular PR/branch CI becomes significantly faster.

## Strategy 1: Fix inefficient test structure

### `DistributedCombinedQueryComponentTest` (~80s → ~10s, **10x speedup**)

**Root cause:** The test had 6 separate `@Test` methods, all operating on an identical document set. `BaseDistributedSearchTestCase` uses a method-level `@Rule` (`ShardsRepeatRule`), not a `@ClassRule`, so it creates and destroys the full distributed cluster for *each test method*. With these 6 methods plus 2 others (8 in total), the setup/teardown overhead (≈9s each) dominated the 80s runtime.

**Fix:** Merged the 6 same-dataset methods into a single `testCombinedQueries()` method, reducing cluster lifecycles from 8 to 3. All assertions for single-lexical matching, multi-lexical matching, sorting, pagination, faceting, and facet+highlighting are preserved; they now execute within one cluster lifecycle.

## Strategy 2: Calibrate `@Repeat` counts (regular CI vs nightly)

`@Repeat` requires a compile-time constant, so `TEST_NIGHTLY ? N : M` cannot be used directly in the annotation. The solution uses the **subclass pattern**: reduce the count in the base class for regular CI, then create a one-liner `*NightlyTest` subclass annotated `@Nightly @Repeat(originalCount)` that inherits all tests and runs with the full count nightly. This preserves all framework semantics: each iteration gets a distinct random seed, independent setup/teardown, separate failure reporting, and unique test naming. Those benefits would be lost by converting to a plain loop.

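As a rough illustration of the subclass pattern described above, the following self-contained sketch shows why a subclass can carry a different repeat count than its base class. It is not the PR's code: the `Repeat` and `Nightly` annotations here are local stand-ins defined only for this example (the real ones come from the test framework), and `RepeatSubclassSketch`, `RandomizedExampleTest`, `RandomizedExampleNightlyTest`, `iterationsFor`, and `isNightlyOnly` are hypothetical names. The sketch assumes the repeat annotation is applied at class level and is inheritable.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Sketch only: the annotations below are local stand-ins, not the real framework types. */
class RepeatSubclassSketch {

  /** Hypothetical stand-in for a class-level, inheritable repeat annotation. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @Inherited
  @interface Repeat {
    int iterations(); // annotation values must be compile-time constants
  }

  /** Hypothetical stand-in for a marker that restricts a class to the nightly job. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface Nightly {}

  /** Base class: reduced count, runs in regular CI. */
  @Repeat(iterations = 2)
  static class RandomizedExampleTest {
    // Test methods would live here and are inherited by the subclass.
  }

  /** One-liner subclass: restores the original count and is nightly-only. */
  @Nightly
  @Repeat(iterations = 10)
  static class RandomizedExampleNightlyTest extends RandomizedExampleTest {}

  /** Returns the effective class-level count; a class's own annotation wins over an inherited one. */
  static int iterationsFor(Class<?> testClass) {
    Repeat repeat = testClass.getAnnotation(Repeat.class);
    return repeat == null ? 1 : repeat.iterations();
  }

  /** True only when the marker is declared on the class itself (the stand-in is not inherited). */
  static boolean isNightlyOnly(Class<?> testClass) {
    return testClass.isAnnotationPresent(Nightly.class);
  }

  public static void main(String[] args) {
    System.out.println("regular CI iterations: " + iterationsFor(RandomizedExampleTest.class));
    System.out.println("nightly iterations: " + iterationsFor(RandomizedExampleNightlyTest.class));
    System.out.println("nightly-only subclass: " + isNightlyOnly(RandomizedExampleNightlyTest.class));
  }
}
```

Because both counts are literal constants, neither annotation needs a runtime expression such as `TEST_NIGHTLY ? N : M`; the choice between them is made by which class the CI job runs.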
| Test class | Regular CI | Nightly | Nightly subclass |
|---|---|---|---|
| `RandomizedTaggerTest` | 2 iterations | 10 iterations | `RandomizedTaggerNightlyTest` |
| `TestSolr4Spatial2` (`testLLPDecodeIsStableAndPrecise`) | 2 iterations | 10 iterations | `TestSolr4Spatial2Nightly` |
| `SpatialHeatmapFacetsTest` (`testPng`) | 1 iteration | 3 iterations | `SpatialHeatmapFacetsNightlyTest` |
| `CloudExitableDirectoryReaderTest` (`testCreepThenBite`) | 2 iterations | 5 iterations | `CloudExitableDirectoryReaderNightlyTest` |

## Strategy 3: Move 10 tests to `@Nightly`

These tests are slow not because of a fixable design issue, but because they are **inherently integration/stress tests** that exercise complex distributed behavior, depend on external infrastructure, or require many repetitions to catch race conditions. They belong in nightly CI.

| Test | Module | Why it's slow |
|---|---|---|
| `RollingRestartTest` | `solr:core` | Repeatedly stops/starts Jetty nodes and waits for overseer leader election across up to 16 nodes. Even at the minimum 2 restarts, the ZooKeeper coordination overhead makes this a stress test, not a unit test. |
| `SyncSliceTest` | `solr:core` | Exercises leader election and peer-sync after deliberate shard inconsistency. Uses 4–7 shard nodes; deliberately indexes to skip servers and waits for recovery. |
| `RecoveryZkTest` | `solr:core` | Indexes up to 3000 docs across two concurrent threads, stops/restarts a replica mid-index, then waits for full replication. The `if (!TEST_NIGHTLY)` branch also reveals it was written with nightly in mind. |
| `UnloadDistributedZkTest` | `solr:core` | Exercises core unloading, ZK state transitions, and replica removal across a distributed cluster. Heavy ZooKeeper interaction throughout. |
| `SolrAndKafkaIntegrationTest` | `solr:cross-dc-manager` | Requires starting an embedded Kafka cluster (`EmbeddedKafkaCluster`) alongside a full SolrCloud cluster. The external broker startup/shutdown alone makes this integration-only. |
| `GCSIncrementalBackupTest` | `solr:modules:gcs-repository` | Full GCS backup-and-restore integration test: creates a collection, indexes docs, backs up to GCS, restores, verifies. Inherently I/O and cluster-heavy. |
| `S3IncrementalBackupTest` | `solr:modules:s3-repository` | Same as the GCS test, but for S3, using an embedded `S3MockRule`. Full backup lifecycle per test method. |
| `BadClusterTest` | `solr:solrj-streaming` | Progressively degrades a live cluster across ordered test scenarios (stopping replicas, killing leaders) to verify streaming behavior under failure. The cluster worsens through the test by design. |
| `PerReplicaStatesIntegrationTest` | `solr:solrj` | Creates multiple full MiniSolrCloudClusters within a single test class. Even the class Javadoc notes: *"This test would be faster if we simulated the ZK state instead."* |
| `TestPullReplica` | `solr:core` | Has `@Repeat(30)` on `testCreateDelete`, i.e. 30 full collection create/delete cycles. Multiple other methods exercise pull replica replication, which requires waiting for index replication to complete. One of the heaviest cloud tests in the suite. |
