ArafatKhan2198 opened a new pull request, #10617: URL: https://github.com/apache/ozone/pull/10617
## What changes were proposed in this pull request?

When an SCM follower restarts in an HA cluster, it used to start talking to datanodes **right away**, even while it was still catching up on the Ratis log. That caused problems:

- Datanodes report containers the follower doesn’t know about yet → **`CONTAINER_NOT_FOUND`**
- Or the follower tries to update container state and fails → **`NotLeaderException`**
- In both cases, **replica info gets dropped**
- If that SCM later becomes leader, containers can show **missing or wrong replicas**

**The fix:**

1. **Don’t start the datanode server in HA mode** during normal SCM startup.
2. **Wait until catch-up is done**, then start it from `SCMStateMachine`.
3. **Don’t let followers write container state changes** during report handling; only the leader should.

**Why:** Replica locations are rebuilt from datanode reports. Those reports must only be processed **after** the SCM has replayed all committed Ratis entries.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14989

## How was this patch tested?

### **Integration tests**

`TestSCMFollowerCatchupWithContainerReport`:

- `testFollowerCatchupAfterContainerClose`: close-while-down (HDDS-14989 scenario)
- `testFollowerCatchupAfterContainerCreate`: create-while-down (`CONTAINER_NOT_FOUND` scenario)
- `testFollowerCatchupOnIdleCluster`: idle cluster edge case

### **Manual test (docker-compose `ozone-ha`)**

Environment: `hadoop-ozone/dist/target/ozone-2.3.0-SNAPSHOT/compose/ozone-ha`

Config: RF=3, 3 datanodes, `hdds.container.report.interval=1h`, `ozone.scm.container.size=1GB`

**Procedure** (same for with/without fix):

1. Start cluster: `OZONE_REPLICATION_FACTOR=3 docker compose up -d --scale datanode=3`
2. Write 50 × 1MB keys to `vol1/buck1` (containers 1–3)
3. Stop follower **scm3**
4. Close containers 1, 2, 3
5. Write 50 × 1MB keys to `vol1/buck2` (creates containers 4, 5, 6 while scm3 is down)
6. Restart **scm3**
7. Transfer SCM leadership to scm3
8. Inspect scm3 logs and `ozone admin container info` for containers 4–6

**Without the fix:**

```
06:57:58.622 ScmDatanodeProtocol RPC server ... listening at /0.0.0.0:9861
06:57:58.837 CONTAINER_NOT_FOUND for Container #4
06:57:58.837 CONTAINER_NOT_FOUND for Container #5
06:57:58.837 CONTAINER_NOT_FOUND for Container #6
(6 errors total: 2 datanodes × 3 containers)
```

After leadership transfer, containers 4–6 had **1 replica each** (expected 3).

**With the fix:**

```
07:24:28.377 Follower caught up with leader: lastAppliedIndex=49, leaderCommit=49
07:24:28.378 ScmDatanodeProtocol RPC server ... listening at /0.0.0.0:9861
```

- `CONTAINER_NOT_FOUND` on scm3: **0**
- After leadership transfer, containers 4–6 each had **3 replicas** from all datanodes
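
### **Illustrative sketch of the startup gating**

The three-part fix described in this pull request can be illustrated with a small, self-contained Java sketch. It is not the code from the patch: `DatanodeServerGate`, `onScmStart`, `onLogApplied`, and `mayUpdateContainerState` are hypothetical names invented for this example, and the catch-up condition (`lastAppliedIndex >= leaderCommit`) is an assumption inferred from the "Follower caught up with leader" log line in the manual test.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch only; these names do not come from the Ozone code base.
 * Defers starting the datanode-facing server until the follower has applied
 * every entry the leader has committed.
 */
final class DatanodeServerGate {
  private final boolean haEnabled;
  private final Runnable startDatanodeServer;
  private final AtomicBoolean started = new AtomicBoolean(false);

  DatanodeServerGate(boolean haEnabled, Runnable startDatanodeServer) {
    this.haEnabled = haEnabled;
    this.startDatanodeServer = startDatanodeServer;
  }

  /** Normal SCM startup path: only a non-HA SCM starts the server here. */
  void onScmStart() {
    if (!haEnabled) {
      startOnce();
    }
  }

  /** State machine path: start the server once catch-up is complete. */
  void onLogApplied(long lastAppliedIndex, long leaderCommit) {
    // Assumed catch-up condition: nothing committed is left to replay.
    if (haEnabled && lastAppliedIndex >= leaderCommit) {
      startOnce();
    }
  }

  /** Report handling: only the leader may write container state changes. */
  static boolean mayUpdateContainerState(boolean isLeader) {
    return isLeader;
  }

  boolean isStarted() {
    return started.get();
  }

  private void startOnce() {
    // compareAndSet guarantees the start action runs at most once.
    if (started.compareAndSet(false, true)) {
      startDatanodeServer.run();
    }
  }
}

final class CatchupGateDemo {
  public static void main(String[] args) {
    DatanodeServerGate gate = new DatanodeServerGate(
        true, () -> System.out.println("datanode server started"));
    gate.onScmStart();          // HA mode: server stays down
    gate.onLogApplied(40, 49);  // still behind the leader
    gate.onLogApplied(49, 49);  // caught up: server starts
    gate.onLogApplied(50, 50);  // later entries do not start it again
  }
}
```

Running `CatchupGateDemo` prints the start message once, on the call where the applied index reaches the leader's commit index. The `AtomicBoolean` is one possible way to make the start idempotent if the callback can fire from more than one thread; the actual patch may handle this differently.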
