kerneltime commented on PR #10473: URL: https://github.com/apache/ozone/pull/10473#issuecomment-4663993052
## Verification update on the AWS EC2 / EKS reproduction note

In a [previous comment](https://github.com/apache/ozone/pull/10473#issuecomment-4663210910) on this PR I described why the bug manifests on AWS EC2 / EKS but is hard to reproduce on a local laptop or OpenStack, and I included an "iptables-DROP recipe" that I described as a way to simulate the AWS silent-drop shape on a laptop. **At the time I wrote that comment I had not actually run the recipe end-to-end.** This follow-up posts what I observed when I did run it, marks which claims are now empirically verified, and retracts what I cannot back with evidence.

### Setup

Built `apache/ozone:HDDS-15514-clean` (the PR head, equivalent to upstream master under the default flag) into a runtime distribution and ran the `compose/ozone-ha` stack on OrbStack 29.4.0 docker engine on macOS.

To simulate AWS VPC silent-drop behaviour, I inserted an `iptables -j DROP` rule into the docker engine VM's `DOCKER-USER` chain (via a privileged host-network sidecar container with `cap_add: NET_ADMIN`), targeting scm1's IP on the SCM RPC ports (9861/9863/9876/9894). The rule drops packets *destined* for scm1 silently: no RST, no ICMP. From the DataNode's perspective: SYN sent, no response, kernel TCP stack times out.

I confirmed the iptables-DROP mechanism works end-to-end with two test containers before running the experiment:

```
--- Baseline ping ---
1 packets transmitted, 1 packets received, 0% packet loss
--- Installing DROP rule ---
Chain DOCKER-USER (1 references)
num  pkts bytes target  destination
1       0     0 DROP    192.168.97.2
--- After DROP rule ---
1 packets transmitted, 0 packets received, 100% packet loss
Ping elapsed: 5s
--- TCP connect test ---
nc: 192.168.97.2 (192.168.97.2:8080): Operation timed out
TCP connect elapsed: 6s
```

Silent timeout shape, not RST. This is what the AWS VPC data plane produces during ENI churn / NLB target deregistration.
The recipe in my previous comment is now empirically verified to produce this shape on a laptop.

### What the OFF case (default flag, pre-PR behaviour) produced

With `ozone.client.failover.resolve-needed` defaulted to `false`, the DN dialed scm1 successfully at startup, the iptables drop engaged, and the DN's heartbeat retry loop wedged. Excerpts from `ozone-ha-datanode-1` after the drop engaged (BREAK_AT in human terms = "the moment iptables started dropping packets to scm1"):

```
20:41:41 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 0 time(s); maxRetries=45
20:42:01 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 1 time(s); maxRetries=45
20:42:21 WARN datanode.RunningDatanodeState: Detected timeout: Timeout occurred on endpoint: scm1/192.168.97.6:9861
20:42:21 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 2 time(s); maxRetries=45
20:42:51 WARN datanode.RunningDatanodeState: Detected timeout: Timeout occurred on endpoint: scm1/192.168.97.6:9861
20:43:01 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 4 time(s); maxRetries=45
20:43:21 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 5 time(s); maxRetries=45
20:43:41 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 6 time(s); maxRetries=45
20:44:01 Retrying connect to server: scm1/192.168.97.6:9861. Already tried 7 time(s); maxRetries=45
```

Two things to call out:

1. **The exception is `Timeout`, not `Connection refused`.** Note the `Detected timeout: Timeout occurred on endpoint: scm1/...` log lines at 20:42:21 and 20:42:51. This is the AWS-shape silent-timeout. By contrast, the same DN logs `java.net.ConnectException: Connection refused` for SCM2 and SCM3 (which are alive but in safe-mode rejecting heartbeats from a single-DN cluster); that is the local-laptop / OpenStack fast-fail shape.
   Same DN, same docker bridge, but the iptables DROP rule produces the silent-timeout shape and the live SCMs produce ConnectException. **This is the same regime distinction my previous comment described, demonstrated in one log file.**
2. **No DNS re-resolution log appears.** The DN does not recover. It keeps retrying against the cached `192.168.97.6:9861` indefinitely. This is the wedge the PR is intended to fix. The 180-second observation window saw 7+ retry rounds with no progress.

This empirically verifies the bug shape AND the regime distinction my previous comment claimed.

### What the ON case (flag enabled) produced: partial verification, honest gap

I re-ran the same experiment with `ozone.client.failover.resolve-needed=true` and `ozone.datanode.scm.heartbeat.address.refresh.threshold=2` injected via the `docker-config` env file. Confirmed the flags reached the DN:

```
$ docker exec ozone-ha-datanode-1 cat /etc/hadoop/ozone-site.xml | grep -A 1 "resolve-needed\|refresh.threshold"
<property><name>ozone.client.failover.resolve-needed</name><value>true</value></property>
<property><name>ozone.datanode.scm.heartbeat.address.refresh.threshold</name><value>2</value></property>
```

Within the 180-second observation window, **the DN did NOT log a `DNS re-resolution: SCM endpoint ... -> ...` event**. The retry pattern looked the same as the OFF case for the duration I observed:

```
20:56:58 Retrying connect to server: scm1/192.168.97.5:9861. Already tried 0 time(s); maxRetries=45
20:57:18 Retrying connect to server: scm1/192.168.97.5:9861. Already tried 1 time(s); maxRetries=45
...
20:58:08 WARN datanode.RunningDatanodeState: Detected timeout: Timeout occurred on endpoint: scm1/192.168.97.5:9861
...
```

I have not conclusively determined within this window whether the absence of the refresh log is because:

- The trigger condition `rpcEndpoint.getMissedCount() >= refreshThreshold` was not yet met within 180 seconds (the IPC client's inner retry budget at `maxRetries=45 × 20s ≈ 900s` may absorb connection failures internally before they surface as an IOException in `HeartbeatEndpointTask.call()`'s catch block, which is what increments missedCount via `logIfNeeded`); or
- The hostname-string `getHostAndPort()` is null on this endpoint for some reason (legacy path); or
- Something in the docker-compose setup means the flag is set but a related plumbing constraint isn't met.

A longer observation window (say 20 minutes) is the obvious next step but exceeds the time budget I had for this verification round. The unit tests in the PR (`TestHeartbeatEndpointTaskDnsRefresh`) drive the catch-block trigger end-to-end with mocked `sendHeartbeat` exceptions and DO observe the refresh fire when `missedCount >= threshold` and the cause is connection-class. So the trigger logic is unit-test-verified; my docker-level reproduction simply did not exercise it within the 180-second window I gave it.

### What I'm retracting and what stands

**Retracting**:

- "Post-PR, with `ozone.client.failover.resolve-needed=true` ... recovery happens automatically." This was an inference from the unit tests, not from running the docker-compose experiment. I have unit-test evidence that the trigger logic works; I do not have docker-compose evidence that recovery completes within a heartbeat-cycle window in this specific rig.

**Standing (now empirically verified)**:

- The iptables-DROP recipe produces the AWS-shape silent-timeout, not RST.
- The DN's pre-PR retry loop wedges indefinitely against the cached IP under that shape.
- The exception type the DN sees IS `Timeout occurred on endpoint`, distinct from the `ConnectException: Connection refused` that healthy bridge networking produces, which is exactly the regime distinction my previous comment claimed.

### Reproduction artifacts

For anyone running this locally, the script I used:

- Launches the `ozone-ha` compose stack (3 OMs, 3 SCMs, 1 DN).
- Captures scm1's bridge IP.
- Starts a privileged host-network sidecar that runs `iptables -I DOCKER-USER 1 -d <scm1-ip> -p tcp --dport 9861 -j DROP` (and the same for ports 9863, 9876, 9894).
- Sleeps 180 seconds and then dumps the DN log for events of interest.

Caveat I want to flag: OrbStack reuses bridge IPs aggressively; my earlier attempts to "rotate" scm1's IP via `docker network disconnect/connect` produced the same IP on reconnect. The iptables-DROP-without-IP-rotation approach above sidesteps that and is the cleanest local reproduction I have.

This whole comment is the empirical follow-up to the previous reasoning-from-first-principles comment. The verification rule says I should distinguish what I've run from what I've inferred; this comment does that.
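
For reference, the rule-installation step listed under "Reproduction artifacts" can be sketched roughly as below. This is a minimal dry-run illustration, not the script used for the run above: the function names `print_drop_rules` and `print_cleanup_rules` are hypothetical, and the functions only print the `iptables` commands (one per SCM RPC port named in this comment) rather than executing them.

```shell
# Dry-run sketch: prints iptables commands instead of running them.
# Function names here are illustrative and not part of the PR or of Ozone.

# SCM RPC ports targeted in the experiment described above.
SCM_PORTS="9861 9863 9876 9894"

# Print one silent-DROP rule per SCM port for the given destination IP.
print_drop_rules() {
  scm_ip="$1"
  for port in $SCM_PORTS; do
    echo "iptables -I DOCKER-USER 1 -d $scm_ip -p tcp --dport $port -j DROP"
  done
}

# Print the matching delete commands so the rules can be removed afterwards.
print_cleanup_rules() {
  scm_ip="$1"
  for port in $SCM_PORTS; do
    echo "iptables -D DOCKER-USER -d $scm_ip -p tcp --dport $port -j DROP"
  done
}
```

Inside a privileged host-network container with `NET_ADMIN`, something like `print_drop_rules "$SCM1_IP" | sh` would install the rules, assuming `SCM1_IP` (a placeholder variable) holds scm1's bridge IP. Printing first makes it easy to review the rules before applying them, and `print_cleanup_rules` emits the corresponding `-D` commands for teardown.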
