kerneltime commented on PR #10473:
URL: https://github.com/apache/ozone/pull/10473#issuecomment-4663993052

   ## Verification update on the AWS EC2 / EKS reproduction note
   
   In a [previous 
comment](https://github.com/apache/ozone/pull/10473#issuecomment-4663210910) on 
this PR I described why the bug manifests on AWS EC2 / EKS but is hard to 
reproduce on a local laptop or OpenStack, and I included an "iptables-DROP 
recipe" that I described as a way to simulate the AWS silent-drop shape on a 
laptop. **At the time I wrote that comment I had not actually run the recipe 
end-to-end.** This follow-up posts what I observed when I did run it, marks 
which claims are now empirically verified, and retracts what I cannot back with 
evidence.
   
   ### Setup
   
   Built `apache/ozone:HDDS-15514-clean` (the PR head, equivalent to upstream 
master under the default flag) into a runtime distribution and ran the 
`compose/ozone-ha` stack on OrbStack 29.4.0 docker engine on macOS. To simulate 
AWS VPC silent-drop behaviour, I inserted an `iptables -j DROP` rule into the 
docker engine VM's `DOCKER-USER` chain (via a privileged host-network sidecar 
container with `cap_add: NET_ADMIN`), targeting scm1's IP on the SCM RPC ports 
(9861/9863/9876/9894). The rule drops packets *destined* for scm1 silently — no 
RST, no ICMP. From the DataNode's perspective: SYN sent, no response, kernel 
TCP stack times out.
   
   I confirmed the iptables-DROP mechanism works end-to-end with two test 
containers before running the experiment:
   
   ```
   --- Baseline ping ---
   1 packets transmitted, 1 packets received, 0% packet loss
   
   --- Installing DROP rule ---
   Chain DOCKER-USER (1 references)
   num   pkts bytes target  destination
   1        0     0 DROP    192.168.97.2
   
   --- After DROP rule ---
   1 packets transmitted, 0 packets received, 100% packet loss
   Ping elapsed: 5s
   
   --- TCP connect test ---
   nc: 192.168.97.2 (192.168.97.2:8080): Operation timed out
   TCP connect elapsed: 6s
   ```
   
   Silent timeout shape, not RST. This is what the AWS VPC data plane produces 
during ENI churn / NLB target deregistration. The recipe in my previous comment 
is now empirically verified to produce this shape on a laptop.
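   
   For anyone who wants to check the two shapes without reading `nc` output by eye, here is a minimal sketch. The helper names `classify_probe_status` and `probe_tcp` are hypothetical, and it assumes bash (for `/dev/tcp`) plus GNU `timeout`, which exits with status 124 when the deadline is hit.

```shell
#!/usr/bin/env bash
# Sketch only: tell a silent drop (timeout) apart from a fast failure.
# classify_probe_status and probe_tcp are hypothetical helper names.

# Map the exit status of the timed connect attempt to a label.
# GNU timeout exits with 124 when it had to kill the command.
classify_probe_status() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timeout" ;;
    *)   echo "refused-or-error" ;;
  esac
}

# Try one TCP connect with a deadline in seconds (default 5).
probe_tcp() {
  local host="$1" port="$2" deadline="${3:-5}"
  timeout "$deadline" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
  classify_probe_status "$?"
}
```

   With the DROP rule in place, `probe_tcp 192.168.97.2 8080` should print `timeout` if the rule behaves as in the transcript above; a port with no listener and no DROP rule should fall into `refused-or-error`.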
   
   ### What the OFF case (default flag, pre-PR behaviour) produced
   
   With `ozone.client.failover.resolve-needed` defaulted to `false`, the DN 
dialed scm1 successfully at startup, the iptables drop engaged, and the DN's 
heartbeat retry loop wedged. Excerpts from `ozone-ha-datanode-1` after the drop 
engaged (BREAK_AT in human terms = "the moment iptables started dropping 
packets to scm1"):
   
   ```
   20:41:41  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 0 time(s); maxRetries=45
   20:42:01  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 1 time(s); maxRetries=45
   20:42:21  WARN datanode.RunningDatanodeState: Detected timeout: Timeout occurred on endpoint: scm1/192.168.97.6:9861
   20:42:21  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 2 time(s); maxRetries=45
   20:42:51  WARN datanode.RunningDatanodeState: Detected timeout: Timeout occurred on endpoint: scm1/192.168.97.6:9861
   20:43:01  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 4 time(s); maxRetries=45
   20:43:21  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 5 time(s); maxRetries=45
   20:43:41  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 6 time(s); maxRetries=45
   20:44:01  Retrying connect to server: scm1/192.168.97.6:9861. Already tried 7 time(s); maxRetries=45
   ```
   
   Two things to call out:
   
   1. **The exception is `Timeout`, not `Connection refused`.** Note the 
`Detected timeout: Timeout occurred on endpoint: scm1/...` log lines at 20:42:21 
and 20:42:51. This is the AWS-shape silent timeout. By contrast, the same DN 
logs `java.net.ConnectException: Connection refused` for SCM2 and SCM3 (which 
are alive but in safe-mode rejecting heartbeats from a single-DN cluster) — 
that's the local-laptop / OpenStack fast-fail shape. Same DN, same docker 
bridge, but the iptables DROP rule produces the silent-timeout shape and the 
live SCMs produce ConnectException. **This is the same regime distinction my 
previous comment described, demonstrated in one log file.**
   
   2. **No DNS re-resolution log appears.** The DN does not recover. It keeps 
retrying against the cached `192.168.97.6:9861` indefinitely. This is the wedge 
the PR is meant to fix. The 180-second observation window saw 7+ retry rounds 
with no progress.
   
   This empirically verifies the bug shape AND the regime distinction my 
previous comment claimed.
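   
   To make the comparison repeatable, the three log shapes discussed here can be tallied from a DN log. This is a sketch, not the script I ran: `tally_dn_events` is a hypothetical helper, and the patterns are taken from the log lines quoted in this comment.

```shell
# Sketch only: count the log shapes discussed above.
# tally_dn_events is a hypothetical helper; it reads a DataNode log on stdin.
tally_dn_events() {
  awk '
    /Detected timeout/   { timeout++ }   # silent-drop shape
    /Connection refused/ { refused++ }   # fast-fail shape
    /DNS re-resolution/  { refresh++ }   # refresh event, if it ever fires
    END { printf "timeout=%d refused=%d dns_refresh=%d\n", timeout, refused, refresh }
  '
}
```

   Usage would look like `docker logs ozone-ha-datanode-1 2>&1 | tally_dn_events`; a wedged DN should show a non-zero `timeout` count and `dns_refresh=0`.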
   
   ### What the ON case (flag enabled) produced: partial verification, honest gap
   
   I re-ran the same experiment with 
`ozone.client.failover.resolve-needed=true` and 
`ozone.datanode.scm.heartbeat.address.refresh.threshold=2` injected via the 
`docker-config` env file. Confirmed the flags reached the DN:
   
   ```
   $ docker exec ozone-ha-datanode-1 cat /etc/hadoop/ozone-site.xml | grep -A 1 "resolve-needed\|refresh.threshold"
   <property><name>ozone.client.failover.resolve-needed</name><value>true</value></property>
   <property><name>ozone.datanode.scm.heartbeat.address.refresh.threshold</name><value>2</value></property>
   ```
   
   Within the 180-second observation window, **the DN did NOT log a `DNS 
re-resolution: SCM endpoint ... -> ...` event**. The retry pattern looked the 
same as the OFF case for the duration I observed:
   
   ```
   20:56:58  Retrying connect to server: scm1/192.168.97.5:9861. Already tried 0 time(s); maxRetries=45
   20:57:18  Retrying connect to server: scm1/192.168.97.5:9861. Already tried 1 time(s); maxRetries=45
   ...
   20:58:08  WARN datanode.RunningDatanodeState: Detected timeout: Timeout occurred on endpoint: scm1/192.168.97.5:9861
   ...
   ```
   
   I have not conclusively determined within this window whether the absence of 
the refresh log is because:
   
   - The trigger condition `rpcEndpoint.getMissedCount() >= refreshThreshold` 
was not yet met within 180 seconds. The IPC client's inner retry budget 
(`maxRetries=45` at roughly 20 s per attempt, about 900 s) may absorb 
connection failures internally before they surface as an IOException in the 
catch block of `HeartbeatEndpointTask.call()`, which is what increments 
missedCount via `logIfNeeded`; or
   - The hostname string returned by `getHostAndPort()` is null on this 
endpoint for some reason (legacy path); or
   - Something in the docker-compose setup means the flag is set but a related 
plumbing constraint isn't met.
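   
   The arithmetic behind the first hypothesis, as a small sketch. `retry_budget_seconds` is a hypothetical helper; the 45 and 20 come from the `maxRetries=45` field and the roughly 20-second spacing of the retry lines in the DN log. Whether the inner retries really must be exhausted before `missedCount` moves is the unconfirmed part.

```shell
# Sketch only: compare the IPC client's inner retry budget with the
# observation window. retry_budget_seconds is a hypothetical helper.
retry_budget_seconds() {
  local max_retries="$1" interval_s="$2"
  echo $(( max_retries * interval_s ))
}

budget="$(retry_budget_seconds 45 20)"   # values read off the DN log
window=180                               # seconds I actually observed

echo "inner retry budget: ${budget}s, observation window: ${window}s"
if [ "$budget" -gt "$window" ]; then
  echo "window is shorter than the budget; the refresh trigger may not have been reachable yet"
fi
```

   Under that assumption the window would need to be longer than 900 seconds, which is why 20 minutes is the figure suggested for a follow-up run.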
   
   A longer observation window (say 20 minutes) is the obvious next step but 
exceeds the time budget I had for this verification round. The unit tests in 
the PR (`TestHeartbeatEndpointTaskDnsRefresh`) drive the catch-block trigger 
end-to-end with mocked `sendHeartbeat` exceptions and DO observe the refresh 
fire when `missedCount >= threshold` and the cause is connection-class. So the 
trigger logic is unit-test-verified; my docker-level reproduction simply did 
not exercise it within the 180-second window I gave it.
   
   ### What I'm retracting and what stands
   
   **Retracting**:
   - "Post-PR, with `ozone.client.failover.resolve-needed=true` ... recovery 
happens automatically." This was an inference from the unit tests, not from 
running the docker-compose experiment. I have unit-test evidence that the 
trigger logic works; I do not have docker-compose evidence that recovery 
completes within a heartbeat-cycle window in this specific rig.
   
   **Standing (now empirically verified)**:
   - The iptables-DROP recipe produces the AWS-shape silent-timeout, not RST.
   - The DN's pre-PR retry loop wedges indefinitely against the cached IP under 
that shape.
   - The exception type the DN sees IS `Timeout occurred on endpoint`, distinct 
from the `ConnectException: Connection refused` that healthy bridge networking 
produces — which is exactly the regime distinction my previous comment claimed.
   
   ### Reproduction artifacts
   
   For anyone running this locally, the script I used:
   
   - Launches the `ozone-ha` compose stack (3 OMs, 3 SCMs, 1 DN).
   - Captures scm1's bridge IP.
   - Starts a privileged host-network sidecar that runs `iptables -I 
DOCKER-USER 1 -d <scm1-ip> -p tcp --dport 9861 -j DROP` (and the same for ports 
9863, 9876, 9894).
   - Sleeps 180 seconds and then dumps the DN log for events of interest.
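   
   A rough sketch of those four steps follows. It is not the exact script I used. Assumptions: `SCM1_CONTAINER` defaults to `ozone-ha-scm1-1` (guessed from the `ozone-ha-datanode-1` naming), `IPTABLES_IMAGE` is a placeholder for any image that ships the `iptables` binary, and the helper names are hypothetical. Nothing touches docker unless `RUN_REPRO=1` is set.

```shell
#!/usr/bin/env bash
# Sketch only; run from the compose/ozone-ha directory.
# Assumptions (not from the original script):
#   SCM1_CONTAINER - scm1's container name, guessed from DN naming
#   IPTABLES_IMAGE - any image containing iptables; no default assumed
SCM_PORTS="9861 9863 9876 9894"

# Print one iptables command per SCM RPC port for the given IP.
drop_rule_cmds() {
  local ip="$1" port
  for port in $SCM_PORTS; do
    echo "iptables -I DOCKER-USER 1 -d $ip -p tcp --dport $port -j DROP"
  done
}

run_repro() {
  local scm1="${SCM1_CONTAINER:-ozone-ha-scm1-1}" ip
  # 1. Launch the stack.
  docker compose up -d || return 1
  # 2. Capture scm1's bridge IP.
  ip="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$scm1")" || return 1
  # 3. Install the DROP rules from a privileged host-network sidecar.
  docker run --rm --privileged --network host "${IPTABLES_IMAGE:?set IPTABLES_IMAGE}" \
    sh -c "$(drop_rule_cmds "$ip")" || return 1
  # 4. Wait, then dump the events of interest from the DN log.
  sleep 180
  docker logs ozone-ha-datanode-1 2>&1 |
    grep -E 'Retrying connect|Detected timeout|DNS re-resolution'
}

# Only touch docker when explicitly asked to.
if [ "${RUN_REPRO:-0}" = "1" ]; then
  run_repro
fi
```

   Keeping rule generation in `drop_rule_cmds` makes it easy to print and review the rules before they are installed in the engine VM.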
   
   Caveat I want to flag: OrbStack reuses bridge IPs aggressively; my earlier 
attempts to "rotate" scm1's IP via `docker network disconnect/connect` produced 
the same IP on reconnect. The iptables-DROP-without-IP-rotation approach above 
sidesteps that and is the cleanest local reproduction I have.
   
   This whole comment is the empirical follow-up to the previous 
reasoning-from-first-principles comment. The verification rule says I should 
distinguish what I've run from what I've inferred; this comment does that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

