Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

via GitHub Tue, 09 Jun 2026 14:02:14 -0700


kerneltime commented on PR #10473:
URL: https://github.com/apache/ozone/pull/10473#issuecomment-4663993052

## Verification update on the AWS EC2 / EKS reproduction note

In a [previous
comment](https://github.com/apache/ozone/pull/10473#issuecomment-4663210910) on
this PR I described why the bug manifests on AWS EC2 / EKS but is hard to
reproduce on a local laptop or OpenStack, and I included an "iptables-DROP
recipe" that I described as a way to simulate the AWS silent-drop shape on a
laptop. **At the time I wrote that comment I had not actually run the recipe
end-to-end.** This follow-up posts what I observed when I did run it, marks
which claims are now empirically verified, and retracts what I cannot back with
evidence.

### Setup

Built `apache/ozone:HDDS-15514-clean` (the PR head, equivalent to upstream
master under the default flag) into a runtime distribution and ran the
`compose/ozone-ha` stack on OrbStack 29.4.0 docker engine on macOS. To simulate
AWS VPC silent-drop behaviour, I inserted an `iptables -j DROP` rule into the
docker engine VM's `DOCKER-USER` chain (via a privileged host-network sidecar
container with `cap_add: NET_ADMIN`), targeting scm1's IP on the SCM RPC ports
(9861/9863/9876/9894). The rule drops packets *destined* for scm1 silently — no
RST, no ICMP. From the DataNode's perspective: SYN sent, no response, kernel
TCP stack times out.

I confirmed the iptables-DROP mechanism works end-to-end with two test
containers before running the experiment:

```
--- Baseline ping ---
1 packets transmitted, 1 packets received, 0% packet loss

--- Installing DROP rule ---
Chain DOCKER-USER (1 references)
num pkts bytes target destination
1 0 0 DROP 192.168.97.2

--- After DROP rule ---
1 packets transmitted, 0 packets received, 100% packet loss
Ping elapsed: 5s

--- TCP connect test ---
nc: 192.168.97.2 (192.168.97.2:8080): Operation timed out
TCP connect elapsed: 6s
```

Silent timeout shape, not RST. This is what the AWS VPC data plane produces
during ENI churn / NLB target deregistration. The recipe in my previous comment
is now empirically verified to produce this shape on a laptop.

### What the OFF case (default flag, pre-PR behaviour) produced

With `ozone.client.failover.resolve-needed` defaulted to `false`, the DN
dialed scm1 successfully at startup, the iptables drop engaged, and the DN's
heartbeat retry loop wedged. Excerpts from `ozone-ha-datanode-1` after the drop
engaged (BREAK_AT in human terms = "the moment iptables started dropping
packets to scm1"):

```
20:41:41 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
0 time(s); maxRetries=45
20:42:01 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
1 time(s); maxRetries=45
20:42:21 WARN datanode.RunningDatanodeState: Detected timeout: Timeout
occurred on endpoint: scm1/192.168.97.6:9861
20:42:21 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
2 time(s); maxRetries=45
20:42:51 WARN datanode.RunningDatanodeState: Detected timeout: Timeout
occurred on endpoint: scm1/192.168.97.6:9861
20:43:01 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
4 time(s); maxRetries=45
20:43:21 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
5 time(s); maxRetries=45
20:43:41 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
6 time(s); maxRetries=45
20:44:01 Retrying connect to server: scm1/192.168.97.6:9861. Already tried
7 time(s); maxRetries=45
```

Two things to call out:

1. **The exception is `Timeout`, not `Connection refused`.** Note the
`Detected timeout: Timeout occurred on endpoint: scm1/...` log line at 20:42:21
and 20:42:51. This is the AWS-shape silent-timeout. By contrast, the same DN
logs `java.net.ConnectException: Connection refused` for SCM2 and SCM3 (which
are alive but in safe-mode rejecting heartbeats from a single-DN cluster) —
that's the local-laptop / OpenStack fast-fail shape. Same DN, same docker
bridge, but the iptables DROP rule produces the silent-timeout shape and the
live SCMs produce ConnectException. **This is the same regime distinction my
previous comment described, demonstrated in one log file.**

2. **No DNS re-resolution log appears.** The DN does not recover. It keeps
retrying against the cached `192.168.97.6:9861` indefinitely. This is the wedge
the PR is sold to fix. The 180-second observation window saw 7+ retry rounds
with no progress.

This empirically verifies the bug shape AND the regime distinction my
previous comment claimed.

### What the ON case (flag enabled) produced — partial verification, honest
gap

I re-ran the same experiment with
`ozone.client.failover.resolve-needed=true` and
`ozone.datanode.scm.heartbeat.address.refresh.threshold=2` injected via the
`docker-config` env file. Confirmed the flags reached the DN:

```
$ docker exec ozone-ha-datanode-1 cat /etc/hadoop/ozone-site.xml | grep -A 1
"resolve-needed\|refresh.threshold"

<property><name>ozone.client.failover.resolve-needed</name><value>true</value></property>

<property><name>ozone.datanode.scm.heartbeat.address.refresh.threshold</name><value>2</value></property>
```

Within the 180-second observation window, **the DN did NOT log a `DNS
re-resolution: SCM endpoint ... -> ...` event**. The retry pattern looked the
same as the OFF case for the duration I observed:

```
20:56:58 Retrying connect to server: scm1/192.168.97.5:9861. Already tried
0 time(s); maxRetries=45
20:57:18 Retrying connect to server: scm1/192.168.97.5:9861. Already tried
1 time(s); maxRetries=45
...
20:58:08 WARN datanode.RunningDatanodeState: Detected timeout: Timeout
occurred on endpoint: scm1/192.168.97.5:9861
...
```

I have not conclusively determined within this window whether the absence of
the refresh log is because:

- The trigger condition `rpcEndpoint.getMissedCount() >= refreshThreshold`
was not yet met within 180 seconds (the IPC client's inner retry budget at
`maxRetries=45 × 20s ≈ 900s` may eat connection failures internally before they
bubble out as IOException to `HeartbeatEndpointTask.call()`'s catch block,
which is what increments missedCount via `logIfNeeded`); or
- The hostname-string `getHostAndPort()` is null on this endpoint for some
reason (legacy path); or
- Something in the docker-compose setup means the flag is set but a related
plumbing constraint isn't met.

A longer observation window (say 20 minutes) is the obvious next step but
exceeds the time budget I had for this verification round. The unit tests in
the PR (`TestHeartbeatEndpointTaskDnsRefresh`) drive the catch-block trigger
end-to-end with mocked `sendHeartbeat` exceptions and DO observe the refresh
fire when `missedCount >= threshold` and the cause is connection-class. So the
trigger logic is unit-test-verified; my docker-level reproduction simply did
not exercise it within the 180-second window I gave it.

### What I'm retracting and what stands

**Retracting**:
- "Post-PR, with `ozone.client.failover.resolve-needed=true` ... recovery
happens automatically." This was an inference from the unit tests, not from
running the docker-compose experiment. I have unit-test evidence that the
trigger logic works; I do not have docker-compose evidence that recovery
completes within a heartbeat-cycle window in this specific rig.

**Standing (now empirically verified)**:
- The iptables-DROP recipe produces the AWS-shape silent-timeout, not RST.
- The DN's pre-PR retry loop wedges indefinitely against the cached IP under
that shape.
- The exception type the DN sees IS `Timeout occurred on endpoint`, distinct
from the `ConnectException: Connection refused` that healthy bridge networking
produces — which is exactly the regime distinction my previous comment claimed.

### Reproduction artifacts

For anyone running this locally, the script I used:

- Launches the `ozone-ha` compose stack (3 OMs, 3 SCMs, 1 DN).
- Captures scm1's bridge IP.
- Starts a privileged host-network sidecar that runs `iptables -I
DOCKER-USER 1 -d <scm1-ip> -p tcp --dport 9861 -j DROP` (and the same for ports
9863, 9876, 9894).
- Sleeps 180 seconds and then dumps the DN log for events of interest.

Caveat I want to flag: OrbStack reuses bridge IPs aggressively; my earlier
attempts to "rotate" scm1's IP via `docker network disconnect/connect` produced
the same IP on reconnect. The iptables-DROP-without-IP-rotation approach above
sidesteps that and is the cleanest local reproduction I have.

This whole comment is the empirical follow-up to the previous
reasoning-from-first-principles comment. The verification rule says I should
distinguish what I've run from what I've inferred; this comment does that.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

Reply via email to