Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

via GitHub Tue, 09 Jun 2026 12:18:59 -0700


kerneltime commented on PR #10473:
URL: https://github.com/apache/ozone/pull/10473#issuecomment-4663210910


   ## Why this manifests on AWS EC2 / EKS but is hard to reproduce on a local laptop or OpenStack
   
   Reviewers running this in a local docker-compose setup, in minikube, or in 
an OpenStack / Calico-style on-prem K8s cluster will probably observe that an 
SCM pod restart self-heals on the next heartbeat without this PR's changes. 
That observation is correct for those environments, but it is not a reason to 
question the fix. The bug only manifests when the network plane *silently* 
drops packets sent to a defunct destination IP. The differences between 
environments come down to what the client sees when a TCP SYN is sent to an IP 
whose owner has gone away.
   
   ### The three regimes
   
   **1. Local laptop / single-host docker / minikube — kernel-level RST or ICMP 
unreachable**
   
   When an SCM container is rescheduled, the old container's IP is released. 
The next dial against that IP traverses a local bridge / loopback path. The 
kernel's networking stack knows there is no listener at that IP:port, no route 
to that IP, or no neighbor table entry — and immediately returns either:
   - `ECONNREFUSED` (RST from the kernel because nothing is listening), or
   - `EHOSTUNREACH` / `ENETUNREACH` (no route).
   
   Java surfaces these as `ConnectException` or `NoRouteToHostException`. 
Hadoop's vendored IPC `Client.updateAddress()` (added by HADOOP-17068) fires on 
these exception classes and re-resolves the hostname on the next retry. 
**Self-heals on the next heartbeat. Bug does not manifest.**
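   
   For readers less familiar with the mechanics: `java.net.InetSocketAddress` 
resolves the hostname once, at construction, and is immutable afterwards, so 
"re-resolving" means building a new instance from the original host string. A 
minimal sketch of that step is below. The class and method names are 
illustrative only; this is not Hadoop's `updateAddress()` and not code from 
this PR.

```java
import java.net.InetSocketAddress;

/** Illustrative helper; not part of Hadoop or Ozone. */
final class AddressRefreshSketch {
  private AddressRefreshSketch() { }

  /**
   * Builds a fresh InetSocketAddress from the host string and port of the
   * cached one, which triggers a new name lookup at construction time.
   */
  static InetSocketAddress reResolve(InetSocketAddress cached) {
    // getHostString() returns the original hostname (or the IP literal)
    // without attempting a reverse lookup.
    return new InetSocketAddress(cached.getHostString(), cached.getPort());
  }

  /** True if the fresh lookup produced a different, resolved address. */
  static boolean addressChanged(InetSocketAddress cached,
      InetSocketAddress fresh) {
    return !fresh.isUnresolved() && !fresh.equals(cached);
  }
}
```

   Note that the JVM's own lookup cache (`networkaddress.cache.ttl`) can still 
hand back the old IP for a while, so a fresh `InetSocketAddress` is necessary 
but may not be sufficient on its own.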
   
   **2. OpenStack / on-prem K8s with Calico-style L3 routing — fast-fail on the 
same shape**
   
   When a Pod dies on Calico (or similar pure-L3 CNI), Calico immediately 
removes the BGP route advertising that pod IP. Packets to the now-dead IP 
either get dropped at the first hop with ICMP unreachable, or — more often — 
hit a fallback route that returns RST. Either way the client sees 
`ConnectException` or `NoRouteToHostException` within milliseconds. Hadoop's 
`Client.updateAddress()` fires, the hostname is re-resolved, the next dial hits 
the new IP. **Self-heals. Bug does not manifest.**
   
   Regimes 1 and 2 are where most committers test. The docker-compose 
`ozone-ha` stack produces the same fast-fail shape (I tested this PR's fix 
end-to-end in that stack with `docker network disconnect/connect` to rotate 
IPs, and the pre-PR retry loop did self-heal there).
   
   **3. AWS EC2 / EKS with VPC-native ENI networking — silent packet drop**
   
   VPC-native EKS uses the AWS VPC CNI (`amazon-vpc-cni-k8s`), which assigns 
each pod a real VPC ENI secondary IP. When a pod dies and is rescheduled to a 
new IP, **the dead IP enters a transitional state where packets to it are 
silently dropped** for any combination of the following reasons (which 
compound):
   
   - The old IP's ENI may be in the process of being unbound from the node — 
the VPC route table still claims the IP belongs to that ENI, but the ENI is no 
longer accepting traffic. AWS's data plane drops these packets without sending 
RST or ICMP.
   - The old IP may have been reassigned to a different pod whose security 
group / NetworkPolicy denies traffic from this client. Default-deny in AWS 
Security Groups is a silent drop, not a reject.
   - During pod restart on the same node, kube-proxy's iptables `-j DROP` rule 
for the old endpoint can be active for a brief window before the new endpoint 
replaces it.
   - L4 load balancers in the path (NLB, target groups) hold connections to the 
old target until target-deregistration-delay elapses — typically 30-300 seconds 
— and the in-flight packets to the deregistered target are silently dropped.
   
   In every case, the client's TCP stack does not receive RST, ICMP, or any 
feedback. The connection attempt sits in `SYN_SENT` until the configured 
connect timeout fires (typically `ipc.client.connect.timeout`, default 20 
seconds, or the OS default of ~75-130 seconds). Java surfaces this as 
`SocketTimeoutException`.
   
   **Hadoop's `Client.updateAddress()` is wired to fire on `ConnectException` 
and `NoRouteToHostException` — but NOT on `SocketTimeoutException`.** The IPC 
layer keeps retrying against the same cached `InetSocketAddress`, gets the same 
timeout, and retries again indefinitely. The DataNode / OM / client process 
appears alive and the heartbeat thread appears alive, but every retry dials the 
dead IP and never recovers. **The bug persists** until the process is 
restarted.
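   
   The difference between the regimes can be shown with a toy model of the 
retry loop. This is a simulation of the behaviour described in this comment, 
not Hadoop's `Client` code; the class name, method names, and IPs are made up 
for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;

/** Toy model of the retry behaviour described above; not Hadoop's Client. */
final class StaleAddressRetryModel {
  private StaleAddressRetryModel() { }

  /** Pre-PR re-resolve trigger, as described in this comment. */
  static boolean preFixTriggersReResolve(IOException e) {
    return e instanceof ConnectException
        || e instanceof NoRouteToHostException;
  }

  /**
   * Simulates up to `attempts` dials. Every dial to `staleIp` fails with
   * `failure`; DNS already points at `freshIp`, which would answer.
   * Returns the IP used by the last dial.
   */
  static String lastDialedIp(String staleIp, String freshIp,
      IOException failure, int attempts) {
    String current = staleIp;
    String dialed = current;
    for (int i = 0; i < attempts; i++) {
      dialed = current;
      if (dialed.equals(freshIp)) {
        break; // the fresh IP answers, so the loop has recovered
      }
      // The dial to the stale IP failed with `failure`.
      if (preFixTriggersReResolve(failure)) {
        current = freshIp; // re-resolve before the next attempt
      }
    }
    return dialed;
  }
}
```

   With a `ConnectException` the model reaches the fresh IP on the second 
attempt. With a `SocketTimeoutException` it is still dialing the stale IP no 
matter how many attempts it is given.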
   
   ### Why this PR's filter explicitly includes `SocketTimeoutException`
   
   `org.apache.hadoop.hdds.utils.ConnectionFailureUtils.isConnectionFailure` 
matches `SocketTimeoutException` precisely because it is the dominant failure 
shape on AWS EC2 / EKS. Without this entry in the filter, the PR would only fix 
the "easy" cases (local docker, OpenStack, minikube) where Hadoop's IPC retry 
already self-heals — which is exactly the wrong subset, because those cases 
don't need fixing. The case that actually needs fixing is the silent-timeout 
case, and that's the one the filter has to catch.
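   
   For reviewers who want the shape of such a filter without opening the diff, 
here is a hypothetical sketch. It is not the PR's `ConnectionFailureUtils`; 
walking the cause chain and the depth cap are assumptions of this sketch, and 
only the exception types named in this comment are listed.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

/** Hypothetical stand-in for the PR's filter; the real code may differ. */
final class ConnectionFailureFilterSketch {
  private static final int MAX_CAUSE_DEPTH = 16;

  private ConnectionFailureFilterSketch() { }

  static boolean isConnectionFailure(Throwable t) {
    // Assumption: RPC layers often wrap the socket-level exception, so the
    // sketch inspects causes too. The depth cap guards against cyclic chains.
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH;
        cur = cur.getCause(), depth++) {
      if (cur instanceof SocketTimeoutException    // silent drop (regime 3)
          || cur instanceof ConnectException       // refused / RST
          || cur instanceof NoRouteToHostException // no route
          || cur instanceof EOFException) {        // stream ended early
        return true;
      }
    }
    return false;
  }
}
```

   The point of the sketch is the first branch: dropping 
`SocketTimeoutException` from the list leaves only the cases that already 
recover on their own.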
   
   ### Reproduction recipes
   
   | Environment | Reproduction | Pre-PR behaviour | Hadoop's IPC retry self-heals? |
   |---|---|---|---|
   | Local docker-compose | `docker network disconnect; docker network connect` to rotate IP | Dialing the old IP fails fast with `ConnectException`. | **Yes**: `Client.updateAddress` re-resolves. |
   | Minikube / kind | `kubectl delete pod scm-0` | Same shape: fast-fail RST. | Yes. |
   | OpenStack on-prem K8s with Calico | `kubectl delete pod scm-0` | Same shape: Calico withdraws the BGP route, fast-fail. | Yes. |
   | **AWS EKS with VPC CNI** | `kubectl delete pod scm-0` followed by ENI churn | **`SocketTimeoutException` loop, no recovery, DataNode wedged.** | **No**: `updateAddress` is not wired for `SocketTimeoutException`. |
   | AWS EKS with NLB in front of SCM | NLB target deregistration | 30-300s of silent drops, then either timeout (if NLB drops) or RST (if NLB rejects). | Sometimes, depending on NLB config. |
   
   ### How to reproduce the AWS-shape failure on a laptop without an actual AWS account
   
   You cannot reproduce the AWS *infrastructure* on a laptop, but you can 
reproduce the *symptom* (silent packet drop instead of RST) by injecting an 
iptables DROP rule for the dead IP after the pod is rescheduled. The end-to-end 
test I used during PR development uses `iptables -A FORWARD -d <stale-ip> -j 
DROP` in a sidecar container to simulate the AWS silent-drop behaviour. Without 
the iptables drop, the docker-compose harness produces the fast-fail shape of 
regimes 1 and 2, which (as noted) the pre-PR code already self-heals. That is 
why I had to amend the test setup to actually exercise the bug being fixed.
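   
   To check that a DROP rule really produces the silent-drop shape rather than 
a refusal, a small probe along these lines can be run from the client 
container. It is an illustrative sketch, not part of the PR's test harness; the 
class name and the outcome labels are invented here. The explicit connect 
timeout matters, because `Socket.connect(SocketAddress, int)` is the call 
documented to throw `SocketTimeoutException` when the timeout expires.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Illustrative probe; not part of the PR's test harness. */
final class DialShapeProbe {
  private DialShapeProbe() { }

  /** Maps a connect failure to the regime shape it indicates. */
  static String classify(IOException e) {
    if (e instanceof SocketTimeoutException) {
      return "silent-drop"; // no RST or ICMP came back before the timeout
    }
    if (e instanceof ConnectException
        || e instanceof NoRouteToHostException) {
      return "fast-fail"; // the network answered with a refusal
    }
    return "other";
  }

  /** Dials ip:port once with an explicit connect timeout. */
  static String probe(String ip, int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(ip, port), timeoutMillis);
      return "connected";
    } catch (IOException e) {
      return classify(e);
    }
  }
}
```

   Calling `DialShapeProbe.probe(staleIp, scmPort, 5000)` (both arguments are 
placeholders) should report `fast-fail` before the DROP rule is installed and 
`silent-drop` after it.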
   
   ### Why the bug was filed by AWS-EKS users specifically
   
   Cloudera ran into this at customer sites running Ozone on AWS EKS. Local 
development never reproduced it because the local stack falls into regime 1 or 
2. Reports from the field consistently described "DataNode appears alive, 
heartbeats appear to be running, but SCM doesn't see them and recovery only 
works after a manual DN restart" — which is exactly the symptom of 
`SocketTimeoutException` retries against a stale `InetSocketAddress`. 
Confirming this required adding TRACE-level logging in `Client.java` to observe 
that `setupConnection` was looping on the same cached IP without 
`updateAddress` ever firing, which only happens when the exception is *not* 
`ConnectException`-shaped.
   
   ### Summary for reviewers
   
   - **Local / OpenStack reviewers**: you will not reproduce the bug, because 
your network plane fails fast and Hadoop's existing retry self-heals. This does 
not mean the bug doesn't exist; it means your test environment is in regime 1 
or 2.
   - **AWS EKS reviewers**: this is the case the PR is sold on. Pre-PR, a pod 
restart wedges every dependent process. Post-PR, with 
`ozone.client.failover.resolve-needed=true` and 
`ozone.datanode.scm.heartbeat.address.refresh.threshold` (default 3, so ~90s 
with the default 30s heartbeat interval), recovery happens automatically.
   - **Filter rationale**: `SocketTimeoutException` is the load-bearing entry. 
The other entries (`ConnectException`, `NoRouteToHostException`, `EOFException`, etc.) 
are belt-and-braces; they cover regimes where Hadoop's existing retry already 
works, but cost nothing to include and provide a uniform contract across 
environments.
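   
   The ~90s figure in the AWS EKS bullet is the threshold multiplied by the 
heartbeat interval. A hypothetical model of a consecutive-failure threshold is 
sketched below to make that arithmetic concrete. The class and its methods are 
invented for illustration and are not the PR's implementation; the estimate 
also ignores time spent waiting on connect timeouts.

```java
/** Hypothetical model of a refresh threshold; not the PR's implementation. */
final class HeartbeatRefreshThresholdSketch {
  private final int threshold;
  private int consecutiveFailures;

  HeartbeatRefreshThresholdSketch(int threshold) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.threshold = threshold;
  }

  /** Records a failed heartbeat; returns true when a re-resolve is due. */
  boolean onFailure() {
    consecutiveFailures++;
    if (consecutiveFailures >= threshold) {
      consecutiveFailures = 0; // start counting again after the refresh
      return true;
    }
    return false;
  }

  /** A successful heartbeat clears the failure streak. */
  void onSuccess() {
    consecutiveFailures = 0;
  }

  /** Rough estimate of time to refresh: threshold * heartbeat interval. */
  static long approxSecondsToRefresh(int threshold,
      long heartbeatIntervalSeconds) {
    return threshold * heartbeatIntervalSeconds;
  }
}
```

   With the defaults quoted above, `approxSecondsToRefresh(3, 30)` gives 90.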
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


