kerneltime commented on PR #10473: URL: https://github.com/apache/ozone/pull/10473#issuecomment-4663210910
## Why this manifests on AWS EC2 / EKS but is hard to reproduce on a local laptop or OpenStack

Reviewers running this in a local docker-compose setup, in minikube, or in an OpenStack / Calico-style on-prem K8s cluster will probably observe that an SCM pod restart self-heals on the next heartbeat without this PR's changes. That observation is correct for those environments, but it is not a reason to question the fix.

The bug only manifests when the network plane drops packets *silently* on a defunct destination IP. The differences between environments come down to what the kernel does when a TCP SYN is sent to an IP whose owner has gone away.

### The three regimes

**1. Local laptop / single-host docker / minikube: kernel-level RST or ICMP unreachable**

When an SCM container is rescheduled, the old container's IP is released. The next dial against that IP traverses a local bridge / loopback path. The kernel's networking stack knows there is no listener at that IP:port, no route to that IP, or no neighbor table entry, and immediately returns either:

- `ECONNREFUSED` (RST from the kernel because nothing is listening), or
- `EHOSTUNREACH` / `ENETUNREACH` (no route).

Java surfaces these as `ConnectException` or `NoRouteToHostException`. Hadoop's vendored IPC `Client.updateAddress()` (added by HADOOP-17068) fires on these exception classes and re-resolves the hostname on the next retry. **Self-heals on the next heartbeat. Bug does not manifest.**

**2. OpenStack / on-prem K8s with Calico-style L3 routing: fast-fail with the same shape**

When a Pod dies on Calico (or a similar pure-L3 CNI), Calico immediately removes the BGP route advertising that pod IP. Packets to the now-dead IP either get dropped at the first hop with ICMP unreachable or, more often, hit a fallback route that returns RST. Either way the client sees `ConnectException` or `NoRouteToHostException` within milliseconds.
Hadoop's `Client.updateAddress()` fires, the hostname is re-resolved, and the next dial hits the new IP. **Self-heals. Bug does not manifest.**

This is the regime most committers test in. It's the regime the docker-compose `ozone-ha` stack reproduces (I tested this PR's fix end-to-end in that stack with `docker network disconnect/connect` to rotate IPs, and the pre-PR retry loop did self-heal there).

**3. AWS EC2 / EKS with VPC-native ENI networking: silent packet drop**

VPC-native EKS uses the AWS VPC CNI (`amazon-vpc-cni-k8s`), which assigns each pod a real VPC ENI secondary IP. When a pod dies and is rescheduled to a new IP, **the dead IP enters a transitional state where packets to it are silently dropped** for any combination of the following reasons (which compound):

- The old IP's ENI may be in the process of being unbound from the node. The VPC route table still claims the IP belongs to that ENI, but the ENI is no longer accepting traffic. AWS's data plane drops these packets without sending RST or ICMP.
- The old IP may have been reassigned to a different pod whose security group / NetworkPolicy denies traffic from this client. Default-deny in AWS Security Groups is a silent drop, not a reject.
- During pod restart on the same node, kube-proxy's iptables `-j DROP` rule for the old endpoint can be active for a brief window before the new endpoint replaces it.
- L4 load balancers in the path (NLB, target groups) hold connections to the old target until target-deregistration-delay elapses (typically 30-300 seconds), and the in-flight packets to the deregistered target are silently dropped.

In every case, the client's TCP stack does not receive RST, ICMP, or any other feedback. The connection attempt sits in `SYN_SENT` until the configured connect timeout fires (typically `ipc.client.connect.timeout`, default 20 seconds, or the OS default of ~75-130 seconds). Java surfaces this as `SocketTimeoutException`.
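To make the difference between the fast-fail and silent-drop shapes concrete, here is a minimal Java sketch. It is not code from this PR or from Hadoop: `DialShapeProbe`, `DialShape`, and `probe` are hypothetical names used only for illustration. It dials an address with a connect timeout and reports which exception shape came back. The `main` method only exercises the connected and fast-fail paths on loopback; reaching the `SILENT_TIMEOUT` branch requires a destination that actually drops SYNs.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class DialShapeProbe {

  /** Hypothetical labels for this sketch; not part of Hadoop or Ozone. */
  enum DialShape { CONNECTED, FAST_FAIL, SILENT_TIMEOUT, OTHER }

  /** Dials the target once and classifies the outcome by exception type. */
  static DialShape probe(InetSocketAddress target, int timeoutMs) {
    try (Socket socket = new Socket()) {
      socket.connect(target, timeoutMs);
      return DialShape.CONNECTED;
    } catch (ConnectException | NoRouteToHostException e) {
      // RST or ICMP unreachable came back: regimes 1 and 2.
      return DialShape.FAST_FAIL;
    } catch (SocketTimeoutException e) {
      // No feedback before the connect timeout: regime 3.
      return DialShape.SILENT_TIMEOUT;
    } catch (IOException e) {
      return DialShape.OTHER;
    }
  }

  public static void main(String[] args) throws IOException {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    InetSocketAddress target;
    // Bind an ephemeral port, dial it, then close it and dial again.
    try (ServerSocket listener = new ServerSocket(0, 1, loopback)) {
      target = new InetSocketAddress(loopback, listener.getLocalPort());
      System.out.println("listener up:   " + probe(target, 2000));
    }
    System.out.println("listener gone: " + probe(target, 2000));
  }
}
```

On a typical Linux host I would expect the second probe to report the fast-fail shape, since the loopback stack answers the SYN with an RST. Pointing `probe` at an address behind an iptables `DROP` rule should instead land in the `SILENT_TIMEOUT` branch once the timeout elapses.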

**Hadoop's `Client.updateAddress()` is wired to fire on `ConnectException` and `NoRouteToHostException`, but NOT on `SocketTimeoutException`.** The IPC layer keeps retrying against the same cached `InetSocketAddress`, gets the same timeout, retries again, ad infinitum. The DataNode / OM / client process appears alive, the heartbeat thread appears alive, but every retry dials the dead IP with no recovery. **Bug manifests permanently** until the process is restarted.

### Why this PR's filter explicitly includes `SocketTimeoutException`

`org.apache.hadoop.hdds.utils.ConnectionFailureUtils.isConnectionFailure` matches `SocketTimeoutException` precisely because it is the dominant failure shape on AWS EC2 / EKS. Without this entry in the filter, the PR would only fix the "easy" cases (local docker, OpenStack, minikube) where Hadoop's IPC retry already self-heals, which is exactly the wrong subset, because those cases don't need fixing. The case that actually needs fixing is the silent-timeout case, and that's the one the filter has to catch.

### Reproduction recipes

| Environment | Reproduction | Bug manifests? | Hadoop's IPC retry self-heals? |
|---|---|---|---|
| Local docker-compose | `docker network disconnect; docker network connect` to rotate IP | Pre-PR: dialing old IP fails fast with `ConnectException`. | **Yes**: `Client.updateAddress` re-resolves. |
| Minikube / kind | `kubectl delete pod scm-0` | Pre-PR: same shape, fast-fail RST. | Yes. |
| OpenStack on-prem K8s with Calico | `kubectl delete pod scm-0` | Pre-PR: same shape, Calico withdraws the BGP route, fast-fail. | Yes. |
| **AWS EKS with VPC CNI** | `kubectl delete pod scm-0` followed by ENI churn | **Pre-PR: SocketTimeoutException loop, no recovery, DataNode wedged.** | **No**: `updateAddress` is not wired for `SocketTimeoutException`. |
| AWS EKS with NLB in front of SCM | NLB target deregistration | Pre-PR: 30-300s of silent drops, then either timeout (if NLB drops) or RST (if NLB rejects). | Sometimes, depending on NLB config. |

### How to reproduce the AWS-shape failure on a laptop without an actual AWS account

You cannot reproduce the AWS *infrastructure* on a laptop, but you can reproduce the *symptom* (silent packet drop instead of RST) by injecting an iptables DROP rule for the dead IP after the pod is rescheduled. The end-to-end test I used during PR development uses `iptables -A FORWARD -d <stale-ip> -j DROP` in a sidecar container to simulate the AWS silent-drop behaviour. Without the iptables drop, the docker-compose harness produces the regime-2 fast-fail shape, which (as noted) the pre-PR code already self-heals, and which is why I had to amend the test setup to actually exercise the bug being fixed.

### Why the bug was filed by AWS-EKS users specifically

Cloudera ran into this at customer sites running Ozone on AWS EKS. Local development never reproduced it because the local stack falls into regime 1 or 2. Reports from the field consistently described "DataNode appears alive, heartbeats appear to be running, but SCM doesn't see them and recovery only works after a manual DN restart", which is exactly the symptom of `SocketTimeoutException` retries against a stale `InetSocketAddress`. Confirming this required adding TRACE-level logging in `Client.java` to observe that `setupConnection` was looping on the same cached IP without `updateAddress` ever firing, which only happens when the exception is *not* `ConnectException`-shaped.

### Summary for reviewers

- **Local / OpenStack reviewers**: you will not reproduce the bug, because your network plane fails fast and Hadoop's existing retry self-heals. This does not mean the bug doesn't exist; it means your test environment is in regime 1 or 2.
- **AWS EKS reviewers**: this is the case the PR is sold on. Pre-PR, a pod restart wedges every dependent process.
  Post-PR, with `ozone.client.failover.resolve-needed=true` and `ozone.datanode.scm.heartbeat.address.refresh.threshold` (default 3, so ~90s with the default 30s heartbeat interval), recovery happens automatically.
- **Filter rationale**: `SocketTimeoutException` is the load-bearing entry. The other entries (`ConnectException`, `NoRouteToHostException`, `EOFException`, etc.) are belt-and-braces; they cover regimes where Hadoop's existing retry already works, but cost nothing to include and provide a uniform contract across environments.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
