[jira] [Updated] (HDDS-15514) Datanode and OzoneManager fail to recover from SCM peer IP changes; cache stale InetSocketAddress for process lifetime

ASF GitHub Bot (Jira) Tue, 09 Jun 2026 00:50:07 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HDDS-15514:
----------------------------------
    Labels: pull-request-available  (was: )

> Datanode and OzoneManager fail to recover from SCM peer IP changes; cache 
> stale InetSocketAddress for process lifetime
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-15514
>                 URL: https://issues.apache.org/jira/browse/HDDS-15514
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, Ozone Client, Ozone Datanode, SCM
>    Affects Versions: 2.1.0
>            Reporter: Ritesh Shukla
>            Assignee: Ritesh Shukla
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Problem
> In Kubernetes (and any environment where peer pod IPs may change while DNS 
> names remain stable), Apache Ozone DataNodes and OzoneManagers can become 
> permanently disconnected from SCM after an SCM peer pod is rescheduled to a 
> new IP. The DataNode/OM process remains alive but its heartbeats and RPC 
> calls keep dialing the now-defunct IP forever. The only known recoveries 
> today are (a) restart the DataNode/OM process or (b) deploy an external 
> operator that watches SCM pod IPs and force-restarts dependent components.
> This is the same class of bug that {{HADOOP-17068}} fixed for HDFS NameNode 
> HA in Hadoop 3.4.0. The intent has been encoded in Hadoop common's 
> {{SecurityUtil.getByName}} javadoc since 2.x:
> {quote}
> 4) ... if the host is re-resolved, ex. during a connection re-attempt, that a 
> reverse lookup to host and forward lookup to IP is not performed since the 
> reverse/forward mappings may not always return the same IP. {quote}
> But across Ozone's five inter-component RPC paths, that property does not 
> hold today.
> h1. Failure modes observed in production
> The bug presents differently depending on what the network does with packets 
> to the stale IP:
> h3. AWS EC2 / EKS (silent packet drop) — long stalls, no recovery
> When the cached IP belonged to an EC2 ENI that has since been released, the 
> AWS VPC silently drops packets destined for that IP. The DataNode's TCP SYN 
> sits in the kernel waiting for a SYN-ACK that never comes. Each connection 
> attempt consumes the full {{ipc.client.connect.timeout}} (default 20s) plus 
> {{ipc.client.connect.retry.count}} retries. The {{Client.updateAddress()}} 
> re-resolution in HADOOP-17068 is gated on {{IOException}} from 
> {{setupConnection}} — but on a silent drop, the exception only fires after 
> the full timeout chain. Across 3 SCMs in HA round-robin, hours can pass with 
> the DataNode process alive but completely decoupled from the cluster. Not a 
> single fresh DNS query is made because the existing {{InetSocketAddress}} in 
> {{EndpointStateMachine}} is never reconstructed.
> h3. OpenStack / on-prem with TCP RST or ICMP Unreachable — fast crash loop
> When the network actively rejects packets, the DataNode cycles through all 
> tiseconds. Hitting the maximum failover retry limit across all proxies 
> sequentially can cause the DataNode's heartbeat service to throw a fatal 
> excn. On restart, the DataNode builds a fresh {{SCMConnectionManager}} which 
> forces a fresh DNS resolution. So in this configuration, the bug self-heals  
> at the cost of unscheduled DataNode crashes.
> In both cases, the underlying defect is the same: long-lived 
> {{InetSocketAddress}} instances frozen at process startup are never rebuilt.
> h1. Root cause
> {{InetSocketAddress(host, port)}} performs a one-shot DNS lookup at 
> construction and keeps the resolved IP for the object's lifetime. Apache 
> Ozone's failover proxy providers and DataNode connection manager all 
> construct {{InetSocketAddress}} objects once and hold them indefinitely as 
> map keys, RPC proxy targets, and final fields:
> || Path || Owner of the frozen address || Where it's built ||
> | DN → SCM heartbeat | {{EndpointStateMachine.address}} (final) | 
> {{InitDatanodeState.java:104}} → {{SCMConnectionManager.addSCMServer}} (lines 
> 133-174) |
> | OM → SCM (block + container) | {{SCMProxyInfo.rpcAddr}} (final) | 
> {{SCMFailoverProxyProviderBase.loadConfigs}} (line 148) |
> | Client → OM (Hadoop RPC) | {{OMProxyInfo.rpcAddr}} (final) | 
> {{OMFailoverPadoopRpcOMFailoverProxyProvider.initOmProxiesFromConfigs}} |
> | OM ↔ OM control plane | (uses {{OMFailoverProxyProvider}} machinery — same 
> shape as Client → OM) | — |
> | OM ↔ OM Ratis replication | {{RaftPeer.address}} (final String) — built 
> fr.getInetAddress(), ratisPort)}} | 
> {{OzoneManagerRatisServer.createRaftPeer}}(lines 438-451, 459-474) |
> | SCM ↔ SCM Ratis replication | {{RaftPeer.address}} (final String) — 
> alreadost paths | {{SCMRatisServerImpl.buildRaftGroup}} (lines 399-414) |
> Notably, {{HDDS-5919}}'s {{ozone.network.jvm.address.cache.enabled=false}} ot 
> only affects the JVM's positive DNS cache TTL, which would help future
> {{NetUtils.createSocketAddr}} calls. But the heartbeat/RPC path never makes 
> ress}} is final on each
> {{EndpointStateMachine}}/{{SCMProxyInfo}}/{{OMProxyInfo}}.
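
The one-shot behavior described under "Root cause" can be seen with the JDK alone. The following is a minimal sketch, not code from the patch: the class name {{FrozenAddressDemo}} and port 9861 are illustrative assumptions. It shows that a resolved {{InetSocketAddress}} carries a fixed IP, and that a new lookup only happens when a new object is built from the host string (that lookup may still be served from the JVM's own DNS cache, which is the part HDDS-5919 addresses).

```java
import java.net.InetSocketAddress;

/** Hypothetical demo class; illustrates JDK behavior only, not Ozone code. */
final class FrozenAddressDemo {

  /** True when the object already holds a resolved IP (lookup done at construction). */
  static boolean isFrozen(InetSocketAddress addr) {
    return !addr.isUnresolved();
  }

  /**
   * Builds a new address object from the same host text and port.
   * getHostString() returns the host text without triggering a reverse lookup.
   */
  static InetSocketAddress reResolve(InetSocketAddress old) {
    return new InetSocketAddress(old.getHostString(), old.getPort());
  }

  public static void main(String[] args) {
    // Resolved at construction: the IP inside this object never changes afterwards.
    InetSocketAddress resolved = new InetSocketAddress("localhost", 9861);
    // Unresolved: keeps only the hostname; whoever connects must resolve it.
    InetSocketAddress lazy = InetSocketAddress.createUnresolved("localhost", 9861);

    System.out.println("resolved frozen? " + isFrozen(resolved));
    System.out.println("lazy frozen?     " + isFrozen(lazy));
    System.out.println("re-resolved:     " + reResolve(resolved));
  }
}
```

Holding a reference to {{resolved}} in a final field for the life of the process is the pattern the table above lists for each path.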
> h1. Proposed fix
> Mirror the HADOOP-17068 design pattern at the {{FailoverProxyProvider}} / 
> {{in Ozone (one tier above where Hadoop applied the fix, because Ozone's 
> seams live there). On each connection-class failure:
> # Re-resolve the configured hostname via {{NetUtils.createSocketAddr(hostnam
> # Compare the new resolved IP against the cached 
> {{InetSocketAddress.getAddress()}}.
> # If changed: stop the cached RPC proxy, replace the cached address 
> atomically, and let the next retry build a fresh proxy via the existing creation path.
> # If unchanged: fall through to existing retry behavior (no-op).
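
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: {{RefreshableAddress}}, {{refreshIfChanged}} and {{defaultResolve}} are hypothetical names, the resolver is injected so the logic can be exercised without real DNS, and stopping the stale RPC proxy (step 3) is left to the caller.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical holder that keeps the config-time "host:port" next to the resolved address. */
final class RefreshableAddress {
  private final String hostAndPort;                           // original config string
  private final Function<String, InetSocketAddress> resolver; // injected for testability
  private final AtomicReference<InetSocketAddress> current;

  RefreshableAddress(String hostAndPort, Function<String, InetSocketAddress> resolver) {
    this.hostAndPort = hostAndPort;
    this.resolver = resolver;
    this.current = new AtomicReference<>(resolver.apply(hostAndPort));
  }

  InetSocketAddress get() {
    return current.get();
  }

  /**
   * Re-resolves the configured host and swaps the cached address if the IP changed.
   * Returns true only when a swap happened; the caller would then stop the stale proxy.
   */
  boolean refreshIfChanged() {
    InetSocketAddress cached = current.get();
    InetSocketAddress fresh = resolver.apply(hostAndPort);      // step 1: re-resolve
    if (fresh == null || fresh.isUnresolved()) {
      return false;                                             // DNS failed: keep old address
    }
    if (fresh.getAddress().equals(cached.getAddress())) {       // step 2: compare IPs
      return false;                                             // step 4: unchanged, no-op
    }
    return current.compareAndSet(cached, fresh);                // step 3: atomic swap
  }

  /** Plain JDK resolution of a "host:port" string (no IPv6-literal handling in this sketch). */
  static InetSocketAddress defaultResolve(String hostAndPort) {
    int idx = hostAndPort.lastIndexOf(':');
    String host = hostAndPort.substring(0, idx);
    int port = Integer.parseInt(hostAndPort.substring(idx + 1));
    return new InetSocketAddress(host, port);
  }
}
```

A caller would construct it as {{new RefreshableAddress("scm-0.example:9861", RefreshableAddress::defaultResolve)}} (hostname made up for the example) and invoke {{refreshIfChanged()}} from its connection-failure path.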
> This requires preserving the original "host:port" config string alongside 
> ths}} (currently only the resolved address is kept). The implementation adds 
> an opt-in config flag mirroring HBase's 
> {{hbase.resolve.hostnames.on.failure}} nt.failover.resolve-needed}}:
> {noformat} ozone.client.failover.resolve-needed = false   (default) 
> ozone.datanode.scm.heartbeat.address.refresh.threshold = 3   (default; DN-sp 
> {noformat}
> When the flag is false, behavior is byte-identical to current master. Operat 
> existing non-K8s deployments see zero behavior change. This matches the HBase 
> / HADOOP-17068 precedent of requiring explicit operator opt-in for the
> h2. Per-path summary
> || Path || Mechanism ||
> | DN → SCM | {{EndpointStateMachine}} preserves {{hostAndPort}} string. 
> {{HeartbeatEndpointTask}} catch block calls {{maybeRefreshScmAddress}} when 
> {{missedCount}} >= threshold. {{SCMConnectionManager.refreshSCMServer}} swaps 
> the endpoint atomically; theVERSION}} state, which is the correct behavior 
> because a peer that has been rescheduled is effectively a fresh process. |
> | OM → SCM | {{SCMProxyInfo}} retains the config-time host:port. 
> {{SCMFailovoxyAddressIfChanged(nodeId)}} runs in {{shouldRetry}} when an 
> {{IOException}} chain contains {{ConnectException}}, 
> {{NoRouteToHostException}}, or {{UnknownHostException}}. Stale proxy is 
> stopped via {{RPC.stopProxy}}. |
> | Client → OM (HadoopRPC) | {{OMProxyInfo.rpcAddr}} becomes mutable behind 
> thAddressIfChanged()}} re-resolves {{rpcAddrStr}}, swaps {{rpcAddr}} and 
> the derived {{dtService}}, nulls the cached proxy so the next 
> {{createProxyIfNeeded}} dials the new IP. 
> {{OMFailoverProxyProviderBase.shouldRetry}} calls this on connection-class
> exceptions before advancing the failover index. |
> | Client → OM (gRPC) | No code change required. 
> {{GrpcOMFailoverProxyProvideetSocketAddress(0)}} and lets gRPC's 
> {{NameResolver}} re-resolve hostnames on
> its own schedule. |
> | OM ↔ OM control plane | Uses Hadoop RPC via {{OMInterServiceProtocol}}, noy 
> via the Client → OM fix. |
> | OM ↔ OM Ratis replication | {{OzoneManagerRatisServer.createRaftPeer}} 
> simname:port string to {{RaftPeer.setAddress}} — never a resolved IP. Two of 
> three previous {{createRaftPeer}} branches were calling {{new 
> InetSocketAddrratisPort)}}, which strips the hostname and freezes the IP. 
> With hostname-only addresses, gRPC's default {{DnsNameResolver}} (used by 
> Ratis u connection failure / on its own refresh schedule. No Ratis upstream 
> change required. |
> | SCM ↔ SCM Ratis replication | Already uses hostname 
> strings; removes a mis use IP instead of hostname??}} comment in 
> {{SCMRatisServerImpl.buildRaftGroup}} and {{SCMHAManagerImpl}} and replaces
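
For the DN → SCM row, the "{{missedCount}} >= threshold" gate could look something like the sketch below. It is an assumption-laden illustration: {{MissedHeartbeatGate}} is a hypothetical class, and only the comparison itself and the default threshold of 3 come from the description above.

```java
/** Hypothetical counter that decides when a heartbeat failure should trigger a DNS refresh. */
final class MissedHeartbeatGate {
  private final int threshold;
  private int missed;

  MissedHeartbeatGate(int threshold) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.threshold = threshold;
  }

  /** Records one failed heartbeat; returns true once the miss count reaches the threshold. */
  synchronized boolean onFailure() {
    missed++;
    return missed >= threshold;
  }

  /** A successful heartbeat resets the count. */
  synchronized void onSuccess() {
    missed = 0;
  }

  synchronized int missedCount() {
    return missed;
  }
}
```

With a threshold of 3, the first two failures fall through to the existing retry behavior and the third is the first to attempt a refresh.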
> h1. Connection-class exception filter
> The refresh path is gated on exception types where DNS re-resolution could p
> * {{java.net.ConnectException}} — connection refused / unreachable
> * {{java.net.NoRouteToHostException}} — host route gone
> * {{java.net.UnknownHostException}} — DNS lookup failed downstream
> Filtering excludes application-level errors ({{OMNotLeaderException}}, 
> {{Ret{{AccessControlException}}) where SCM/OM is reachable on the cached IP 
> and the failure is logical, not network. This avoids triggering DNS load on 
> ever
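
One way to express this filter is to walk the exception's cause chain and look for the three types listed above. The sketch below uses only JDK classes; the class and method names are hypothetical, and the depth cap is an added safeguard against cyclic cause chains rather than something stated in the issue.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

/** Hypothetical helper: decides whether a failure is worth a DNS re-resolution. */
final class ConnectionClassFilter {
  private static final int MAX_DEPTH = 32; // guard against cyclic cause chains

  /** Returns true if any exception in the cause chain is one of the connection-class types. */
  static boolean isConnectionClass(Throwable t) {
    int depth = 0;
    for (Throwable c = t; c != null && depth < MAX_DEPTH; c = c.getCause(), depth++) {
      if (c instanceof ConnectException
          || c instanceof NoRouteToHostException
          || c instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }
}
```

Application-level failures wrapped in a plain {{IOException}} return false here and continue down the existing retry path.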
> h1. Testing
> 13 new unit tests + 1 real-RPC integration test, all passing under {{mvn 
> clean test}} on the latest master:
> || Test class || Tests || What it covers ||
> | {{TestSCMConnectionManager}} | +5 new | {{resolveLatestAddress}} edge 
> cases, {{refreshSCMServer}} happy-path swap, no-op when IP unchanged, no-op 
> when host:port not preserved (legacy ctor path) |
> | {{TestSCMFailoverProxyProviderRefresh}} | 3 new | Swap on IP change, 
> no-op{hostAndPort}} not preserved |
> | {{TestOMProxyInfoDnsRefresh}} | 3 new | Address swap, dtService update, 
> proxy null-out, proxy rebuild after refresh |
> | {{TestSCMConnectionManagerDnsRefreshE2E}} | 1 new | Real Hadoop RPC server 
> (via {{ScmTestMock}}) on a real loopback socket. Connection manager primed 
> with a deliberately stale {{127.0.0.99}} and preserved hostname 
> {{localhost:port}}. {{refreshSCMServer}} fires; a real {{sendHeartbeat}} 
> round-trips to the live server; {{ScmTestMock.rpcCount}} increments. Proves 
> the full chain: address swap → fresh RPC proxy → real socket dial → s. |
> | {{TestOzoneManagerRatisServer}} | +1 new | Asserts 
> {{RaftPeer.getAddress()}} is a hostname:port string, never an IP:port string. 
> Defensive regex check that the host portion is not a numeric IPv4. |
> Existing regression tests covered by the same module set 
> ({{TestSCMConnectionManager}} 1 prior, {{TestEndPoint}} 17, 
> {{TestOMFailoverProxyProvider}} 8, {{TestOMFailovers}} 1, 
> {{TestOzoneManagerRatisServer}} 5 prior): all green.
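
The "host portion is not a numeric IPv4" assertion from the {{TestOzoneManagerRatisServer}} row could be approximated as below. This is a guess at the shape of such a check, with hypothetical names and example hostnames; the actual regex in the patch is not shown in this message.

```java
import java.util.regex.Pattern;

/** Hypothetical check that a Raft peer address keeps a hostname rather than a frozen IP. */
final class RaftAddressCheck {
  // Four dot-separated groups of 1 to 3 digits; deliberately loose, it only detects the shape.
  private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

  /** Builds the address from the configured hostname, never from a resolved InetAddress. */
  static String hostPort(String hostname, int port) {
    return hostname + ":" + port;
  }

  /** True if the host part of "host:port" looks like a numeric IPv4 literal. */
  static boolean hostIsNumericIpv4(String hostPort) {
    int idx = hostPort.lastIndexOf(':');
    String host = idx < 0 ? hostPort : hostPort.substring(0, idx);
    return IPV4.matcher(host).matches();
  }
}
```

A test in this style would assert that {{hostIsNumericIpv4}} is false for the string passed to {{RaftPeer}}.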
> A docker-compose validation run with the {{ozone-ha}} stack confirmed the new 
> code is wired into the runtime JAR ({{javap}} verified 
> {{addSCMServer(InetSocketAddress, String, String)}} and 
> {{refreshSCMServer(InetSocketAddress, String)}} are present in 
> {{hdds-container-service-2.2.0-SNAPSHOT.jar}}), the config flag reaches the 
> DN's process environment, and the cluster boots and processes writes 
> successfully under the opt-in flag.
> h1. Scope and known limitations
> * The fix only fires from the {{HEARTBEAT}} phase via 
> {{HeartbeatEndpointTask}}. If a DataNode starts up with the SCM peer already 
> at a stale IP (DN never reaches {{HEARTBEAT}}), the recovery path does not 
> engage. Initial-bringup DNS staleness is the 
> existing{ozone.network.jvm.address.cache.enabled=false}}. 
> {{InitDatanodeState.java:94-101}} already postpones initialization on 
> initial-resolution failure.
> * HDFS-14118-style construction-time DNS fan-out (one hostname → multiple 
> persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out 
> of scope here. Worth a follow-on JIRA if Ozone deployments need it.
> * The Ratis quorum-loss exit-0 issue ({{SCMStateMachine.close()}} calling {{ 
> when leader election fails to converge, leading to Kubernetes 
> CrashLoopBackOff death spirals) is a separate concern. File as 
> {{HDDS-XXXXX}it non-zero so K8s' standard restart handling becomes correct 
> again.
> h1. Suggested sub-task breakdown
> # {{HDDS-XXXX1}}: DN → SCM heartbeat — 
> {{EndpointStateMachine}}/{{SCMConnectionManager}}/{{HeartbeatEndpointTask}} 
> (smallest blast radius; lands first to prove the pattern)
> # {{HDDS-XXXX2}}: OM → SCM — {{SCMFailoverProxyProviderBase}}/{{SCMProxyInfo}}
> # {{HDDS-XXXX3}}: Client → OM — 
> {{OMFailoverProxyProviderBase}}/{{OMProxyInfo}}
> # {{HDDS-XXXX4}}: Ratis hostnames-only — 
> {{OzoneManagerRatisServer.createRaftPeer}}, {{SCMRatisServerImpl}}, 
> {{SCMHAManagerImpl}}, {{NodeDetails}} (smallest, lowest-risk; could land first to 
> reduce surface)
> # {{HDDS-XXXX5}}: New config keys + {{ozone-default.xml}} entries (could fold 
> into XXXX1)
> Each sub-task is independently testable and revertable. 
> Umbrella ticket citeeview context.
> h1. References
> * {{HADOOP-17068}}: client fails forever when namenode ipaddr changed (Hadoop 
> 3.4.0). Commit {{fa14e4bc001e28d9912e8d985d09bab75aedb87c}}. Authors: Sean 
> Chow, He Xiaoqiao. Touche{{Client.setupConnection}} only.
> * {{HDFS-14118}}: introduces {{dfs.client.failover.resolve-needed}} and the 
> {{AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary}} hook 
> (different shape: construction-time fan-out, not per-failure 
> refresh).
> * {{HBASE}}: {{hbase.resolve.hostnames.on.failure}} 
> ({{ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY}}) — same opt-in 
> design philosophy.
> * {{ZOOKEEPER-1506}}, 
> {{ZOOKEEPER-2982}}: ZooKeeper {{StaticHostProvider}} —}} call. The cleanest 
> design in the Hadoop-adjacent ecosystem; closer to "always on" than opt-in.
> * {{HDDS-5919}}: introduces {{ozone.network.jvm.address.cache.enabled}} (defe 
> JVM-level positive DNS cache TTL but does not fix the 
> long-lived {{InetSocketAddress}} instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
