[
https://issues.apache.org/jira/browse/HDDS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-15514:
----------------------------------
Labels: pull-request-available (was: )
> Datanode and OzoneManager fail to recover from SCM peer IP changes; cache
> stale InetSocketAddress for process lifetime
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HDDS-15514
> URL: https://issues.apache.org/jira/browse/HDDS-15514
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, Ozone Client, Ozone Datanode, SCM
> Affects Versions: 2.1.0
> Reporter: Ritesh Shukla
> Assignee: Ritesh Shukla
> Priority: Major
> Labels: pull-request-available
>
> h1. Problem
> In Kubernetes (and any environment where peer pod IPs may change while DNS
> names remain stable), Apache Ozone DataNodes and OzoneManagers can become
> permanently disconnected from SCM after an SCM peer pod is rescheduled to a
> new IP. The DataNode/OM process remains alive but its heartbeats and RPC
> calls keep dialing the now-defunct IP forever. The only known recoveries
> today are (a) restart the DataNode/OM process or (b) deploy an external
> operator that watches SCM pod IPs and force-restarts dependent components.
> This is the same class of bug that {{HADOOP-17068}} fixed for HDFS NameNode
> HA in Hadoop 3.4.0. The intent has been encoded in Hadoop common's
> {{SecurityUtil.getByName}} javadoc since 2.x:
> {quote}
> 4) ... if the host is re-resolved, ex. during a connection re-attempt, that a
> reverse lookup to host and forward lookup to IP is not performed since the
> reverse/forward mappings may not always return the same IP. {quote}
> But across Ozone's five inter-component RPC paths, that property does not
> hold today.
> h1. Failure modes observed in production
> The bug presents differently depending on what the network does with packets
> to the stale IP:
> h3. AWS EC2 / EKS (silent packet drop) — long stalls, no recovery
> When the cached IP belonged to an EC2 ENI that has since been released, the
> AWS VPC silently drops packets destined for that IP. The DataNode's TCP SYN
> sits in the kernel waiting for a SYN-ACK that never comes. Each connection
> attempt consumes the full {{ipc.client.connect.timeout}} (default 20s) plus
> {{ipc.client.connect.retry.count}} retries. The {{Client.updateAddress()}}
> re-resolution in HADOOP-17068 is gated on {{IOException}} from
> {{setupConnection}} — but on a silent drop, the exception only fires after
> the full timeout chain. Across 3 SCMs in HA round-robin, hours can pass with
> the DataNode process alive but completely decoupled from the cluster. Not a
> single fresh DNS query is made because the existing {{InetSocketAddress}} in
> {{EndpointStateMachine}} is never reconstructed.
> h3. OpenStack / on-prem with TCP RST or ICMP Unreachable — fast crash loop
> When the network actively rejects packets, the DataNode cycles through all
> tiseconds. Hitting the maximum failover retry limit across all proxies
> sequentially can cause the DataNode's heartbeat service to throw a fatal
> exception. On restart, the DataNode builds a fresh {{SCMConnectionManager}} which
> forces a fresh DNS resolution. So in this configuration, the bug self-heals
> at the cost of unscheduled DataNode crashes.
> In both cases, the underlying defect is the same: long-lived
> {{InetSocketAddress}} objects frozen at process startup are never rebuilt.
> h1. Root cause
> {{InetSocketAddress(host, port)}} performs a one-shot DNS lookup at
> construction and keeps the resolved IP for the object's lifetime. Apache
> Ozone's failover proxy providers and DataNode connection manager all
> construct {{InetSocketAddress}} objects once and hold them indefinitely as
> map keys, RPC proxy targets, and final fields:
> || Path || Owner of the frozen address || Where it's built ||
> | DN → SCM heartbeat | {{EndpointStateMachine.address}} (final) |
> {{InitDatanodeState.java:104}} → {{SCMConnectionManager.addSCMServer}} (lines
> 133-174) |
> | OM → SCM (block + container) | {{SCMProxyInfo.rpcAddr}} (final) |
> {{SCMFailoverProxyProviderBase.loadConfigs}} (line 148) |
> | Client → OM (Hadoop RPC) | {{OMProxyInfo.rpcAddr}} (final) |
> {{OMFailoverPadoopRpcOMFailoverProxyProvider.initOmProxiesFromConfigs}} |
> | OM ↔ OM control plane | (uses {{OMFailoverProxyProvider}} machinery — same
> shape as Client → OM) | — |
> | OM ↔ OM Ratis replication | {{RaftPeer.address}} (final String) — built
> fr.getInetAddress(), ratisPort)}} |
> {{OzoneManagerRatisServer.createRaftPeer}}(lines 438-451, 459-474) |
> | SCM ↔ SCM Ratis replication | {{RaftPeer.address}} (final String) —
> alreadost paths | {{SCMRatisServerImpl.buildRaftGroup}} (lines 399-414) |
> Notably, {{HDDS-5919}}'s {{ozone.network.jvm.address.cache.enabled=false}}
> only affects the JVM's positive DNS cache TTL, which would help future
> {{NetUtils.createSocketAddr}} calls. But the heartbeat/RPC path never makes
> such a call, because the {{InetSocketAddress}} is final on each
> {{EndpointStateMachine}}/{{SCMProxyInfo}}/{{OMProxyInfo}}.
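The one-shot resolution described in this section can be reproduced with the JDK alone. The following is a minimal sketch, not Ozone code: the class {{FrozenAddressDemo}} and its three helpers are hypothetical names chosen for illustration. It contrasts the eager constructor, which resolves inside the constructor call, with {{InetSocketAddress.createUnresolved}}, which keeps only the hostname, and shows that the only way to pick up a new IP is to build a new object from the preserved host string.

```java
import java.net.InetSocketAddress;

/** Hypothetical demo class for illustration; not part of Ozone or Hadoop. */
final class FrozenAddressDemo {

  /** Eager form: the hostname is looked up once, inside the constructor. */
  static InetSocketAddress eager(String host, int port) {
    return new InetSocketAddress(host, port);
  }

  /** Lazy form: only the hostname is stored and no lookup is attempted. */
  static InetSocketAddress lazy(String host, int port) {
    return InetSocketAddress.createUnresolved(host, port);
  }

  /**
   * An existing InetSocketAddress never re-resolves itself, so a refresh
   * means constructing a new object from the preserved host string.
   */
  static InetSocketAddress reResolve(InetSocketAddress old) {
    return new InetSocketAddress(old.getHostString(), old.getPort());
  }
}
```

An object returned by {{eager}} carries whatever IP DNS returned at that moment, which is why holding it in a final field for the process lifetime pins the process to that IP.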
> h1. Proposed fix
> Mirror the HADOOP-17068 design pattern at the {{FailoverProxyProvider}} /
> connection-manager layer in Ozone (one tier above where Hadoop applied the
> fix, because Ozone's seams live there). On each connection-class failure:
> # Re-resolve the configured hostname via {{NetUtils.createSocketAddr}}.
> # Compare the new resolved IP against the cached
> {{InetSocketAddress.getAddress()}}.
> # If changed: stop the cached RPC proxy, replace the cached address
> atomically, and let the next retry build a fresh proxy via the existing
> creation path.
> # If unchanged: fall through to existing retry behavior (no-op).
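The steps above can be sketched in plain Java. This is an illustration under stated assumptions, not the patch itself: {{RefreshableAddress}}, {{refreshIfChanged}} and {{ipv4}} are hypothetical names, the injected resolver function stands in for the real re-resolution call so the example runs without DNS, and stopping the stale RPC proxy is left out.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical holder, not an Ozone class: keeps the hostname next to the resolved address. */
final class RefreshableAddress {
  private final String host;   // preserved config-time hostname
  private final int port;
  private final Function<String, InetAddress> resolver; // assumption: stand-in for DNS
  private final AtomicReference<InetSocketAddress> cached;

  RefreshableAddress(String host, int port, Function<String, InetAddress> resolver) {
    this.host = host;
    this.port = port;
    this.resolver = resolver;
    this.cached = new AtomicReference<>(new InetSocketAddress(resolver.apply(host), port));
  }

  InetSocketAddress current() {
    return cached.get();
  }

  /**
   * Re-resolves the hostname and swaps the cached address only if the IP
   * changed. Returns true when a swap happened; the caller would then drop
   * its stale proxy so the next retry dials the new address.
   */
  boolean refreshIfChanged() {
    InetSocketAddress old = cached.get();
    InetAddress fresh = resolver.apply(host);
    if (fresh == null || fresh.equals(old.getAddress())) {
      return false; // unresolvable or unchanged: fall through to normal retry
    }
    return cached.compareAndSet(old, new InetSocketAddress(fresh, port));
  }

  /** Test helper: builds an IPv4 address from literal octets without any DNS lookup. */
  static InetAddress ipv4(int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }
}
```

Using {{compareAndSet}} means that if two failing threads race, only one of them reports a swap, so the stale proxy would be torn down once.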
> This requires preserving the original "host:port" config string alongside
> the {{InetSocketAddress}} (currently only the resolved address is kept). The
> implementation adds an opt-in config flag mirroring HBase's
> {{hbase.resolve.hostnames.on.failure}},
> {{ozone.client.failover.resolve-needed}}:
> {noformat}
> ozone.client.failover.resolve-needed = false (default)
> ozone.datanode.scm.heartbeat.address.refresh.threshold = 3 (default; DN-specific)
> {noformat}
> When the flag is false, behavior is byte-identical to current master.
> Operators of existing non-K8s deployments see zero behavior change. This
> matches the HBase / HADOOP-17068 precedent of requiring explicit operator
> opt-in.
> h2. Per-path summary
> || Path || Mechanism ||
> | DN → SCM | {{EndpointStateMachine}} preserves {{hostAndPort}} string.
> {{HeartbeatEndpointTask}} catch block calls {{maybeRefreshScmAddress}} when
> {{missedCount}} >= threshold. {{SCMConnectionManager.refreshSCMServer}} swaps
> the endpoint atomically; the new endpoint starts in the {{GETVERSION}}
> state, which is the correct behavior because a peer that has been
> rescheduled is effectively a fresh process. |
> | OM → SCM | {{SCMProxyInfo}} retains the config-time host:port.
> {{SCMFailovoxyAddressIfChanged(nodeId)}} runs in {{shouldRetry}} when an
> {{IOException}} chain contains {{ConnectException}},
> {{NoRouteToHostException}}, or {{UnknownHostException}}. Stale proxy is
> stopped via {{RPC.stopProxy}}. |
> | Client → OM (HadoopRPC) | {{OMProxyInfo.rpcAddr}} becomes mutable behind
> thAddressIfChanged()}} re-resolves {{rpcAddrStr}}, swaps {{rpcAddr}} and
> the derived {{dtService}}, nulls the cached proxy so the next
> {{createProxyIfNeeded}} dials the new IP.
> {{OMFailoverProxyProviderBase.shouldRetry}} calls this on connection-class
> exceptions before advancing the failover index. |
> | Client → OM (gRPC) | No code change required.
> {{GrpcOMFailoverProxyProvideetSocketAddress(0)}} and lets gRPC's
> {{NameResolver}} re-resolve hostnames on
> its own schedule. |
> | OM ↔ OM control plane | Uses Hadoop RPC via {{OMInterServiceProtocol}}, so
> it is covered via the Client → OM fix. |
> | OM ↔ OM Ratis replication | {{OzoneManagerRatisServer.createRaftPeer}}
> simply passes a hostname:port string to {{RaftPeer.setAddress}}, never a
> resolved IP. Two of
> three previous {{createRaftPeer}} branches were calling {{new
> InetSocketAddrratisPort)}}, which strips the hostname and freezes the IP.
> With hostname-only addresses, gRPC's default {{DnsNameResolver}} (used by
> Ratis u connection failure / on its own refresh schedule. No Ratis upstream
> change required. |
> | SCM ↔ SCM Ratis replication | Already uses hostname
> strings; removes a mis use IP instead of hostname??}} comment in
> {{SCMRatisServerImpl.buildRaftGroup}} and {{SCMHAManagerImpl}} and replaces
> h1. Connection-class exception filter
> The refresh path is gated on exception types where DNS re-resolution could
> plausibly help:
> * {{java.net.ConnectException}} — connection refused / unreachable
> * {{java.net.NoRouteToHostException}} — host route gone
> * {{java.net.UnknownHostException}} — DNS lookup failed downstream
> Filtering excludes application-level errors ({{OMNotLeaderException}},
> {{Ret{{AccessControlException}}) where SCM/OM is reachable on the cached IP
> and the failure is logical, not network. This avoids triggering DNS load on
> ever
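A filter of this kind can be sketched as a walk over the exception's cause chain. The class and method names below ({{ConnectionClassFilter}}, {{isConnectionClass}}) are hypothetical and the depth cap is an assumption added to guard against cyclic cause chains; only the three exception types come from the list above.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration; not the filter used in the patch. */
final class ConnectionClassFilter {
  // Assumption: cap the walk so a cyclic cause chain cannot loop forever.
  private static final int MAX_DEPTH = 32;

  /**
   * Returns true if the throwable or any of its causes is one of the
   * network-level exceptions for which a DNS re-resolution could help.
   */
  static boolean isConnectionClass(Throwable t) {
    int depth = 0;
    for (Throwable c = t; c != null && depth < MAX_DEPTH; c = c.getCause(), depth++) {
      if (c instanceof ConnectException
          || c instanceof NoRouteToHostException
          || c instanceof UnknownHostException) {
        return true;
      }
    }
    return false; // logical or application-level failure: keep the cached address
  }
}
```

Checking the whole chain matters because RPC layers commonly wrap the original socket exception in a more generic {{IOException}}.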
> h1. Testing
> 13 new unit tests + 1 real-RPC integration test, all passing under {{mvn
> clean test}} on the latest master:
> || Test class || Tests || What it covers ||
> | {{TestSCMConnectionManager}} | +5 new | {{resolveLatestAddress}} edge
> cases, {{refreshSCMServer}} happy-path swap, no-op when IP unchanged, no-op
> when host:port not preserved (legacy ctor path) |
> | {{TestSCMFailoverProxyProviderRefresh}} | 3 new | Swap on IP change,
> no-op when IP unchanged, no-op when {{hostAndPort}} not preserved |
> | {{TestOMProxyInfoDnsRefresh}} | 3 new | Address swap, dtService update,
> proxy null-out, proxy rebuild after refresh |
> | {{TestSCMConnectionManagerDnsRefreshE2E}} | 1 new | Real Hadoop RPC server
> (via {{ScmTestMock}}) on a real loopback socket. Connection manager primed
> with a deliberately stale {{127.0.0.99}} and preserved hostname
> {{localhost:port}}. {{refreshSCMServer}} fires; a real {{sendHeartbeat}}
> round-trips to the live server; {{ScmTestMock.rpcCount}} increments. Proves
> the full chain: address swap → fresh RPC proxy → real socket dial → s. |
> | {{TestOzoneManagerRatisServer}} | +1 new | Asserts
> {{RaftPeer.getAddress()}} is a hostname:port string, never an IP:port string.
> Defensive regex check that the host portion is not a numeric IPv4. |
> Existing regression tests covered by the same module set
> ({{TestSCMConnectionManager}} 1 prior, {{TestEndPoint}} 17,
> {{TestOMFailoverProxyProvider}} 8, {{TestOMFailovers}} 1,
> {{TestOzoneManagerRatisServer}} 5 prior): all green.
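The hostname-only assertion listed for {{TestOzoneManagerRatisServer}} could look roughly like the check below. This is a guess at the shape of such a check, not the test's actual code: {{RaftPeerAddressCheck}}, {{hostIsNumericIpv4}} and the regex are hypothetical, and the input is a plain string rather than a real {{RaftPeer}}.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; operates on plain host:port strings. */
final class RaftPeerAddressCheck {
  // Four dot-separated groups of 1 to 3 digits; a coarse "looks like IPv4" test.
  private static final Pattern IPV4 = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

  /** Returns true if the host portion of "host:port" is a numeric IPv4 literal. */
  static boolean hostIsNumericIpv4(String hostPort) {
    int colon = hostPort.lastIndexOf(':');
    String host = colon >= 0 ? hostPort.substring(0, colon) : hostPort;
    return IPV4.matcher(host).matches();
  }
}
```

A test would then assert that the method returns false for every peer address in the Raft group.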
> A docker-compose validation run with the {{ozone-ha}} stack confirmed the new
> code is wired into the runtime JAR ({{javap}} verified
> {{addSCMServer(InetSocketAddress, String, String)}} and
> {{refreshSCMServer(InetSocketAddress, String)}} are present in
> {{hdds-container-service-2.2.0-SNAPSHOT.jar}}), the config flag reaches the
> DN's process environment, and the cluster boots and processes writes
> successfully under the opt-in flag.
> h1. Scope and known limitations
> * The fix only fires from the {{HEARTBEAT}} phase via
> {{HeartbeatEndpointTask}}. If a DataNode starts up with the SCM peer already
> at a stale IP (DN never reaches {{HEARTBEAT}}), the recovery path does not
> engage. Initial-bringup DNS staleness is left to the existing
> {{ozone.network.jvm.address.cache.enabled=false}} setting.
> {{InitDatanodeState.java:94-101}} already postpones initialization on
> initial-resolution failure.
> * HDFS-14118-style construction-time DNS fan-out (one hostname → multiple
> persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out
> of scope here. Worth a follow-on JIRA if Ozone deployments need it.
> * The Ratis quorum-loss exit-0 issue ({{SCMStateMachine.close()}} calling {{
> when leader election fails to converge, leading to Kubernetes
> CrashLoopBackOff death spirals) is a separate concern. File as
> {{HDDS-XXXXX}} to exit non-zero so K8s' standard restart handling becomes
> correct again.
> h1. Suggested sub-task breakdown
> # {{HDDS-XXXX1}}: DN → SCM heartbeat -
> {{EndpointStateMachine}}/{{SCMConnectionManager}}/{{HeartbeatEndpointTask}}
> (smallest blast radius; lands first to prove the pattern)
> # {{HDDS-XXXX2}}: OM → SCM - {{SCMFailoverProxyProviderBase}}/{{SCMProxyInfo}}
> # {{HDDS-XXXX3}}: Client → OM -
> {{OMFailoverProxyProviderBase}}/{{OMProxyInfo}}
> # {{HDDS-XXXX4}}: Ratis hostnames-only -
> {{OzoneManagerRatisServer.createRaftPeer}}, {{SCMRatisServerImpl}},
> {{SCMHAManagerImpl}}, {{NodeDetails}} (smallest, lowest-risk; could land
> first to reduce surface)
> # {{HDDS-XXXX5}}: New config keys + {{ozone-default.xml}} entries (could fold
> into XXXX1)
>
>
> Each sub-task is independently testable and revertable.
> Umbrella ticket citeeview context.
> h1. References
> * {{HADOOP-17068}}: client fails forever when namenode ipaddr changed (Hadoop
> 3.4.0). Commit {{fa14e4bc001e28d9912e8d985d09bab75aedb87c}}. Authors: Sean
> Chow, He Xiaoqiao. Touches {{Client.setupConnection}} only.
> * {{HDFS-14118}}: introduces {{dfs.client.failover.resolve-needed}} and the
> {{AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary}} hook
> (different shape: construction-time fan-out, not per-failure
> refresh).
> * {{HBASE}}: {{hbase.resolve.hostnames.on.failure}}
> ({{ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY}}) — same opt-in
> design philosophy.
> * {{ZOOKEEPER-1506}},
> {{ZOOKEEPER-2982}}: ZooKeeper {{StaticHostProvider}} —}} call. The cleanest
> design in the Hadoop-adjacent ecosystem; closer to "always on" than opt-in.
> * {{HDDS-5919}}: introduces {{ozone.network.jvm.address.cache.enabled}} (defe
> JVM-level positive DNS cache TTL but does not fix the
> long-lived {{InetSocketAddress}} instances.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)