[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882403#comment-17882403 ]
ASF GitHub Bot commented on YARN-11730:
---------------------------------------

arjunmohnot commented on PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2355769123

Hi @slfan1989, the changes have been made and the CI checks have passed; could you kindly review when you get a chance? Thank you for your time and support!

> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
> Key: YARN-11730
> URL: https://issues.apache.org/jira/browse/YARN-11730
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, yarn
> Affects Versions: 3.4.0
> Environment: Tested on multiple environments:
> A. Docker environment:
> * Base OS: *Ubuntu 20.04*
> * *Java 8* installed from OpenJDK.
> * Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
> * Verified behavior using a Hadoop snapshot in a containerized environment.
> * Performed Namenode formatting and validated service interactions through exposed ports.
> * Repo reference: [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]
> B. Bare-metal distributed setup (RedHat Linux):
> * Running *Java 8* in a High-Availability (HA) configuration with *Zookeeper* as the locking mechanism.
> * Two ResourceManagers (RMs) in HA: failover tested between the HA1 and HA2 RM nodes, including state retention and proper node state transitions.
> * Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
> Reporter: Arjun Mohnot
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the _"include"_ file are not reported until their corresponding NodeManagers (NMs) send their first heartbeat.
> However, nodes in the _"exclude"_ file are instantly reflected in the _"Decommissioned Hosts"_ section with a port value of -1.
> This design creates several challenges:
> * *Untracked NodeManagers*: During a ResourceManager HA failover or a standalone RM restart, some nodes may not report back even though they are listed in the _"include"_ file. These nodes neither appear in the _LOST_ state nor are they represented in the RM's JMX metrics. This leaves them in an untracked state, making it difficult to monitor their status. HDFS handles the analogous situation by marking such nodes as _DEAD_.
> * *Monitoring Gaps*: Nodes in the _"include"_ file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, since the RM's view of the total number of nodes lacks immediate visibility into these hosts.
> * *Operational Impact*: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA) and node recovery automation, where validation depends on nodes being reflected in JMX as _LOST_, _UNHEALTHY_, _DECOMMISSIONED_, etc. Nodes that do not report require hacky workarounds to determine their actual status.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the _LOST_ state, at RM startup or HA failover, to any node that is listed in the _"include"_ file but is neither registered nor part of the exclude file. This can be done by marking the node with a special port value of -2, signaling that the node is considered LOST but has not yet reported. Whenever a heartbeat is received for that {{nodeID}}, the node is transitioned from _LOST_ to _RUNNING_, _UNHEALTHY_, or any other required state.
> h3.
Key Implementation Points
> * Mark unreported nodes as LOST: nodes in the _"include"_ file that are not part of the RM's active node context should be automatically marked as _LOST_. This can be achieved by modifying the _NodesListManager_ in the {{refreshHostsReader}} method, which is invoked during failover or manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the _LOST_ state, with port -2 indicating that the node is untracked.
> * For non-HA setups, this process can be triggered during RM service startup to mark nodes as _LOST_ initially; they will gradually transition to their desired state once a heartbeat is received.
> * Handle node heartbeat and transition: when a node sends its first heartbeat, the system should check whether the node is listed in {{getInactiveRMNodes()}}. If the node exists there in the _LOST_ state, the RM should remove it from the inactive list, decrement the _LOST_ node count, and handle the transition back into the active node set.
> * This logic can be placed in the state transition method within {{RMNodeImpl.java}}, ensuring that nodes are transitioned from _NEW_ to _LOST_ and recover gracefully from the _LOST_ state upon receiving their heartbeat.
> h3. Benefits
> * *Improved Cluster Monitoring*: Automatically assigning a _LOST_ state to nodes that are listed in the _"include"_ file but not reporting ensures that every node in the cluster has a well-defined state (_ACTIVE_, _LOST_, _DECOMMISSIONED_, _UNHEALTHY_, etc.). This eliminates potential gaps in cluster node visibility and simplifies operational monitoring.
> * *Better Recovery Management*: By marking unreported nodes as _LOST_, automation can quickly identify which nodes require attention during recovery efforts to restore cluster health.
> This prevents confusion between unreachable nodes and untracked nodes, improving recovery accuracy.
> * *Enhanced Cluster Stability*: This approach improves overall stability by preventing nodes from slipping into an untracked or unknown state. It guarantees that the system remains aware of all nodes, reducing issues during RM failover or restart scenarios.
> h3. Additional Considerations
> * Feature flag control: this feature can be enabled/disabled via a configuration flag, allowing users to adjust the behavior based on their requirements. By default, it is set to _false_.
> * Validation: the approach has been tested on both non-HA and HA setups, and a dummy Docker-based [setup|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] has been created to replicate the behavior. The required unit test cases have been added to validate the code behavior. A demo [video|https://drive.google.com/file/d/1okiPe7uMNVMRUnNYtz-B8Igf8FMGr-SJ/view?usp=sharing] is available for this change.
>
> Any thoughts/suggestions/feedback are welcome!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
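[Editor's illustration] The refresh-and-mark and heartbeat-recovery mechanics proposed in the issue can be sketched in isolation. This is a hypothetical standalone model, not the actual patch in PR #7049: the class, method, and field names below are assumptions, and only the ideas themselves (the -1 decommissioned and proposed -2 untracked port sentinels, {{NodesListManager#refreshHostsReader}}-style marking, and heartbeat-driven recovery) come from the issue description.

```java
import java.util.*;

// Hypothetical model of the proposed behavior (names are illustrative,
// not the real Hadoop classes): on RM startup/refresh, every host that is
// in the include list but neither registered nor excluded is marked LOST
// with the sentinel port -2; a later heartbeat moves it back to active.
public class UnregisteredHostTracker {
    public enum NodeState { LOST, RUNNING }

    // Port -1 marks decommissioned hosts (existing convention);
    // -2 is the proposed sentinel for untracked/LOST hosts.
    public static final int UNTRACKED_PORT = -2;

    private final Map<String, NodeState> inactiveNodes = new HashMap<>();
    private final Map<String, Integer> ports = new HashMap<>();
    private final Set<String> activeNodes = new HashSet<>();

    // Mirrors the proposed addition to refreshHostsReader: mark each
    // include-listed, unregistered, non-excluded host as LOST.
    public void refreshHosts(Set<String> includeHosts, Set<String> excludeHosts) {
        for (String host : includeHosts) {
            if (!activeNodes.contains(host) && !excludeHosts.contains(host)) {
                inactiveNodes.put(host, NodeState.LOST);
                ports.put(host, UNTRACKED_PORT);
            }
        }
    }

    // Mirrors the proposed RMNodeImpl transition: on first heartbeat,
    // drop the node from the inactive (LOST) set (implicitly decrementing
    // the LOST count) and move it into the active node set.
    public void onHeartbeat(String host, int realPort) {
        if (inactiveNodes.remove(host) != null) {
            ports.put(host, realPort);
        }
        activeNodes.add(host);
    }

    public int lostCount() { return inactiveNodes.size(); }
    public boolean isActive(String host) { return activeNodes.contains(host); }
    public int portOf(String host) { return ports.getOrDefault(host, -1); }
}
```

In this toy model the LOST count is simply the size of the inactive map, so removing a host on heartbeat stands in for the explicit "decrement the LOST node count" step the issue describes for the RM's JMX metrics.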