[ 
https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882403#comment-17882403
 ] 

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

arjunmohnot commented on PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2355769123

   Hi @slfan1989, the changes have been made and CI checks passed—could you 
kindly review when you get a chance? Thank you for your time and support!




> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
>                 Key: YARN-11730
>                 URL: https://issues.apache.org/jira/browse/YARN-11730
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, yarn
>    Affects Versions: 3.4.0
>         Environment: Tested on multiple environments:
> A. Docker Environment:
>  * Base OS: *Ubuntu 20.04*
>  * *Java 8* installed from OpenJDK.
>  * Docker image includes Hadoop binaries, user configurations, and ports for 
> YARN services.
>  * Verified behavior using a Hadoop snapshot in a containerized environment.
>  * Performed Namenode formatting and validated service interactions through 
> exposed ports.
>  * Repo reference: 
> [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]
> B. Bare-metal Distributed Setup (RedHat Linux):
>  * Running *Java 8* in a High-Availability (HA) configuration with 
> *Zookeeper* as the locking mechanism.
>  * Two ResourceManagers (RMs) in HA: failover tested between the HA1 and HA2 
> RM nodes, including state retention and proper node state transitions.
>  * Verified node state transitions during RM failover, ensuring nodes moved 
> between LOST, ACTIVE, and other states as expected.
>            Reporter: Arjun Mohnot
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the _"include"_ file 
> are not immediately reported until their corresponding NodeManagers (NMs) 
> send their first heartbeat. However, nodes in the _"exclude"_ file are 
> instantly reflected in the _"Decommissioned Hosts"_ section with a port 
> value of -1.
> This design creates several challenges:
>  * {*}Untracked NodeManagers{*}: During ResourceManager HA failover or a 
> standalone RM restart, some nodes may not report back, even though they are 
> listed in the _"include"_ file. These nodes neither appear in the _LOST_ 
> state nor are they represented in the RM's JMX metrics, leaving them in an 
> untracked state that is difficult to monitor. HDFS, by comparison, reports 
> similar hosts as {_}"DEAD"{_}.
>  * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until 
> they send their first heartbeat. This delay impacts real-time cluster 
> monitoring, leaving these nodes without immediate visibility in the 
> ResourceManager's view of the total node count.
>  * {*}Operational Impact{*}: These unreported nodes cause operational 
> difficulties, particularly in automated workflows such as OS Upgrade 
> Automation (OSUA), node recovery automation, and others where validation 
> depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, 
> {_}DECOMMISSIONED{_}, etc. Nodes that never report require ad-hoc workarounds 
> to determine their actual status.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the _LOST_ state 
> to any node listed in the _"include"_ file that is neither registered nor 
> listed in the _"exclude"_ file, at RM startup or HA failover. This can be 
> done by marking the node with a special port value {_}-2{_}, signaling that 
> the node is considered LOST but has not yet registered. Whenever a heartbeat 
> is received for that {color:#de350b}nodeID{color}, it will be transitioned 
> from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or whatever other state is 
> required.
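> The following sketch (not the actual patch) illustrates the startup/failover 
> side of this proposal in Java: every include-file host that is neither 
> registered nor excluded gets a synthetic inactive entry with sentinel port -2 
> and is counted as LOST. The include/exclude sets are taken as parameters 
> here, and {color:#de350b}createSyntheticLostNode{color} is a hypothetical 
> helper; the real change would obtain the host lists from the 
> _NodesListManager_ hosts reader and build a proper _RMNodeImpl_ in the _LOST_ 
> state.
> {code:java}
> import java.util.Set;
> import java.util.stream.Collectors;
> 
> import org.apache.hadoop.yarn.api.records.NodeId;
> import org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics;
> import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
> import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
> 
> public final class UnregisteredHostMarker {
> 
>   // Port -1 already flags excluded hosts that never registered; port -2
>   // (this proposal) flags include-file hosts that have not reported yet.
>   public static final int LOST_UNREGISTERED_PORT = -2;
> 
>   public static void markUnregisteredHostsAsLost(RMContext context,
>       Set<String> includeHosts, Set<String> excludeHosts) {
>     // Hosts that already have a registered NodeManager in the active map.
>     Set<String> registeredHosts = context.getRMNodes().keySet().stream()
>         .map(NodeId::getHost)
>         .collect(Collectors.toSet());
> 
>     for (String host : includeHosts) {
>       if (registeredHosts.contains(host) || excludeHosts.contains(host)) {
>         continue; // already tracked, or intentionally decommissioned
>       }
>       NodeId syntheticId = NodeId.newInstance(host, LOST_UNREGISTERED_PORT);
>       if (!context.getInactiveRMNodes().containsKey(syntheticId)) {
>         RMNode syntheticNode = createSyntheticLostNode(context, syntheticId);
>         context.getInactiveRMNodes().put(syntheticId, syntheticNode);
>         ClusterMetrics.getMetrics().incrNumLostNMs();
>       }
>     }
>   }
> 
>   // Hypothetical factory: the real change would construct an RMNodeImpl in
>   // the LOST state here.
>   private static RMNode createSyntheticLostNode(RMContext ctx, NodeId id) {
>     throw new UnsupportedOperationException("sketch only");
>   }
> }
> {code}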
> h3. Key implementation points
>  * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file that are not 
> part of the RM's active node context should be automatically marked as 
> {_}LOST{_}. This can be achieved by modifying the _NodesListManager_ in its 
> {color:#de350b}refreshHostsReader{color} method, which is invoked during 
> failover and manual node refresh operations. This logic should ensure that 
> all unregistered nodes are moved to the _LOST_ state, with port _-2_ 
> indicating that the node is untracked.
>  * For non-HA setups, this process can be triggered during RM service startup 
> to mark nodes as _LOST_ initially, and they will gradually transition to 
> their desired state when the heartbeat is received.
>  * Handle Node Heartbeat and Transition: When a node sends its first 
> heartbeat, the system should verify if the node is listed in 
> {color:#de350b}getInactiveRMNodes(){color}. If the node exists in the _LOST_ 
> state, the RM should remove it from the inactive list, decrement the _LOST_ 
> node count, and handle the transition back to the active node set.
>  * This logic can be placed in the state transition methods within 
> {color:#de350b}RMNodeImpl.java{color}, ensuring that nodes transition from 
> _NEW_ to the _LOST_ state and recover gracefully from _LOST_ upon receiving 
> their heartbeat (see the heartbeat-side sketch after this list).
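> A matching sketch for the heartbeat side of the points above: if the 
> reporting host was pre-marked LOST with port -2, the synthetic entry is 
> removed from {color:#de350b}getInactiveRMNodes(){color} and the LOST counter 
> is corrected before the node continues through its normal state transitions. 
> {color:#de350b}recoverFromSyntheticLost{color} is a hypothetical hook; where 
> exactly it is called from within _RMNodeImpl_ is an assumption of this 
> sketch.
> {code:java}
> import org.apache.hadoop.yarn.api.records.NodeId;
> import org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics;
> import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
> 
> public final class SyntheticLostRecovery {
> 
>   private static final int LOST_UNREGISTERED_PORT = -2;
> 
>   /**
>    * Called when a real NodeManager registers or heartbeats for a host that
>    * was pre-marked LOST with sentinel port -2: drop the synthetic inactive
>    * entry and fix the LOST counter so the node can proceed to RUNNING,
>    * UNHEALTHY, or any other state as usual.
>    */
>   public static void recoverFromSyntheticLost(RMContext context,
>       NodeId reportingNodeId) {
>     NodeId syntheticId =
>         NodeId.newInstance(reportingNodeId.getHost(), LOST_UNREGISTERED_PORT);
>     if (context.getInactiveRMNodes().remove(syntheticId) != null) {
>       // The counter was incremented when the host was pre-marked at
>       // startup/failover; undo it now that the node has really reported.
>       ClusterMetrics.getMetrics().decrNumLostNMs();
>     }
>   }
> }
> {code}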
> h3. Benefits
>  * {*}Improved Cluster Monitoring{*}: Automatically assigning a _LOST_ state 
> to nodes listed in the _"include"_ file but not reporting ensures that every 
> node in the cluster has a well-defined state ({_}ACTIVE{_}, {_}LOST{_}, 
> {_}DECOMMISSIONED{_}, {_}UNHEALTHY, etc{_}). This eliminates any potential 
> gaps in cluster node visibility and simplifies operational monitoring.
>  * {*}Better Recovery Management{*}: By marking unreported nodes as 
> {_}LOST{_}, automation can quickly identify which nodes require attention 
> during recovery efforts to restore cluster health. This prevents confusion 
> between unreachable nodes and untracked nodes, improving recovery accuracy.
>  * {*}Enhanced Cluster Stability{*}: This approach improves overall stability 
> by preventing nodes from slipping into an untracked or unknown state. It 
> guarantees that the system remains aware of all nodes, reducing issues during 
> RM failover or restart scenarios.
> h3. Additional Considerations
>  * Feature Flag Control: This feature can be enabled or disabled via a 
> configuration flag, allowing users to adjust the behavior to their 
> requirements. It is disabled ({_}False{_}) by default. A minimal sketch of 
> the flag check follows this list.
>  * Validation: The approach has been tested on both non-HA and HA setups, 
> and a dummy docker-based 
> [setup|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] has been 
> created to replicate the behavior. The required unit test cases have been 
> added to validate the new behavior. A demo 
> [video|https://drive.google.com/file/d/1okiPe7uMNVMRUnNYtz-B8Igf8FMGr-SJ/view?usp=sharing]
>  of this change is also available.
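> For the feature flag mentioned above, a minimal sketch of the opt-in check, 
> assuming a hypothetical property name (the real key and its default would be 
> defined in _YarnConfiguration_ / yarn-default.xml as part of the patch):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> 
> public final class UnregisteredNodeReportingConfig {
> 
>   // Hypothetical property name, for illustration only; disabled by default.
>   public static final String REPORT_UNREGISTERED_AS_LOST =
>       "yarn.resourcemanager.report-unregistered-nodes-as-lost";
>   public static final boolean DEFAULT_REPORT_UNREGISTERED_AS_LOST = false;
> 
>   public static boolean isEnabled(Configuration conf) {
>     return conf.getBoolean(REPORT_UNREGISTERED_AS_LOST,
>         DEFAULT_REPORT_UNREGISTERED_AS_LOST);
>   }
> }
> {code}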
>  
> Any thoughts/suggestions/feedback are welcome!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
