[ 
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272901#comment-15272901
 ] 

Daniel Zhi commented on YARN-4676:
----------------------------------

To clarify before I make code changes:

1. HostsFileReader currently allows multiple hosts per line. When a host name 
consists purely of digits, it becomes ambiguous with a timeout value during 
interpretation. Supporting pure-digit hosts would likely require that such a 
host start on a new line.
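To illustrate, here is a hypothetical parser sketch (the line format, class 
name, and tokenizer are assumptions for illustration, not the actual 
HostsFileReader code): a digit-only token that follows a host name is consumed 
as that host's timeout, so a pure-digit host can only be recognized 
unambiguously when it starts a line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the ambiguity: "host [timeout]" entries on one line.
public class HostLineParser {
  // Maps host -> timeout seconds (null when no timeout was given).
  static Map<String, Integer> parseLine(String line) {
    Map<String, Integer> hosts = new LinkedHashMap<>();
    String prevHost = null;
    for (String token : line.trim().split("\\s+")) {
      if (token.matches("\\d+") && prevHost != null) {
        // A digit-only token after a host is read as that host's timeout,
        // so it can never be recognized as a pure-digit host name here.
        hosts.put(prevHost, Integer.parseInt(token));
        prevHost = null;
      } else {
        hosts.put(token, null);
        prevHost = token;
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    // "1800" binds to hostA as a timeout; hostB has none.
    System.out.println(parseLine("hostA 1800 hostB"));
    // A pure-digit host is only unambiguous at the start of a line.
    System.out.println(parseLine("1234"));
  }
}
```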
2. -1 means infinite timeout (wait forever until ready). null means no 
override; use the default timeout.
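These semantics can be sketched as follows (the helper and constant names are 
hypothetical, not from the patch):

```java
// Hypothetical sketch of the timeout override semantics described above.
public class DecommissionTimeout {
  static final long WAIT_FOREVER_SEC = Long.MAX_VALUE; // stand-in for "infinite"

  // null  -> no override, use the configured default
  // -1    -> infinite timeout (wait forever until ready)
  // other -> use the requested per-request value
  static long resolveTimeoutSec(Integer requestedSec, long configuredDefaultSec) {
    if (requestedSec == null) {
      return configuredDefaultSec;
    }
    if (requestedSec == -1) {
      return WAIT_FOREVER_SEC;
    }
    return requestedSec;
  }

  public static void main(String[] args) {
    System.out.println(resolveTimeoutSec(null, 3600)); // default applies
  }
}
```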
3. There could be a large number of hosts to decommission, so a single line 
could be huge; grepping for a particular host would then return that huge 
line. A compromise would be to log on a single line when there are fewer than 
N hosts and on multiple lines otherwise. That said, I am OK with changing to 
a single line.
4. Simple after 1).
5. Same as 2).
6. OK.
7. How about DEFAULT_NM_EXIT_WAIT_MS = 0, so that it can be customized in 
cases where the delay is preferred?
8. The grace period gives the RM server side a chance to DECOMMISSION the node 
once the timeout is reached. A much smaller period, like 2 seconds, would most 
likely be sufficient, as the NodeManager heartbeats every second, during which 
a DECOMMISSIONING node is re-evaluated and decommissioned if it is ready or 
has timed out.
9. "yarn rmadmin -refreshNodes -g -1" waits forever until the node is ready. 
"yarn rmadmin -refreshNodes -g" uses the default timeout as specified by the 
configuration key.
10. Same as 2).
11. OK.
12. See 7).
13. OK.
14. Here is an example of the tabular logging. Keeping a DECOMMISSIONED node 
in the list a little longer prevents it from suddenly disappearing right 
after it is DECOMMISSIONED.
2015-08-14 20:31:00,797 INFO org.apache.hadoop.yarn.server.resourcemanager.DecommissioningNodesWatcher (IPC Server handler 14 on 9023): Decommissioning Nodes: 
  ip-10-45-166-151.ec2.internal        20s fresh:  0s containers:14 WAIT_CONTAINER timeout:1779s
    application_1439334429355_0004 RUNNING MAPREDUCE  7.50%    55s
  ip-10-170-95-251.ec2.internal        20s fresh:  0s containers:14 WAIT_CONTAINER timeout:1779s
    application_1439334429355_0004 RUNNING MAPREDUCE  7.50%    55s
  ip-10-29-137-237.ec2.internal        19s fresh:  0s containers:14 WAIT_CONTAINER timeout:1780s
    application_1439334429355_0004 RUNNING MAPREDUCE  7.50%    55s
  ip-10-157-4-26.ec2.internal          19s fresh:  0s containers:14 WAIT_CONTAINER timeout:1780s
    application_1439334429355_0004 RUNNING MAPREDUCE  7.50%    55s

15. I agree that getDecommissioningStatus suggests the call is read-only. 
Since completed apps need to be taken into account when evaluating the 
readiness of the node, and getDecommissioningStatus is actually a private 
method used internally, it could be renamed to private 
checkDecommissioningStatus(nodeId).

16. readDecommissioningTimeout is there to pick up a new value without 
restarting the RM. It was requested by EMR customers, and I do see the user 
scenarios. It is only invoked when there are DECOMMISSIONED nodes, and at 
most once every 20 seconds (the poll period). I would have to maintain a 
private patch or consider other options if the feature were removed.

17. ok
18. The method returns the number of seconds until timeout. I don't mind 
changing the name to getTimeoutTimestampInSec() but I don't see the reasoning 
behind it.

19. See the example in 14). This is logged once every 20 seconds and was very 
useful during my development and testing of the work. I see more value in 
leaving it as INFO, but as the code becomes mature and stable it may be OK to 
turn it into DEBUG.

20. ok
21. The isValidNode() && isNodeInDecommissioning() condition is just a very 
quick, shallow check: for a DECOMMISSIONING node, although nodesListManager 
would return false for isValidNode() since the node appears in the excluded 
host list, such a node is allowed to continue because it is in the middle of 
DECOMMISSIONING. During processing of the heartbeat, decommissioningWatcher 
is updated with the latest container status of the node; later, 
decomWatcher.checkReadyToBeDecommissioned(rmNode.getNodeID()) evaluates its 
readiness and DECOMMISSIONs the node if it is ready (including on timeout).
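The effective logic of that shallow check can be sketched as a stand-alone 
predicate (a simplified illustration, not the actual ResourceTrackerService 
code; it only captures the "excluded but mid-DECOMMISSIONING nodes are still 
allowed" behavior described above):

```java
// Hypothetical sketch: whether a heartbeating node is allowed to continue.
public class DecommissioningHeartbeatCheck {
  // A node is allowed through when it is still valid, or when it is
  // excluded but in the middle of DECOMMISSIONING (so its containers can
  // drain before the watcher transitions it to DECOMMISSIONED).
  static boolean allowHeartbeat(boolean isValidNode, boolean isNodeInDecommissioning) {
    return isValidNode || isNodeInDecommissioning;
  }

  public static void main(String[] args) {
    // Excluded (invalid) but DECOMMISSIONING: still allowed to heartbeat.
    System.out.println(allowHeartbeat(false, true));
  }
}
```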

22. The call simply returns if it is within 20 seconds of the last call. 
Currently it lives inside ResourceTrackerService and uses rmContext. 
Alternatively, DecommissioningNodesWatcher could be constructed with 
rmContext and have its own internal polling thread. Other than not yet being 
sure which code pattern to use for such an internal thread, that appears to 
be a valid alternative to me.
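The 20-second early-return can be sketched as a small throttle (the class and 
method names are hypothetical; the real code keeps equivalent state inside 
ResourceTrackerService):

```java
// Hypothetical sketch of the "return early if polled within the last
// interval" behavior described above.
public class PollThrottle {
  private final long intervalMs;
  private long lastPollMs = Long.MIN_VALUE / 2; // "never polled" sentinel

  PollThrottle(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  // Returns true at most once per interval; otherwise the caller simply
  // returns without doing the poll work.
  boolean shouldPoll(long nowMs) {
    if (nowMs - lastPollMs < intervalMs) {
      return false;
    }
    lastPollMs = nowMs;
    return true;
  }

  public static void main(String[] args) {
    PollThrottle throttle = new PollThrottle(20_000); // 20-second poll period
    System.out.println(throttle.shouldPoll(0)); // first call always polls
  }
}
```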

23. ok
24. ok
25. Instead of disallowing and exiting, an alternative is to allow the 
graceful decommission as usual. There is no difference if the RM does not 
restart during the session. If the RM restarts, currently all excluded nodes 
are decommissioned right away; enhanced support in the future would resume 
the session.


> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, 
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch, 
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch, 
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch, 
> YARN-4676.012.patch, YARN-4676.013.patch
>
>
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to 
> gracefully decommission YARN nodes. After the user issues the refreshNodes 
> request, the ResourceManager automatically evaluates the status of all 
> affected nodes to kick off decommission or recommission actions. The RM 
> asynchronously tracks container and application status related to 
> DECOMMISSIONING nodes in order to decommission the nodes immediately after 
> they are ready to be decommissioned. Decommissioning timeout at 
> individual-node granularity is supported and can be dynamically updated. 
> The mechanism naturally supports multiple independent graceful 
> decommissioning “sessions”, where each one involves different sets of nodes 
> with different timeout settings. Such support is ideal and necessary for 
> graceful decommission requests issued by external cluster management 
> software instead of a human.
> DecommissioningNodeWatcher inside ResourceTrackingService tracks 
> DECOMMISSIONING node status automatically and asynchronously after the 
> client/admin makes the graceful decommission request. It tracks 
> DECOMMISSIONING node status to decide when the node, after all running 
> containers on it have completed, will be transitioned into the 
> DECOMMISSIONED state. NodesListManager detects and handles include and 
> exclude list changes to kick off decommission or recommission as necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
