[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169299#comment-14169299
 ] 

Karthik Kambatla commented on YARN-2641:


bq. If NodeListManager#refreshNodes happens right after 
NodeListManager#isValidNode and before a new RMNode is created in 
ResourceTrackerService#registerNodeManager.

Issuing yarn rmadmin -refreshNodes on the RM and the NM registering with the 
RM are inherently racy, and I don't think we need to serialize them. 

 improve node decommission latency in RM.
 

 Key: YARN-2641
 URL: https://issues.apache.org/jira/browse/YARN-2641
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, 
 YARN-2641.002.patch


 Improve node decommission latency in RM. 
 Currently, node decommission only happens after the RM receives a nodeHeartbeat 
 from the Node Manager. The node heartbeat interval is configurable; the 
 default value is 1 second.
 It would be better to do the decommission during the RM refresh (NodesListManager) 
 instead of during nodeHeartbeat (ResourceTrackerService).
 There is also a much more serious issue:
 after the RM is refreshed (refreshNodes), if the NM to be decommissioned is 
 killed before it sends a heartbeat to the RM, the RMNode will never be 
 decommissioned in the RM. The RMNode will only expire in the RM after 
 yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-13 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170011#comment-14170011
 ] 

zhihai xu commented on YARN-2641:
-

Hi [~kasha], yes, they are inherently racy. The user should know whether the 
node is already registered before decommissioning it.
I attached a new patch, YARN-2641.003.patch, which removes the lock in 
ResourceTrackerService#registerNodeManager.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168564#comment-14168564
 ] 

zhihai xu commented on YARN-2641:
-

[~kasha], thanks for reviewing the patch.
Just synchronizing on hostsReader in NodeListManager#isValidNode is not enough.
There is a race condition between ResourceTrackerService#registerNodeManager 
and NodeListManager#refreshNodes:
if NodeListManager#refreshNodes happens right after NodeListManager#isValidNode 
and before a new RMNode is created in ResourceTrackerService#registerNodeManager,
the node to be decommissioned will be added after NodeListManager#refreshNodes,
and it will not be decommissioned until the next time 
NodeListManager#refreshNodes is called.

By synchronizing both isValidNode and the creation of the new RMNode on 
hostsReader in ResourceTrackerService#registerNodeManager,
we can make sure NodeListManager#refreshNodes runs either before 
isValidNode or after the new RMNode is created.
If it runs before the node is checked, isValidNode will return false, which will 
shut down the node.
If it runs after the RMNode is created, the newly created RMNode will be 
decommissioned by NodeListManager#refreshNodes immediately.
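
The locking idea proposed in this comment can be sketched with a toy model. This is only an illustration under stated assumptions: the class RegisterRefreshSketch, its methods, and the hostsLock field are hypothetical stand-ins, not the YARN classes or the contents of any attached patch. It shows that when the check-and-create step and the refresh step share one lock, a node on the exclude list ends up either rejected or decommissioned, never left running.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model only; all names here are hypothetical, not the YARN code. */
public class RegisterRefreshSketch {
  public enum State { RUNNING, DECOMMISSIONED }

  // Stand-in for the hostsReader object that both code paths synchronize on.
  private final Object hostsLock = new Object();
  // Stand-in for the exclude list loaded by refreshNodes.
  private final Set<String> excluded = new HashSet<>();
  private final Map<String, State> nodes = new ConcurrentHashMap<>();

  /** Validity check and node creation happen under one lock. */
  public boolean register(String host) {
    synchronized (hostsLock) {
      if (excluded.contains(host)) {
        return false; // the caller would tell the node to shut down
      }
      nodes.put(host, State.RUNNING);
      return true;
    }
  }

  /** Reloads the exclude list and decommissions every known node on it. */
  public void refresh(Set<String> newExcluded) {
    synchronized (hostsLock) {
      excluded.clear();
      excluded.addAll(newExcluded);
      for (String host : newExcluded) {
        nodes.computeIfPresent(host, (h, s) -> State.DECOMMISSIONED);
      }
    }
  }

  public State stateOf(String host) {
    return nodes.get(host);
  }

  public static void main(String[] args) throws InterruptedException {
    RegisterRefreshSketch rm = new RegisterRefreshSketch();
    Thread registering = new Thread(() -> rm.register("nm1"));
    Thread refreshing = new Thread(() -> rm.refresh(Collections.singleton("nm1")));
    registering.start();
    refreshing.start();
    registering.join();
    refreshing.join();
    // Either ordering is possible, but the result is never RUNNING:
    // null (registration rejected) or DECOMMISSIONED.
    System.out.println(rm.stateOf("nm1"));
  }
}
```

Without the shared lock, refresh could run between the exclude-list check and the insertion into the node map, which is the window described above.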



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168743#comment-14168743
 ] 

zhihai xu commented on YARN-2641:
-

I attached a new patch, YARN-2641.002.patch, which adds a comment in 
ResourceTrackerService#registerNodeManager about this race condition.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168767#comment-14168767
 ] 

Hadoop QA commented on YARN-2641:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12674426/YARN-2641.002.patch
  against trunk revision e8a31f2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5371//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5371//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5371//console

This message is automatically generated.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-11 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168278#comment-14168278
 ] 

Karthik Kambatla commented on YARN-2641:


Thanks for clarifying it, Zhihai. Verified that if we decommission the node 
after the NM goes down, the tasks are rescheduled only when the liveliness 
monitor kicks in. 



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-11 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168283#comment-14168283
 ] 

zhihai xu commented on YARN-2641:
-

[~kasha], thanks for spending so much effort verifying the issue.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-11 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168333#comment-14168333
 ] 

Karthik Kambatla commented on YARN-2641:


Patch looks good to me, except for the following: 
- Are the changes to ResourceTrackerService#registerNodeManager required? 
NodeListManager#isValidNode is synchronized on hostsReader and that should be 
sufficient. No? 



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-10 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167692#comment-14167692
 ] 

Wei Yan commented on YARN-2641:
---

bq. I think the actual decommission happens when the NM receives the shutdown 
command in the RM's heartbeat response, doesn't it? So the latency between the 
decommission CLI and the node getting decommissioned won't be affected. Also, in 
most cases, resource scheduling is triggered by the NM's heartbeat with the RM, 
so the latency between the decommission CLI and scheduling containers on nodes 
won't be affected (except for attempt scheduling). So IMO, this patch only 
improves the latency for the attempt scheduling case. Do we have other 
scenarios to address?

From my understanding, currently if an NM fails or is killed, the RM does not 
get that information until yarn.nm.liveness-monitor.expiry-interval-ms 
expires. That means all containers running on the failed NM are assumed to 
be still running by both the RM and the AMs until the timeout. However, 
[~zxu]'s point is that the RM doesn't need to wait that long to learn that the 
NM was killed; it can get this information directly when the 
refreshNodes command is triggered. For example, if the user removes one NM 
and then runs refreshNodes, the RM learns quickly that the NM is gone and 
can notify all applications about it without waiting for the 
heartbeat timeout, so the AMs can act on it quickly.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167915#comment-14167915
 ] 

Karthik Kambatla commented on YARN-2641:


I poked around on a cluster with 2 NMs. Submitted a sleep job with 4 mappers 
each sleeping for 10 minutes, the mappers got assigned 2 on each node. After 
the 4 mappers made some progress (11%), I decommissioned a node. When I 
decommissioned the node with AM, the AM died and the job restarted from 
scratch. When I decommissioned the node without the AM, the tasks immediately 
got re-scheduled onto the active node (job progress came down to 6% before 
going up again). 



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167920#comment-14167920
 ] 

Karthik Kambatla commented on YARN-2641:


I don't see the 10 mins wait happening either in practice or in my casual 
observation of the code. I can see how we can improve the decommissioning 
latency by the node-heartbeat-interval, but not more. Am I missing something 
here?



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167985#comment-14167985
 ] 

zhihai xu commented on YARN-2641:
-

Sorry, I didn't describe the second scenario clearly: we need to first kill the 
NM process and then call the refreshNodes CLI to put the node in the blacklist. 
To make the refreshNodes CLI work correctly, we need to create a file, for 
example exclude_host.txt, containing the name of the node to decommission, and 
then set yarn.resourcemanager.nodes.exclude-path to that file.
See the code in TestResourceTrackerService.java for the configuration and the 
node list file:
{code}
// Point the RM at the exclude file.
conf.set(YarnConfiguration.RM_NODES_EXCLUDE_FILE_PATH,
    hostFile.getAbsolutePath());

// Write one host name per line into the exclude file.
private void writeToHostsFile(String... hosts) throws IOException {
  if (!hostFile.exists()) {
    TEMP_DIR.mkdirs();
    hostFile.createNewFile();
  }
  FileOutputStream fStream = null;
  try {
    fStream = new FileOutputStream(hostFile);
    for (int i = 0; i < hosts.length; i++) {
      fStream.write(hosts[i].getBytes());
      fStream.write("\n".getBytes());
    }
  } finally {
    if (fStream != null) {
      IOUtils.closeStream(fStream);
      fStream = null;
    }
  }
}
{code}
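
For readers who want to produce such an exclude file outside the Hadoop test harness, a minimal standalone sketch follows. It uses only the JDK; the class ExcludeFileSketch and the method writeHosts are hypothetical names for illustration, not part of Hadoop. It writes one host name per line, which is the layout the test helper above produces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Illustrative helper only; not part of Hadoop or of the attached patches. */
public class ExcludeFileSketch {

  /** Writes one host name per line, replacing any previous content. */
  public static void writeHosts(Path file, String... hosts) throws IOException {
    if (file.getParent() != null) {
      Files.createDirectories(file.getParent());
    }
    // By default Files.write creates the file or truncates an existing one.
    Files.write(file, Arrays.asList(hosts), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // A temp file stands in for exclude_host.txt.
    Path file = Files.createTempFile("exclude_host", ".txt");
    writeHosts(file, "host1", "host2");
    System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    Files.delete(file);
  }
}
```

On a real cluster the file would then be referenced by yarn.resourcemanager.nodes.exclude-path and picked up by yarn rmadmin -refreshNodes, as described in the comment.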



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-08 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163348#comment-14163348
 ] 

Junping Du commented on YARN-2641:
--

[~zxu], please see my comments inline.
bq. Did you still see the decommission happen after the heartbeat back to NM in 
the patch?
I think the *actual* decommission happens when the NM receives the shutdown 
command in the RM's heartbeat response, doesn't it? So the latency between the 
decommission CLI and the node getting decommissioned won't be affected. Also, in 
most cases, resource scheduling is triggered by the NM's heartbeat with the RM, 
so the latency between the decommission CLI and scheduling containers on nodes 
won't be affected (except for attempt scheduling). So IMO, this patch only 
improves the latency for the attempt scheduling case. Do we have other 
scenarios to address?



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164152#comment-14164152
 ] 

zhihai xu commented on YARN-2641:
-

Hi [~djp], thanks for the explanation; this is a good discussion.

The patch also addresses another scenario:
when the RM receives the decommission CLI, it can't receive the NM's heartbeat 
to send the shutdown command to the NM to be decommissioned, because that NM 
is killed right after the RM receives the decommission CLI. In this 
scenario, the RMNode will never be decommissioned in the RM. The RMNode will 
only expire (RMNodeEventType.EXPIRE) in the RM (NMLivelinessMonitor) after 
yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes).

About the first scenario:
for the attempt scheduling case, before creating this JIRA I thought about 
checking in the scheduler whether the node is valid before allocating a 
container on the node. But I think this may not be a good approach, because the 
scheduler is always the bottleneck in the RM; if we can offload work from the 
scheduler, it would be better to do so.

What do you think? Thanks.
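
The killed-NM scenario in this comment can be made concrete with a small model. This is a sketch under stated assumptions, not the RM implementation: the class DecommissionLatencySketch, its methods, and the LOST state name used for an expired node are all hypothetical choices for this toy. Only the 10-minute expiry default is taken from the discussion. It contrasts decommissioning on heartbeat with decommissioning during refreshNodes when the NM never sends another heartbeat.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only; all names here are hypothetical, not the YARN code. */
public class DecommissionLatencySketch {
  public enum State { RUNNING, DECOMMISSIONED, LOST }

  // Default expiry interval cited in the thread: 10 minutes.
  public static final long EXPIRY_MS = 10 * 60 * 1000L;

  private final boolean decommissionOnRefresh;
  private final Map<String, State> nodes = new HashMap<>();
  private final Map<String, Long> lastHeartbeat = new HashMap<>();
  private final Set<String> excluded = new HashSet<>();

  public DecommissionLatencySketch(boolean decommissionOnRefresh) {
    this.decommissionOnRefresh = decommissionOnRefresh;
  }

  public void register(String host, long nowMs) {
    nodes.put(host, State.RUNNING);
    lastHeartbeat.put(host, nowMs);
  }

  /** Reloads the exclude list; optionally decommissions right away. */
  public void refreshNodes(Set<String> newExcluded) {
    excluded.clear();
    excluded.addAll(newExcluded);
    if (decommissionOnRefresh) {
      for (String host : excluded) {
        nodes.computeIfPresent(host, (h, s) -> State.DECOMMISSIONED);
      }
    }
  }

  /** Heartbeat-driven path: an excluded node is decommissioned here. */
  public void heartbeat(String host, long nowMs) {
    if (nodes.get(host) != State.RUNNING) {
      return;
    }
    lastHeartbeat.put(host, nowMs);
    if (excluded.contains(host)) {
      nodes.put(host, State.DECOMMISSIONED);
    }
  }

  /** Liveness check: a running node silent for EXPIRY_MS is marked LOST. */
  public void checkLiveness(long nowMs) {
    for (Map.Entry<String, State> e : nodes.entrySet()) {
      long silentMs = nowMs - lastHeartbeat.get(e.getKey());
      if (e.getValue() == State.RUNNING && silentMs >= EXPIRY_MS) {
        e.setValue(State.LOST);
      }
    }
  }

  public State stateOf(String host) {
    return nodes.get(host);
  }

  public static void main(String[] args) {
    for (boolean onRefresh : new boolean[] {false, true}) {
      DecommissionLatencySketch rm = new DecommissionLatencySketch(onRefresh);
      rm.register("nm1", 0L);
      rm.refreshNodes(Collections.singleton("nm1"));
      // The NM is killed here, so heartbeat("nm1", ...) is never called again.
      System.out.println("onRefresh=" + onRefresh
          + " state right after refresh: " + rm.stateOf("nm1"));
    }
  }
}
```

In the heartbeat-driven variant the node only leaves RUNNING once checkLiveness is called with a time at least EXPIRY_MS after the last heartbeat; in the refresh-driven variant it is DECOMMISSIONED as soon as refreshNodes returns.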



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161658#comment-14161658
 ] 

Hadoop QA commented on YARN-2641:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12673309/YARN-2641.000.patch
  against trunk revision 0fb2735.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5302//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5302//console

This message is automatically generated.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161660#comment-14161660
 ] 

Junping Du commented on YARN-2641:
--

The idea here sounds interesting. However, the decommission still happens 
after the heartbeat back to the NM. Am I missing something here?



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161664#comment-14161664
 ] 

Hadoop QA commented on YARN-2641:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12673311/YARN-2641.001.patch
  against trunk revision 0fb2735.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5303//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5303//console

This message is automatically generated.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-07 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162740#comment-14162740
 ] 

zhihai xu commented on YARN-2641:
-

Hi [~djp], thanks for reviewing the patch. I removed the following RMNode 
decommission from nodeHeartbeat (ResourceTrackerService.java).

{code}
this.rmContext.getDispatcher().getEventHandler().handle(
    new RMNodeEvent(nodeId, RMNodeEventType.DECOMMISSION));
{code}

I added the RMNode decommission to refreshNodes (NodesListManager.java).

Do you still see the decommission happen after the heartbeat back to the NM in 
the patch?

My first patch (YARN-2641.000.patch) did not include a unit test.

In my second patch (YARN-2641.001.patch), I changed the unit test in 
TestResourceTrackerService to verify that RMNodeEventType.DECOMMISSION is sent 
in {code}rm.getNodesListManager().refreshNodes(conf);{code} instead of
{code}nodeHeartbeat = nm1.nodeHeartbeat(true);{code}
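
For illustration only, the change described above can be sketched without any 
Hadoop dependencies. This is not the code from the patch: the class 
RefreshDecommissionSketch, the NodesList type, its methods, and the string 
events are all hypothetical stand-ins. The sketch only shows the idea that 
refreshNodes itself emits a DECOMMISSION event for each excluded node, so no 
heartbeat from the node is needed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RefreshDecommissionSketch {

    enum NodeState { RUNNING, DECOMMISSIONED }

    /** Hypothetical stand-in for the RM-side node list; not the real NodesListManager. */
    static class NodesList {
        private final Map<String, NodeState> nodes = new LinkedHashMap<>();
        private final List<String> events = new ArrayList<>();

        void register(String host) {
            nodes.put(host, NodeState.RUNNING);
        }

        /**
         * Decommission every running node that appears in the exclude set.
         * The event is recorded at refresh time, without waiting for a heartbeat.
         */
        void refreshNodes(Set<String> excludedHosts) {
            for (Map.Entry<String, NodeState> entry : nodes.entrySet()) {
                boolean excluded = excludedHosts.contains(entry.getKey());
                if (excluded && entry.getValue() == NodeState.RUNNING) {
                    entry.setValue(NodeState.DECOMMISSIONED);
                    events.add("DECOMMISSION:" + entry.getKey());
                }
            }
        }

        NodeState stateOf(String host) {
            return nodes.get(host);
        }

        List<String> events() {
            return events;
        }
    }

    public static void main(String[] args) {
        NodesList list = new NodesList();
        list.register("nm1");
        list.register("nm2");

        // Simulate a refresh after "nm1" was added to the exclude list.
        list.refreshNodes(Collections.singleton("nm1"));

        System.out.println("nm1: " + list.stateOf("nm1"));
        System.out.println("nm2: " + list.stateOf("nm2"));
        System.out.println("events: " + list.events());
    }
}
```

In this sketch a node that is excluded and then never heartbeats again is 
still marked DECOMMISSIONED, which is the behavior the unit test change is 
meant to verify.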




[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162764#comment-14162764
 ] 

Hadoop QA commented on YARN-2641:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12673446/YARN-2641.002.patch
  against trunk revision 9b8a35a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5316//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5316//console

This message is automatically generated.



[jira] [Commented] (YARN-2641) improve node decommission latency in RM.

2014-10-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161299#comment-14161299
 ] 

Hadoop QA commented on YARN-2641:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12673239/YARN-2641.000.patch
  against trunk revision 519e5a7.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5293//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5293//console

This message is automatically generated.

 improve node decommission latency in RM.
 

 Key: YARN-2641
 URL: https://issues.apache.org/jira/browse/YARN-2641
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2641.000.patch


 improve node decommission latency in RM. 
 Currently the node decommission only happened after RM received nodeHeartbeat 
 from the Node Manager. The node heartbeat interval is configurable. The 
 default value is 1 second.
 It will be better to do the decommission during RM Refresh(NodesListManager) 
 instead of nodeHeartbeat(ResourceTrackerService).


