[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation

2015-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263472#comment-14263472
 ] 

Hadoop QA commented on YARN-2637:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689916/YARN-2637.23.patch
  against trunk revision 947578c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 8 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestFifoScheduler
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6235//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6235//console

This message is automatically generated.

 maximum-am-resource-percent could be violated when resource of AM is 
 > minimumAllocation
 

 Key: YARN-2637
 URL: https://issues.apache.org/jira/browse/YARN-2637
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, 
 YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, 
 YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, 
 YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, 
 YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, 
 YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch


 Currently, the number of AMs in a leaf queue is calculated in the following way:
 {code}
 max_am_resource = queue_max_capacity * maximum_am_resource_percent
 #max_am_number = max_am_resource / minimum_allocation
 #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
 {code}
 And when a new application is submitted to the RM, it checks whether the app 
 can be activated in the following way:
 {code}
 for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator();
      i.hasNext(); ) {
   FiCaSchedulerApp application = i.next();

   // Check queue limit
   if (getNumActiveApplications() >= getMaximumActiveApplications()) {
     break;
   }

   // Check user limit
   User user = getUser(application.getUser());
   if (user.getActiveApplications() <
       getMaximumActiveApplicationsPerUser()) {
     user.activateApplication();
     activeApplications.add(application);
     i.remove();
     LOG.info("Application " + application.getApplicationId() +
         " from user: " + application.getUser() +
         " activated in queue: " + getQueueName());
   }
 }
 {code}
 For example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, 
 the maximum resource that AMs may use is 200M. Assuming 
 minimum_allocation = 1M, up to 200 AMs can be launched; if the user actually 
 uses 5M for each AM (> minimum_allocation), all apps can still be activated, 
 and the AMs will occupy all resources of the queue instead of only 
 max_am_resource_percent of the queue.
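 
 To make the arithmetic concrete, here is a minimal, self-contained sketch of 
 the mismatch; the values match the example above, and the class and variable 
 names are made up for illustration, not taken from ResourceManager code:
 {code}
 // Toy illustration of the miscalculation; all names/values are hypothetical.
 public class AmLimitExample {
   public static void main(String[] args) {
     int queueCapacityMb = 1000;     // ~1G queue capacity
     double maxAmPercent = 0.2;      // maximum-am-resource-percent
     int minimumAllocationMb = 1;    // minimum_allocation
     int actualAmSizeMb = 5;         // what each AM really requests

     int maxAmResourceMb = (int) (queueCapacityMb * maxAmPercent); // 200M
     int maxAmNumber = maxAmResourceMb / minimumAllocationMb;      // 200 AMs

     // The activation check counts applications, not resources, so all
     // 200 AMs may be activated even though each one actually uses 5M:
     System.out.println("intended AM limit: " + maxAmResourceMb + "M");
     System.out.println("possible AM usage: "
         + (maxAmNumber * actualAmSizeMb) + "M"); // 1000M, the whole queue
   }
 }
 {code}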



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation

2015-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263537#comment-14263537
 ] 

Hadoop QA commented on YARN-2637:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689921/YARN-2637.25.patch
  against trunk revision 947578c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6236//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6236//console

This message is automatically generated.

 maximum-am-resource-percent could be violated when resource of AM is 
 > minimumAllocation
 

 Key: YARN-2637
 URL: https://issues.apache.org/jira/browse/YARN-2637
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, 
 YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, 
 YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, 
 YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, 
 YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, 
 YARN-2637.25.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch


 Currently, the number of AMs in a leaf queue is calculated in the following way:
 {code}
 max_am_resource = queue_max_capacity * maximum_am_resource_percent
 #max_am_number = max_am_resource / minimum_allocation
 #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
 {code}
 And when a new application is submitted to the RM, it checks whether the app 
 can be activated in the following way:
 {code}
 for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator();
      i.hasNext(); ) {
   FiCaSchedulerApp application = i.next();

   // Check queue limit
   if (getNumActiveApplications() >= getMaximumActiveApplications()) {
     break;
   }

   // Check user limit
   User user = getUser(application.getUser());
   if (user.getActiveApplications() <
       getMaximumActiveApplicationsPerUser()) {
     user.activateApplication();
     activeApplications.add(application);
     i.remove();
     LOG.info("Application " + application.getApplicationId() +
         " from user: " + application.getUser() +
         " activated in queue: " + getQueueName());
   }
 }
 {code}
 For example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, 
 the maximum resource that AMs may use is 200M. Assuming 
 minimum_allocation = 1M, up to 200 AMs can be launched; if the user actually 
 uses 5M for each AM (> minimum_allocation), all apps can still be activated, 
 and the AMs will occupy all resources of the queue instead of only 
 max_am_resource_percent of the queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2015-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263543#comment-14263543
 ] 

Hudson commented on YARN-2991:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #58 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/58/])
YARN-2991. Fixed DrainDispatcher to reuse the draining code path in 
AsyncDispatcher. Contributed by Rohith Sharmaks. (zjshen: rev 
947578c1c1413f9043ceb1e87df6a97df048e854)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* hadoop-yarn-project/CHANGES.txt


 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker
 Fix For: 2.7.0

 Attachments: 0001-YARN-2991.patch, 0002-YARN-2991.patch


 {code}
 Error Message
 test timed out after 60000 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 60000 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at 
 org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2015-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263499#comment-14263499
 ] 

Hudson commented on YARN-2991:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #61 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/61/])
YARN-2991. Fixed DrainDispatcher to reuse the draining code path in 
AsyncDispatcher. Contributed by Rohith Sharmaks. (zjshen: rev 
947578c1c1413f9043ceb1e87df6a97df048e854)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java


 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker
 Fix For: 2.7.0

 Attachments: 0001-YARN-2991.patch, 0002-YARN-2991.patch


 {code}
 Error Message
 test timed out after 60000 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 60000 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at 
 org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2015-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263547#comment-14263547
 ] 

Hudson commented on YARN-2991:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1993 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1993/])
YARN-2991. Fixed DrainDispatcher to reuse the draining code path in 
AsyncDispatcher. Contributed by Rohith Sharmaks. (zjshen: rev 
947578c1c1413f9043ceb1e87df6a97df048e854)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* hadoop-yarn-project/CHANGES.txt


 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker
 Fix For: 2.7.0

 Attachments: 0001-YARN-2991.patch, 0002-YARN-2991.patch


 {code}
 Error Message
 test timed out after 60000 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 60000 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at 
 org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2015-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263507#comment-14263507
 ] 

Hudson commented on YARN-2991:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #795 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/795/])
YARN-2991. Fixed DrainDispatcher to reuse the draining code path in 
AsyncDispatcher. Contributed by Rohith Sharmaks. (zjshen: rev 
947578c1c1413f9043ceb1e87df6a97df048e854)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java


 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker
 Fix For: 2.7.0

 Attachments: 0001-YARN-2991.patch, 0002-YARN-2991.patch


 {code}
 Error Message
 test timed out after 60000 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 60000 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at 
 org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation

2015-01-03 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-2637:
--
Attachment: YARN-2637.25.patch

 maximum-am-resource-percent could be violated when resource of AM is 
 > minimumAllocation
 

 Key: YARN-2637
 URL: https://issues.apache.org/jira/browse/YARN-2637
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, 
 YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, 
 YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, 
 YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, 
 YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, 
 YARN-2637.25.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch


 Currently, the number of AMs in a leaf queue is calculated in the following way:
 {code}
 max_am_resource = queue_max_capacity * maximum_am_resource_percent
 #max_am_number = max_am_resource / minimum_allocation
 #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
 {code}
 And when a new application is submitted to the RM, it checks whether the app 
 can be activated in the following way:
 {code}
 for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator();
      i.hasNext(); ) {
   FiCaSchedulerApp application = i.next();

   // Check queue limit
   if (getNumActiveApplications() >= getMaximumActiveApplications()) {
     break;
   }

   // Check user limit
   User user = getUser(application.getUser());
   if (user.getActiveApplications() <
       getMaximumActiveApplicationsPerUser()) {
     user.activateApplication();
     activeApplications.add(application);
     i.remove();
     LOG.info("Application " + application.getApplicationId() +
         " from user: " + application.getUser() +
         " activated in queue: " + getQueueName());
   }
 }
 {code}
 For example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, 
 the maximum resource that AMs may use is 200M. Assuming 
 minimum_allocation = 1M, up to 200 AMs can be launched; if the user actually 
 uses 5M for each AM (> minimum_allocation), all apps can still be activated, 
 and the AMs will occupy all resources of the queue instead of only 
 max_am_resource_percent of the queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately

2015-01-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263602#comment-14263602
 ] 

Varun Saxena commented on YARN-2958:


[~zjshen], thanks for the review. Please find my replies below.

bq. No need to add isUpdateSeqNo. Updating a non-existing znode is storing a 
DT, we should update the seq number of it. So we just need to use isUpdate
The reason I added this new flag is that when we update the delegation token, 
we first check whether the znode exists or not, and if it doesn't exist we 
store it as a new token (not update it). In this case, I think we should not 
overwrite the sequence number. I am not sure whether non-existence of the znode 
while updating a DT is a valid use case (I could not think of any) or just 
defensive programming, but either way we store the DT if the znode to be 
updated is not found. Refer to the code below.
{code:title=ZKRMStateStore.java}
protected synchronized void updateRMDelegationTokenAndSequenceNumberInternal(
    RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate,
    int latestSequenceNumber) throws Exception {
  ...
  if (existsWithRetries(nodeRemovePath, true) == null) {
    // in case znode doesn't exist
    addStoreOrUpdateOps(
        opList, rmDTIdentifier, renewDate, false, false);
    LOG.debug("Attempted to update a non-existing znode " + nodeRemovePath);
  } else {
    // in case znode exists
    addStoreOrUpdateOps(
        opList, rmDTIdentifier, renewDate, true, false);
  }
  ..
}
{code}

bq. store|updateRMDelegationTokenAndSequenceNumber is better to be renamed to 
store|updateRMDelegationToken
Ok. Will change.

bq.  Instead of changing sequenceNumber to 0, can we set  to dtId1 and 
verify it later?
Will do so.

 RMStateStore seems to unnecessarily and wrongly store sequence number 
 separately
 

 Key: YARN-2958
 URL: https://issues.apache.org/jira/browse/YARN-2958
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Zhijie Shen
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-2958.001.patch, YARN-2958.002.patch, 
 YARN-2958.003.patch


 It seems that RMStateStore updates the last sequence number when storing or 
 updating each individual DT, in order to recover the latest sequence number 
 when the RM restarts.
 First, the current logic seems to be problematic:
 {code}
   public synchronized void updateRMDelegationTokenAndSequenceNumber(
       RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate,
       int latestSequenceNumber) {
     if (isFencedState()) {
       LOG.info("State store is in Fenced state. Can't update RM Delegation Token.");
       return;
     }
     try {
       updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate,
           latestSequenceNumber);
     } catch (Exception e) {
       notifyStoreOperationFailed(e);
     }
   }
 {code}
 {code}
   @Override
   protected void updateStoredToken(RMDelegationTokenIdentifier id,
       long renewDate) {
     try {
       LOG.info("updating RMDelegation token with sequence number: "
           + id.getSequenceNumber());
       rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id,
           renewDate, id.getSequenceNumber());
     } catch (Exception e) {
       LOG.error("Error in updating persisted RMDelegationToken with sequence number: "
           + id.getSequenceNumber());
       ExitUtil.terminate(1, e);
     }
   }
 {code}
 According to the code above, even when renewing a DT, the last sequence number 
 is updated in the store, which is wrong. For example, suppose we have the 
 following sequence:
 1. Get DT 1 (seq = 1)
 2. Get DT 2 (seq = 2)
 3. Renew DT 1 (seq = 1)
 4. Restart RM
 The stored, and then recovered, last sequence number is 1. As a result, the 
 next DT created after the RM restarts will conflict with DT 2 on sequence 
 number.
 Second, the aforementioned bug doesn't actually happen, because the recovered 
 last sequence number is overwritten by the correct one:
 {code}
   public void recover(RMState rmState) throws Exception {
     LOG.info("recovering RMDelegationTokenSecretManager.");
     // recover RMDTMasterKeys
     for (DelegationKey dtKey : rmState.getRMDTSecretManagerState()
         .getMasterKeyState()) {
       addKey(dtKey);
     }
     // recover RMDelegationTokens
     Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens =
         rmState.getRMDTSecretManagerState().getTokenState();
     this.delegationTokenSequenceNumber =
         rmState.getRMDTSecretManagerState().getDTSequenceNumber();
     for (Map.Entry<RMDelegationTokenIdentifier, Long> entry :
         rmDelegationTokens.entrySet()) {
       addPersistedDelegationToken(entry.getKey(), entry.getValue());
     }
   }
 {code}
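 
 A toy simulation of the four-step sequence above (class and method names are 
 hypothetical, not RMStateStore code) shows how the persisted counter regresses 
 on renew:
 {code}
 // Minimal model of the buggy sequence-number persistence; names are made up.
 import java.util.HashMap;
 import java.util.Map;

 class ToyDtStore {
   private final Map<Integer, Long> tokens = new HashMap<>(); // seq -> renewDate
   private int storedLastSeq;                                  // persisted counter

   void storeOrUpdate(int seq, long renewDate) {
     tokens.put(seq, renewDate);
     storedLastSeq = seq; // bug: also overwritten when renewing an old token
   }

   int nextSeqAfterRestart() {
     return storedLastSeq + 1; // what a restarted RM would hand out next
   }

   public static void main(String[] args) {
     ToyDtStore store = new ToyDtStore();
     store.storeOrUpdate(1, 100L); // 1. Get DT 1 (seq = 1) -> counter = 1
     store.storeOrUpdate(2, 100L); // 2. Get DT 2 (seq = 2) -> counter = 2
     store.storeOrUpdate(1, 200L); // 3. Renew DT 1 (seq = 1) -> counter = 1
     // 4. Restart RM: the next DT would get seq 2, colliding with DT 2.
     System.out.println("next seq after restart: " + store.nextSeqAfterRestart());
   }
 }
 {code}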
 

[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation

2015-01-03 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263608#comment-14263608
 ] 

Craig Welch commented on YARN-2637:
---

bq. I think there should at least one AM can be launched in each queue ... 
MockRM test config settings

That's been the case since switching to approach 2; some tests need to start > 
1 app in a queue ;) In any case, I've removed the MockRM test config settings; 
it's only needed in a few tests now, so I'm setting it in those tests directly
(done)

bq. -re maximumActiveApplications ... MAXIMUM_ACTIVE_APPLICATIONS_SUFFIX
I removed this new configuration point.  It is no longer possible to directly 
control how many apps start in a queue, since the AMs are not all the same 
size, so it's not possible to actually control that outside of testing (it 
was before, now it's not).  However, the cases I recall using it for were all 
to work around the fact that the max AM percent wasn't working properly, so 
hopefully this won't be missed
(done)

-re null checks in FiCaSchedulerApp constructor
So, the ResourceManager itself checks for null rmapps (ResourceManager.java, 
~line 830); this is a pre-existing case which is tolerated and I'm not going to 
address it.  The getAMResourceRequest() can also be null for unmanaged AMs.  
I've reduced the null checks for the app to just these two cases, but those 
checks should remain.
(partly done/remaining should stay as-is)

All the build quality checks and tests are passing; not sure why the overall is 
red, I think it's a build server issue...

 maximum-am-resource-percent could be violated when resource of AM is 
 > minimumAllocation
 

 Key: YARN-2637
 URL: https://issues.apache.org/jira/browse/YARN-2637
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, 
 YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, 
 YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, 
 YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, 
 YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, 
 YARN-2637.25.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch


 Currently, the number of AMs in a leaf queue is calculated in the following way:
 {code}
 max_am_resource = queue_max_capacity * maximum_am_resource_percent
 #max_am_number = max_am_resource / minimum_allocation
 #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
 {code}
 And when a new application is submitted to the RM, it checks whether the app 
 can be activated in the following way:
 {code}
 for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator();
      i.hasNext(); ) {
   FiCaSchedulerApp application = i.next();

   // Check queue limit
   if (getNumActiveApplications() >= getMaximumActiveApplications()) {
     break;
   }

   // Check user limit
   User user = getUser(application.getUser());
   if (user.getActiveApplications() <
       getMaximumActiveApplicationsPerUser()) {
     user.activateApplication();
     activeApplications.add(application);
     i.remove();
     LOG.info("Application " + application.getApplicationId() +
         " from user: " + application.getUser() +
         " activated in queue: " + getQueueName());
   }
 }
 {code}
 For example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, 
 the maximum resource that AMs may use is 200M. Assuming 
 minimum_allocation = 1M, up to 200 AMs can be launched; if the user actually 
 uses 5M for each AM (> minimum_allocation), all apps can still be activated, 
 and the AMs will occupy all resources of the queue instead of only 
 max_am_resource_percent of the queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2978) ResourceManager crashes with NPE while getting queue info

2015-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263646#comment-14263646
 ] 

Hadoop QA commented on YARN-2978:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689937/YARN-2978.002.patch
  against trunk revision 947578c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6237//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6237//console

This message is automatically generated.

 ResourceManager crashes with NPE while getting queue info
 -

 Key: YARN-2978
 URL: https://issues.apache.org/jira/browse/YARN-2978
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.1
Reporter: Jason Tufo
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-2978.001.patch, YARN-2978.002.patch


  java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625)
   at 
 org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2978) ResourceManager crashes with NPE while getting queue info

2015-01-03 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2978:
---
Attachment: YARN-2978.002.patch

 ResourceManager crashes with NPE while getting queue info
 -

 Key: YARN-2978
 URL: https://issues.apache.org/jira/browse/YARN-2978
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.1
Reporter: Jason Tufo
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-2978.001.patch, YARN-2978.002.patch


  java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625)
   at 
 org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290)
   at 
 org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2015-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263554#comment-14263554
 ] 

Hudson commented on YARN-2991:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #62 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/62/])
YARN-2991. Fixed DrainDispatcher to reuse the draining code path in 
AsyncDispatcher. Contributed by Rohith Sharmaks. (zjshen: rev 
947578c1c1413f9043ceb1e87df6a97df048e854)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java


 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker
 Fix For: 2.7.0

 Attachments: 0001-YARN-2991.patch, 0002-YARN-2991.patch


 {code}
 Error Message
 test timed out after 60000 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 60000 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at 
 org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2015-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263558#comment-14263558
 ] 

Hudson commented on YARN-2991:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2012 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2012/])
YARN-2991. Fixed DrainDispatcher to reuse the draining code path in 
AsyncDispatcher. Contributed by Rohith Sharmaks. (zjshen: rev 
947578c1c1413f9043ceb1e87df6a97df048e854)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java


 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker
 Fix For: 2.7.0

 Attachments: 0001-YARN-2991.patch, 0002-YARN-2991.patch


 {code}
 Error Message
 test timed out after 60000 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 60000 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at 
 org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-03 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263656#comment-14263656
 ] 

Jian He commented on YARN-2997:
---

I think we can simplify the logic in getContainerStatuses as such:
{code}
  if (containerStatus.getState() == ContainerState.COMPLETE) {
    if (!isContainerRecentlyStopped(containerId)) {
      addCompletedContainer(containerId);
      containerStatuses.add(containerStatus);
    }
  } else {
    containerStatuses.add(containerStatus);
  }
{code}
bq. I didn't see an equals method defined in the abstract class
the subclass has the equals method.
bq.  So I guess we have to keep it
that's a limitation of the test; we should fix the tests.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch


 We have seen in the RM log a lot of
 {quote}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {quote}
 It is caused by NM sending completed containers repeatedly until the app is 
 finished. On the RM side, the container is already released, hence 
 {{getRMContainer}} returns null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2919) Potential race between renew and cancel in DelegationTokenRenwer

2015-01-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263584#comment-14263584
 ] 

Naganarasimha G R commented on YARN-2919:
-

Hi [~kasha]
Can you please take a look at my previous 
[comment|https://issues.apache.org/jira/browse/YARN-2919?focusedCommentId=14258208&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14258208]; 
if the approach mentioned in it is not clear, I can provide a WIP patch, or 
please let me know if you have any other approach in mind.
 

 Potential race between renew and cancel in DelegationTokenRenwer 
 -

 Key: YARN-2919
 URL: https://issues.apache.org/jira/browse/YARN-2919
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Naganarasimha G R
Priority: Critical
 Attachments: YARN-2919.20141209-1.patch


 YARN-2874 fixes a deadlock in DelegationTokenRenewer, but there is still a 
 race because of which a renewal in flight isn't interrupted by a cancel. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2980) Move health check script related functionality to hadoop-common

2015-01-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263592#comment-14263592
 ] 

Varun Saxena commented on YARN-2980:


[~aw], kindly review

 Move health check script related functionality to hadoop-common
 ---

 Key: YARN-2980
 URL: https://issues.apache.org/jira/browse/YARN-2980
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ming Ma
Assignee: Varun Saxena
 Attachments: YARN-2980.001.patch, YARN-2980.002.patch


 HDFS might want to leverage health check functionality available in YARN in 
 both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode 
 https://issues.apache.org/jira/browse/HDFS-7441.
 We can move health check functionality including the protocol between hadoop 
 daemons and health check script to hadoop-common. That will simplify the 
 development and maintenance for both hadoop source code and health check 
 script.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2996) Refine some fs operations in FileSystemRMStateStore to improve performance

2015-01-03 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263743#comment-14263743
 ] 

Yi Liu commented on YARN-2996:
--

The 3 test failures are not related.

 Refine some fs operations in FileSystemRMStateStore to improve performance
 --

 Key: YARN-2996
 URL: https://issues.apache.org/jira/browse/YARN-2996
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Yi Liu
Assignee: Yi Liu
 Attachments: YARN-2996.001.patch


 In {{FileSystemRMStateStore}}, we can refine some fs operations to improve 
 performance:
 *1.* Several places invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can 
 merge them to save one RPC call (see the sketch after this description):
 {code}
 if (fs.exists(versionNodePath)) {
   FileStatus status = fs.getFileStatus(versionNodePath);
 {code}
 *2.*
 {code}
 protected void updateFile(Path outputPath, byte[] data) throws Exception {
   Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new");
   // use writeFile to make sure .new file is created atomically
   writeFile(newPath, data);
   replaceFile(newPath, outputPath);
 }
 {code}
 {{updateFile}} is not optimal either: it writes the file to _output\_file_.tmp, 
 then renames it to _output\_file_.new, then renames that to _output\_file_; we 
 can eliminate one rename operation.
 Also there is one unnecessary import; we can remove it.
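 
 As a sketch of point *1.* (assuming only the public Hadoop {{FileSystem}} API; 
 this is not the actual patch), the two calls can be merged by issuing 
 {{getFileStatus}} once and treating {{FileNotFoundException}} as "does not 
 exist":
 {code}
 import java.io.FileNotFoundException;
 import java.io.IOException;

 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 class StatusOnce {
   // One RPC instead of two: getFileStatus doubles as the existence check.
   static FileStatus statusOrNull(FileSystem fs, Path p) throws IOException {
     try {
       return fs.getFileStatus(p);
     } catch (FileNotFoundException e) {
       return null; // plays the role of fs.exists(p) == false
     }
   }
 }
 {code}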



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-03 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263747#comment-14263747
 ] 

Chengbing Liu commented on YARN-2997:
-

{quote}
I think we can simplify the logic in getContainerStatuses as such:
{quote}
It seems that if we do not remove the containers whose app has already stopped, 
we will rely on the heartbeat response from the RM to remove containers acked 
by the AM. If something goes wrong on the AM or RM side, the NM will never 
remove these containers from its context. So in my opinion, that could be a 
potential leak.

{quote}
the sub class has the equal method.
{quote}
Yes, you are right. However, I'm still not sure it is a good idea to use 
{{Set<ContainerStatus>}} instead of {{Map<ContainerId, ContainerStatus>}}, for 
the following reasons (see the sketch after this list):
* {{ContainerId}} is a unique identifier for a container, while 
{{ContainerStatus}} can change over time, even for the same container.
* We want to ensure no duplicate container status is reported to the RM. 
{{ContainerStatus}} has not only the containerId, but also the container state, 
exit status and diagnostic message, so we may run into a situation where we 
report two different {{ContainerStatus}} objects with the same ID but different 
states or other fields.
* {{ContainerId}} has an {{equals}} method and is annotated as public and 
stable, while {{ContainerStatus}} has no {{equals}} method and 
{{ContainerStatusPBImpl}} is annotated as private and unstable. It may not be a 
good idea to rely on the implementation of {{ContainerStatus}}.
* The use of {{Set<ContainerStatus>}} never appears in the current code base.
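
A minimal sketch of the map-keyed alternative (class and method names are 
illustrative, not the actual NM patch):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

class PendingCompleted {
  private final Map<ContainerId, ContainerStatus> pending =
      new ConcurrentHashMap<>();

  // A later status for the same ContainerId replaces the earlier one, so two
  // statuses with the same ID but different state/diagnostics can never both
  // be queued; a Set<ContainerStatus> could not guarantee that.
  void add(ContainerStatus status) {
    pending.put(status.getContainerId(), status);
  }
}
{code}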

{quote}
that's limitation of the test, we should fix the tests.
{quote}
Yes, I see. I will fix them.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.patch


 We have seen in the RM log a lot of
 {quote}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {quote}
 It is caused by NM sending completed containers repeatedly until the app is 
 finished. On the RM side, the container is already released, hence 
 {{getRMContainer}} returns null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2997) NM keeps sending finished containers to RM until app is finished

2015-01-03 Thread Chengbing Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated YARN-2997:

Attachment: YARN-2997.3.patch

Updated patch:
* fix potential pendingContainersToRemove leak.
* remove unnecessary {{pendingCompletedContainers.clear();}} and add 
clearPendingCompletedContainers() for testing purposes only.
* Add comments for modified tests.
* Switch the order of {{assertEquals}} arguments. The expected value should 
come first to prevent confusion.

 NM keeps sending finished containers to RM until app is finished
 

 Key: YARN-2997
 URL: https://issues.apache.org/jira/browse/YARN-2997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch


 We have seen in the RM log a lot of
 {quote}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {quote}
 It is caused by NM sending completed containers repeatedly until the app is 
 finished. On the RM side, the container is already released, hence 
 {{getRMContainer}} returns null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)