[jira] [Commented] (YARN-1419) TestFifoScheduler.testAppAttemptMetrics fails intermittently under jdk7
[ https://issues.apache.org/jira/browse/YARN-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825651#comment-13825651 ] Jason Lowe commented on YARN-1419: -- +1, lgtm. Committing this. TestFifoScheduler.testAppAttemptMetrics fails intermittently under jdk7 Key: YARN-1419 URL: https://issues.apache.org/jira/browse/YARN-1419 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 3.0.0, 2.3.0, 0.23.10 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Minor Labels: java7 Attachments: YARN-1419.patch, YARN-1419.patch QueueMetrics holds its data in a static variable, causing metrics to bleed over from test to test. clearQueueMetrics is to be called for tests that need to measure metrics correctly for a single test. jdk7 comes into play since tests are run out of order, and in that case the metrics become unreliable. -- This message was sent by Atlassian JIRA (v6.1#6144)
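The static-state bleed can be illustrated with a minimal, self-contained sketch; FakeQueueMetrics and its clearQueueMetrics are hypothetical stand-ins for the real QueueMetrics API, not the actual Hadoop code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for QueueMetrics: the registry lives in a
// static field, so counts survive from one test instance to the next.
class FakeQueueMetrics {
    private static final Map<String, Integer> REGISTRY = new HashMap<>();

    static void incrAppsSubmitted(String queue) {
        REGISTRY.merge(queue, 1, Integer::sum);
    }

    static int getAppsSubmitted(String queue) {
        return REGISTRY.getOrDefault(queue, 0);
    }

    // Analogue of clearQueueMetrics(): tests that assert on absolute
    // counts must call this first, or earlier tests bleed into them.
    static void clearQueueMetrics() {
        REGISTRY.clear();
    }
}

class MetricsBleedDemo {
    public static void main(String[] args) {
        // "Test 1" submits an app and bumps the static counter.
        FakeQueueMetrics.incrAppsSubmitted("default");

        // Under JDK7 the test execution order is not guaranteed, so
        // "test 2" may observe test 1's leftover count unless it clears first.
        FakeQueueMetrics.clearQueueMetrics();
        FakeQueueMetrics.incrAppsSubmitted("default");
        System.out.println(FakeQueueMetrics.getAppsSubmitted("default")); // prints 1, not 2
    }
}
```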
[jira] [Commented] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable
[ https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825677#comment-13825677 ] Hadoop QA commented on YARN-713: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614185/YARN-713.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2474//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2474//console This message is automatically generated. 
ResourceManager can exit unexpectedly if DNS is unavailable --- Key: YARN-713 URL: https://issues.apache.org/jira/browse/YARN-713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Critical Fix For: 2.3.0 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, YARN-713.1.patch, YARN-713.2.patch, YARN-713.20130910.1.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and that ultimately would cause the RM to exit. The RM should not exit during DNS hiccups. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart
Omkar Vinit Joshi created YARN-1421: --- Summary: Node managers will not receive application finish event where containers ran before RM restart Key: YARN-1421 URL: https://issues.apache.org/jira/browse/YARN-1421 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Critical Problem :- Today for every application we track the node managers where containers ran. So when an application finishes it notifies all those node managers about the application finish event (via node manager heartbeat). However if the RM restarts then we forget this past information, and those node managers will never get the application finish event and will keep reporting finished applications. Proposed Solution :- Instead of remembering the node managers where containers ran for this particular application, it would be better if we depend on the node manager heartbeat to take this decision. i.e. when a node manager heartbeats saying it is running applications (app1, app2), then we should check those applications' status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED), then we should immediately notify the node manager about the application finish event. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-584) In fair scheduler web UI, queues unexpand on refresh
[ https://issues.apache.org/jira/browse/YARN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harshit Daga updated YARN-584: -- Attachment: YARN-584-branch-2.2.0.patch Updated patch with: - indentation and spacing after brackets, using an already present class as a reference - renaming of methods / inner class name In fair scheduler web UI, queues unexpand on refresh Key: YARN-584 URL: https://issues.apache.org/jira/browse/YARN-584 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Labels: newbie Attachments: YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch In the fair scheduler web UI, you can expand queue information. Refreshing the page causes the expansions to go away, which is annoying for someone who wants to monitor the scheduler page and needs to reopen all the queues they care about each time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart
[ https://issues.apache.org/jira/browse/YARN-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1421: Description: Problem :- Today for every application we track the node managers where containers ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Proposed Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should check those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. By doing this we are reducing the state which we need to store at RM after restart. was: Problem :- Today for every application we track the node managers where container ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Propose Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. 
when node manager heartbeats saying it is running application (app1, app2) then we should those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. Node managers will not receive application finish event where containers ran before RM restart -- Key: YARN-1421 URL: https://issues.apache.org/jira/browse/YARN-1421 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Critical Problem :- Today for every application we track the node managers where containers ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Proposed Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should check those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. By doing this we are reducing the state which we need to store at RM after restart. -- This message was sent by Atlassian JIRA (v6.1#6144)
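The proposed heartbeat-driven check can be sketched as follows; RmAppState and HeartbeatFinishCheck are illustrative names for this sketch, not the real rmContext.getRMApps() API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative application states mirroring the final states named above.
enum RmAppState { RUNNING, FINISHED, KILLED, FAILED }

// Hypothetical stand-in for the RM-side lookup done via rmContext.getRMApps().
class HeartbeatFinishCheck {
    private final Map<String, RmAppState> rmApps = new HashMap<>();

    HeartbeatFinishCheck(Map<String, RmAppState> apps) {
        rmApps.putAll(apps);
    }

    private static boolean isFinal(RmAppState s) {
        return s == RmAppState.FINISHED || s == RmAppState.KILLED || s == RmAppState.FAILED;
    }

    // Given the app ids a node manager's heartbeat reports as running,
    // return the ids the RM should answer with an application-finish
    // event: apps unknown to the RM (very old) or already in a final state.
    List<String> appsToFinish(Collection<String> reportedRunning) {
        List<String> finish = new ArrayList<>();
        for (String appId : reportedRunning) {
            RmAppState s = rmApps.get(appId);
            if (s == null || isFinal(s)) {
                finish.add(appId);
            }
        }
        return finish;
    }
}
```

This keeps the decision stateless with respect to restart: the RM only needs its current app table, not a persisted record of which node managers ran containers.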
[jira] [Commented] (YARN-1312) Job History server queue attribute incorrectly reports default when username is actually used for queue at runtime
[ https://issues.apache.org/jira/browse/YARN-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825746#comment-13825746 ] Harshit Daga commented on YARN-1312: Hi Philip, I would like to fix this issue. I have tried to reproduce it and also get the queue name as default (as you mentioned). Can you give me a starting point (in the code) for fixing the issue? Job History server queue attribute incorrectly reports default when username is actually used for queue at runtime Key: YARN-1312 URL: https://issues.apache.org/jira/browse/YARN-1312 Project: Hadoop YARN Issue Type: Bug Reporter: Philip Zeyliger If you run a MapReduce job with the fair scheduler and you query the JobHistory server for its metadata, you might see something like the following at http://jh_host:19888/ws/v1/history/mapreduce/jobs/job_1381878638171_0001/ {code}
<job>
  <startTime>1381890132608</startTime>
  <finishTime>1381890141988</finishTime>
  <id>job_1381878638171_0001</id>
  <name>TeraGen</name>
  <queue>default</queue>
  <user>hdfs</user>
  ...
</job>
{code} The same is true if you query the RM while it's running via http://rm_host:8088/ws/v1/cluster/apps/application_1381878638171_0002: {code}
<app>
  <id>application_1381878638171_0002</id>
  <user>hdfs</user>
  <name>TeraGen</name>
  <queue>default</queue>
  ...
</app>
{code} As it turns out, in both of these cases, the job is actually executing in root.hdfs and not in root.default because {{yarn.scheduler.fair.user-as-default-queue}} is set to true. This makes it hard to figure out after the fact (or during!) what queue the MR job was running under. -- This message was sent by Atlassian JIRA (v6.1#6144)
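The behavior being reported can be paraphrased with a small sketch; this is a hypothetical simplification of fair-scheduler queue placement under yarn.scheduler.fair.user-as-default-queue=true, not the real scheduler code:

```java
// Hypothetical simplification of fair-scheduler queue placement with
// yarn.scheduler.fair.user-as-default-queue=true: an app submitted to
// "default" actually lands in a per-user queue (e.g. root.hdfs), while
// the RM/JobHistory metadata still reports the submitted "default".
class QueuePlacementSketch {
    static String resolveQueue(String requestedQueue, String user,
                               boolean userAsDefaultQueue) {
        boolean isDefault = requestedQueue == null || requestedQueue.equals("default");
        if (userAsDefaultQueue && isDefault) {
            return "root." + user; // runtime queue diverges from the reported one
        }
        return isDefault ? "root.default" : requestedQueue;
    }
}
```

A fix would presumably need the RM/JobHistory metadata to record the resolved queue rather than the submitted one.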
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.8.patch Thanks [~vinodkv] for pointing it out; I didn't understand it earlier. Adding a synchronized block to the service state change. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
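The general remedy, moving the possibly-blocking renewal off the RPC handler thread onto a background pool, can be sketched as below. This is illustrative only; the class and method names are hypothetical, not the actual RM implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: if token renewal runs synchronously on the RPC
// handler thread, a slow or down NameNode pins that handler; handing
// the renewal to a background executor keeps the handlers free to
// serve other clients.
class AsyncRenewalSketch {
    private final ExecutorService renewerPool = Executors.newFixedThreadPool(4);

    // Stand-in for the slow renewal call (a NameNode/DNS round trip
    // that can block for a long time in the failure scenario above).
    private static void renewToken(String appId) throws InterruptedException {
        Thread.sleep(10);
    }

    // App submission returns immediately; the renewal completes later.
    Future<?> submitApplication(String appId) {
        return renewerPool.submit(() -> {
            try {
                renewToken(appId);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Drain the pool; returns true once all pending renewals have finished.
    boolean shutdownAndAwait() {
        renewerPool.shutdown();
        try {
            return renewerPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```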
[jira] [Commented] (YARN-584) In fair scheduler web UI, queues unexpand on refresh
[ https://issues.apache.org/jira/browse/YARN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825776#comment-13825776 ] Hadoop QA commented on YARN-584: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614477/YARN-584-branch-2.2.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2475//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2475//console This message is automatically generated. 
In fair scheduler web UI, queues unexpand on refresh Key: YARN-584 URL: https://issues.apache.org/jira/browse/YARN-584 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Labels: newbie Attachments: YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch In the fair scheduler web UI, you can expand queue information. Refreshing the page causes the expansions to go away, which is annoying for someone who wants to monitor the scheduler page and needs to reopen all the queues they care about each time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825780#comment-13825780 ] Hadoop QA commented on YARN-674: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614481/YARN-674.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2476//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2476//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2476//console This message is automatically generated. 
Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-584) In fair scheduler web UI, queues unexpand on refresh
[ https://issues.apache.org/jira/browse/YARN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825804#comment-13825804 ] Sandy Ryza commented on YARN-584: - +1 In fair scheduler web UI, queues unexpand on refresh Key: YARN-584 URL: https://issues.apache.org/jira/browse/YARN-584 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Labels: newbie Attachments: YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch In the fair scheduler web UI, you can expand queue information. Refreshing the page causes the expansions to go away, which is annoying for someone who wants to monitor the scheduler page and needs to reopen all the queues they care about each time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-584) In fair scheduler web UI, queues unexpand on refresh
[ https://issues.apache.org/jira/browse/YARN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-584: Assignee: Harshit Daga In fair scheduler web UI, queues unexpand on refresh Key: YARN-584 URL: https://issues.apache.org/jira/browse/YARN-584 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Harshit Daga Labels: newbie Attachments: YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch, YARN-584-branch-2.2.0.patch In the fair scheduler web UI, you can expand queue information. Refreshing the page causes the expansions to go away, which is annoying for someone who wants to monitor the scheduler page and needs to reopen all the queues they care about each time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.7.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch, YARN-1210.7.patch When the RM recovers, it can wait for existing AMs to contact the RM back and then kill them forcefully before even starting a new AM. Worst case, the RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the meantime, new apps will proceed as usual while existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with the RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1422) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing
Adam Kawa created YARN-1422: --- Summary: RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing Key: YARN-1422 URL: https://issues.apache.org/jira/browse/YARN-1422 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.2.0 Reporter: Adam Kawa If getQueueUserAclInfo() on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) is called while a container is completing, the ResourceManager can deadlock. It is similar to https://issues.apache.org/jira/browse/YARN-325. *More details:* * Thread A 1) In a synchronized block of code (a lockid 0xc18d8870=LeafQueue.class), LeafQueue.completedContainer wants to inform the parent queue that a container is being completed and invokes the ParentQueue.completedContainer method. 3) ParentQueue.completedContainer waits to acquire a lock on itself (a lockid 0xc1846350=ParentQueue.class) to enter a synchronized block of code. It cannot acquire this lock, because Thread B already holds it. * Thread B 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This method invokes a synchronized method on ParentQueue.class, i.e. ParentQueue.getQueueUserAclInfo (a lockid 0xc1846350=ParentQueue.class), and acquires the lock that Thread A will be waiting for. 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the children queues' ACLs and wants to run a synchronized method, LeafQueue.getQueueUserAclInfo, but it does not have a lock on LeafQueue.class (a lockid 0xc18d8870=LeafQueue.class). This lock is already held by LeafQueue.completedContainer in Thread A. The order that causes the deadlock: B0 - A1 - B2 - A3. 
*Java Stacktrace* {code}
Found one Java-level deadlock:
=============================
"1956747953@qtp-109760451-1959":
  waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by "IPC Server handler 39 on 8032"
"IPC Server handler 39 on 8032":
  waiting to lock monitor 0x422bbc58 (object 0xc18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
  which is held by "ResourceManager Event Processor"
"ResourceManager Event Processor":
  waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by "IPC Server handler 39 on 8032"

Java stack information for the threads listed above:
===================================================
"1956747953@qtp-109760451-1959":
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
  - waiting to lock <0xc1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
  at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
  at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
  at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
  at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
  at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
  at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
  at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
  at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
  at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
  at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
  at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
  at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
  at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
  at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
  at
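The lock cycle above boils down to two code paths taking the same two monitors in opposite orders. Below is a minimal, self-contained model of it (class names are illustrative stand-ins, not the real scheduler code), together with the classic remedy of agreeing on one global lock order:

```java
import java.util.Arrays;
import java.util.List;

// Minimal model of the YARN-1422 cycle: completedContainer locks the
// leaf queue then the parent, while getQueueUserAclInfo locks the
// parent then the leaf. With unlucky timing (B0 - A1 - B2 - A3) each
// thread holds one monitor and waits forever on the other.
class LockOrderDemo {
    static final Object parentLock = new Object(); // ~ ParentQueue monitor
    static final Object leafLock = new Object();   // ~ LeafQueue monitor
    static int parentQueueUpdates = 0;

    // Thread A's path: leaf first, then parent.
    static void completedContainer() {
        synchronized (leafLock) {
            synchronized (parentLock) {
                parentQueueUpdates++; // propagate the completion upward
            }
        }
    }

    // Deadlock-prone version of Thread B's path: parent first, then
    // leaf. Running this concurrently with completedContainer can hang.
    static List<String> getQueueUserAclInfoUnsafe() {
        synchronized (parentLock) {
            synchronized (leafLock) {
                return Arrays.asList("SUBMIT_APPLICATIONS", "ADMINISTER_QUEUE");
            }
        }
    }

    // One classic remedy: every path acquires the monitors in the same
    // global order (leaf, then parent), so no wait cycle can form.
    static List<String> getQueueUserAclInfoSafe() {
        synchronized (leafLock) {
            synchronized (parentLock) {
                return Arrays.asList("SUBMIT_APPLICATIONS", "ADMINISTER_QUEUE");
            }
        }
    }
}
```

Whether a consistent lock order, or avoiding the nested locking altogether, is the right fix for the real CapacityScheduler is a design question for the patch; the sketch only demonstrates the cycle.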
[jira] [Updated] (YARN-1422) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing
[ https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Kawa updated YARN-1422: Priority: Critical (was: Major) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing Key: YARN-1422 URL: https://issues.apache.org/jira/browse/YARN-1422 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.2.0 Reporter: Adam Kawa Priority: Critical If getQueueUserAclInfo() on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) is called while a container is completing, the ResourceManager can deadlock. It is similar to https://issues.apache.org/jira/browse/YARN-325. *More details:* * Thread A 1) In a synchronized block of code (a lockid 0xc18d8870=LeafQueue.class), LeafQueue.completedContainer wants to inform the parent queue that a container is being completed and invokes the ParentQueue.completedContainer method. 3) ParentQueue.completedContainer waits to acquire a lock on itself (a lockid 0xc1846350=ParentQueue.class) to enter a synchronized block of code. It cannot acquire this lock, because Thread B already holds it. * Thread B 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This method invokes a synchronized method on ParentQueue.class, i.e. ParentQueue.getQueueUserAclInfo (a lockid 0xc1846350=ParentQueue.class), and acquires the lock that Thread A will be waiting for. 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the children queues' ACLs and wants to run a synchronized method, LeafQueue.getQueueUserAclInfo, but it does not have a lock on LeafQueue.class (a lockid 0xc18d8870=LeafQueue.class). This lock is already held by LeafQueue.completedContainer in Thread A. The order that causes the deadlock: B0 - A1 - B2 - A3. 
*Java Stacktrace* {code}
Found one Java-level deadlock:
=============================
"1956747953@qtp-109760451-1959":
  waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by "IPC Server handler 39 on 8032"
"IPC Server handler 39 on 8032":
  waiting to lock monitor 0x422bbc58 (object 0xc18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
  which is held by "ResourceManager Event Processor"
"ResourceManager Event Processor":
  waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by "IPC Server handler 39 on 8032"

Java stack information for the threads listed above:
===================================================
"1956747953@qtp-109760451-1959":
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
  - waiting to lock <0xc1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
  at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
  at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
  at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
  at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
  at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
  at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
  at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
  at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
  at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
  at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
  at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
  at
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825965#comment-13825965 ] Hadoop QA commented on YARN-1210: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614492/YARN-1210.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2477//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2477//console This message is automatically generated. 
During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch, YARN-1210.7.patch When the RM recovers, it can wait for existing AMs to contact the RM back and then kill them forcefully before even starting a new AM. Worst case, the RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the meantime, new apps will proceed as usual while existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with the RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.9.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1266) inheriting Application client and History Protocol from base protocol and implement PB service and clients.
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825976#comment-13825976 ] Mayank Bansal commented on YARN-1266: - [~zjshen] thanks for the review. bq. IMHO, application_base_protocol.proto should not be necessary, because the base interface is to extract the common code, not to be directly used from the RPC interface. We need it, as the service impl needs it. bq. 2. ApplicationClientProtocolPB and ApplicationHistoryProtocolPB don't need to extend ApplicationBaseProtocolService.BlockingInterface Done. Thanks, Mayank inheriting Application client and History Protocol from base protocol and implement PB service and clients. --- Key: YARN-1266 URL: https://issues.apache.org/jira/browse/YARN-1266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1266-1.patch, YARN-1266-2.patch, YARN-1266-3.patch, YARN-1266-4.patch Adding ApplicationHistoryProtocolPBService to make web apps work and changing yarn to run AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1266) inheriting Application client and History Protocol from base protocol and implement PB service and clients.
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1266: Attachment: YARN-1266-4.patch Attaching the latest patch. Thanks, Mayank inheriting Application client and History Protocol from base protocol and implement PB service and clients. --- Key: YARN-1266 URL: https://issues.apache.org/jira/browse/YARN-1266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1266-1.patch, YARN-1266-2.patch, YARN-1266-3.patch, YARN-1266-4.patch Adding ApplicationHistoryProtocolPBService to make web apps work and changing yarn to run AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1422) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing
[ https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825988#comment-13825988 ] Omkar Vinit Joshi commented on YARN-1422: - Yes, this looks to be a problem. Check this [synchronization locking problem | https://issues.apache.org/jira/browse/YARN-897?focusedCommentId=13706284page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13706284]. The ordering should always be from root to leaf queue. I think there can be other places too where this ordering is mixed. RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing Key: YARN-1422 URL: https://issues.apache.org/jira/browse/YARN-1422 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.2.0 Reporter: Adam Kawa Priority: Critical If getQueueUserAclInfo() on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) is called, and a container is completing, then the ResourceManager can deadlock. It is similar to https://issues.apache.org/jira/browse/YARN-325. *More details:*
* Thread A
1) In a synchronized block of code (a lockid 0xc18d8870=LeafQueue.class), LeafQueue.completedContainer wants to inform the parent queue that a container is being completed and invokes the ParentQueue.completedContainer method.
3) ParentQueue.completedContainer waits to acquire a lock on itself (a lockid 0xc1846350=ParentQueue.class) to enter a synchronized block of code. It cannot acquire this lock, because Thread B already holds it.
* Thread B
0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This method invokes a synchronized method on ParentQueue.class, i.e. ParentQueue.getQueueUserAclInfo (a lockid 0xc1846350=ParentQueue.class), and acquires the lock that Thread A will be waiting for.
2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the children queues' ACLs and wants to run a synchronized method, LeafQueue.getQueueUserAclInfo, but it does not have a lock on LeafQueue.class (a lockid 0xc18d8870=LeafQueue.class). This lock is already held by LeafQueue.completedContainer in Thread A. The order that causes the deadlock: B0 - A1 - B2 - A3. *Java Stacktrace*
{code}
Found one Java-level deadlock:
==============================
1956747953@qtp-109760451-1959:
  waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by IPC Server handler 39 on 8032
IPC Server handler 39 on 8032:
  waiting to lock monitor 0x422bbc58 (object 0xc18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
  which is held by ResourceManager Event Processor
ResourceManager Event Processor:
  waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by IPC Server handler 39 on 8032

Java stack information for the threads listed above:
====================================================
1956747953@qtp-109760451-1959:
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
  - waiting to lock 0xc1846350 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.init(CapacitySchedulerInfo.java:49)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
  at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
  at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
  at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
  at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
  at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
  at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
  at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
  at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
  at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
  at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
  at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
  at
{code}
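The root-to-leaf ordering fix suggested in the comment above can be sketched as a toy example. The lock fields and method bodies below are illustrative stand-ins, not the actual CapacityScheduler code: the point is only that once both code paths acquire the parent-queue monitor before the leaf-queue monitor, the B0 - A1 - B2 - A3 interleaving can no longer deadlock.

```java
import java.util.concurrent.TimeUnit;

public class LockOrderDemo {
    static final Object parentQueueLock = new Object(); // stands in for the ParentQueue monitor
    static final Object leafQueueLock = new Object();   // stands in for the LeafQueue monitor

    // Thread A's completedContainer path, rewritten to lock root-to-leaf.
    static void completedContainer() {
        synchronized (parentQueueLock) {
            synchronized (leafQueueLock) {
                // update queue state here
            }
        }
    }

    // Thread B's getQueueUserAclInfo path: parent first, then each leaf.
    static void getQueueUserAclInfo() {
        synchronized (parentQueueLock) {
            synchronized (leafQueueLock) {
                // collect child-queue ACLs here
            }
        }
    }

    // Runs both paths concurrently; returns true iff both finish (no deadlock).
    static boolean runBoth() throws InterruptedException {
        Thread a = new Thread(LockOrderDemo::completedContainer);
        Thread b = new Thread(LockOrderDemo::getQueueUserAclInfo);
        a.start();
        b.start();
        a.join(TimeUnit.SECONDS.toMillis(5));
        b.join(TimeUnit.SECONDS.toMillis(5));
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBoth()); // prints true
    }
}
```

With the original mixed ordering (one path leaf-then-parent, the other parent-then-leaf), the same two threads can each grab their first monitor and block forever on the second, which is exactly the cycle the thread dump shows.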
[jira] [Commented] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.
[ https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825989#comment-13825989 ] Bikas Saha commented on YARN-744: - Better name?
{code}
+AllocateResponseLock res = responseMap.get(applicationAttemptId);
{code}
reuse throwApplicationAttemptDoesNotExistInCacheException() in registerApplicationMaster()? use InvalidApplicationMasterRequestException or a new specific exception instead of generic RPCUtil.throwRemoteException()?
{code}
+  private void throwApplicationAttemptDoesNotExistInCacheException(
+      ApplicationAttemptId appAttemptId) throws YarnException {
+    String message = "Application doesn't exist in cache "
+        + appAttemptId;
+    LOG.error(message);
+    throw RPCUtil.getRemoteException(message);
+  }
{code}
The new logic is not the same as the old one. If the app is no longer in the cache then it would send a resync response. Now it will send a regular response instead of a resync response.
{code}
-    // before returning response, verify in sync
-    AllocateResponse oldResponse =
-        responseMap.put(appAttemptId, allocateResponse);
-    if (oldResponse == null) {
-      // appAttempt got unregistered, remove it back out
-      responseMap.remove(appAttemptId);
-      String message = "App Attempt removed from the cache during allocate "
-          + appAttemptId;
-      LOG.error(message);
-      return resync;
-    }
-
+    res.setAllocateResponse(allocateResponse);
     return allocateResponse;
{code}
Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. - Key: YARN-744 URL: https://issues.apache.org/jira/browse/YARN-744 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Priority: Minor Attachments: MAPREDUCE-3899-branch-0.23.patch, YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.patch Looks like the lock taken in this is broken.
It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object. -- This message was sent by Atlassian JIRA (v6.1#6144)
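The per-attempt lock pattern the review converges on can be sketched roughly as follows. This is a hypothetical, stripped-down model: String stands in for ApplicationAttemptId and Object for AllocateResponse, and only the AllocateResponseLock holder mirrors the patch snippet. The key property is that the lock object stays in the map for the attempt's lifetime, so two concurrent allocate() calls for the same attempt always contend on the same monitor, instead of on a response object that gets swapped out from under them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AllocateLockDemo {
    // Holder whose identity is stable per attempt; the response inside changes.
    static class AllocateResponseLock {
        private Object response;
        synchronized Object getAllocateResponse() { return response; }
        synchronized void setAllocateResponse(Object r) { response = r; }
    }

    static final Map<String, AllocateResponseLock> responseMap = new ConcurrentHashMap<>();

    static Object allocate(String appAttemptId, Object newResponse) {
        AllocateResponseLock lock = responseMap.get(appAttemptId);
        if (lock == null) {
            return "resync"; // attempt unknown or already unregistered
        }
        synchronized (lock) {
            // critical section: at most one in-flight allocate per attempt
            lock.setAllocateResponse(newResponse);
            return newResponse;
        }
    }

    public static void main(String[] args) {
        responseMap.put("appattempt_1", new AllocateResponseLock());
        System.out.println(allocate("appattempt_1", "response-1")); // response-1
        System.out.println(allocate("appattempt_2", "response-2")); // resync
    }
}
```

Locking on the map value's stable wrapper (or, equivalently, on the ApplicationAttemptId key as the description suggests) closes the window where a second thread sees a freshly inserted lastResponse and enters the critical section concurrently.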
[jira] [Commented] (YARN-1266) inheriting Application client and History Protocol from base protocol and implement PB service and clients.
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826005#comment-13826005 ] Hadoop QA commented on YARN-1266: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614515/YARN-1266-4.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2478//console This message is automatically generated. inheriting Application client and History Protocol from base protocol and implement PB service and clients. --- Key: YARN-1266 URL: https://issues.apache.org/jira/browse/YARN-1266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1266-1.patch, YARN-1266-2.patch, YARN-1266-3.patch, YARN-1266-4.patch Adding ApplicationHistoryProtocolPBService to make web apps work and changing yarn to run AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1403) Separate out configuration loading from QueueManager in the Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1403: - Attachment: YARN-1403-2.patch Separate out configuration loading from QueueManager in the Fair Scheduler -- Key: YARN-1403 URL: https://issues.apache.org/jira/browse/YARN-1403 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1403-1.patch, YARN-1403-2.patch, YARN-1403.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826014#comment-13826014 ] Mayank Bansal commented on YARN-955: Thanks [~zjshen] for review. bq. Please add the corresponding configs in yarn-default.xml as well. Done Thanks, Mayank [YARN-321] Implementation of ApplicationHistoryProtocol --- Key: YARN-955 URL: https://issues.apache.org/jira/browse/YARN-955 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-955-1.patch, YARN-955-2.patch, YARN-955-3.patch, YARN-955-4.patch, YARN-955-5.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-955: --- Attachment: YARN-955-5.patch Attaching latest patch. Thanks, Mayank [YARN-321] Implementation of ApplicationHistoryProtocol --- Key: YARN-955 URL: https://issues.apache.org/jira/browse/YARN-955 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-955-1.patch, YARN-955-2.patch, YARN-955-3.patch, YARN-955-4.patch, YARN-955-5.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826037#comment-13826037 ] Hadoop QA commented on YARN-674: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614512/YARN-674.9.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2479//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2479//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2479//console This message is automatically generated. 
Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-709) verify that new jobs submitted with old RM delegation tokens after RM restart are accepted
[ https://issues.apache.org/jira/browse/YARN-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826039#comment-13826039 ] Hudson commented on YARN-709: - SUCCESS: Integrated in Hadoop-trunk-Commit #4754 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4754/]) YARN-709. Added tests to verify validity of delegation tokens and logging of appsummary after RM restart. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1543269) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java verify that new jobs submitted with old RM delegation tokens after RM restart are accepted -- Key: YARN-709 URL: https://issues.apache.org/jira/browse/YARN-709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Fix For: 2.3.0 Attachments: YARN-709.1.patch More elaborate test for restoring RM delegation tokens on RM restart. New jobs with old RM delegation tokens should be accepted by new RM as long as the token is still valid -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-754) Allow for black-listing resources in FS
[ https://issues.apache.org/jira/browse/YARN-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-754. - Resolution: Duplicate Closing as duplicate of YARN-1333 Allow for black-listing resources in FS --- Key: YARN-754 URL: https://issues.apache.org/jira/browse/YARN-754 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-384) add virtual cores info to the queue metrics
[ https://issues.apache.org/jira/browse/YARN-384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-384. - Resolution: Duplicate Closing as duplicate of YARN-598 add virtual cores info to the queue metrics --- Key: YARN-384 URL: https://issues.apache.org/jira/browse/YARN-384 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Now that we have cores as a resource in the scheduler we should add metrics so we can use usage - allocated, requested, whatever else might apply. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826055#comment-13826055 ] Omkar Vinit Joshi commented on YARN-674: [~bikassaha] I completely missed your comment. What you are saying will not occur. {code} pool.allowCoreThreadTimeOut(true); {code} This should time out core threads if there are any lying around. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
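The behavior the comment relies on can be checked with a small, self-contained sketch (the pool sizes and timeouts here are illustrative, not the RM's actual renewal-pool configuration): with allowCoreThreadTimeOut(true), java.util.concurrent.ThreadPoolExecutor reclaims even core threads after the keep-alive interval once the pool goes idle, so no renewal threads are left lying around.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CoreTimeoutDemo {
    static boolean coreThreadsTimeOut() throws Exception {
        // 5 core threads, 100 ms keep-alive
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            5, 5, 100, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);   // the call from the comment above
        pool.submit(() -> { }).get();        // force at least one thread to start
        int busy = pool.getPoolSize();       // >= 1 while the pool is warm
        Thread.sleep(1000);                  // wait well past the keep-alive
        int idle = pool.getPoolSize();       // idle core threads reclaimed
        pool.shutdown();
        return busy >= 1 && idle == 0;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(coreThreadsTimeOut());
    }
}
```

Without allowCoreThreadTimeOut(true), the same pool would keep its core threads alive indefinitely even when idle, which is the scenario Bikas was worried about.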
[jira] [Commented] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826062#comment-13826062 ] Hadoop QA commented on YARN-955: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614525/YARN-955-5.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2480//console This message is automatically generated. [YARN-321] Implementation of ApplicationHistoryProtocol --- Key: YARN-955 URL: https://issues.apache.org/jira/browse/YARN-955 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-955-1.patch, YARN-955-2.patch, YARN-955-3.patch, YARN-955-4.patch, YARN-955-5.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826066#comment-13826066 ] Omkar Vinit Joshi commented on YARN-674: I think we should just ignore the findbugs warning; it is never going to occur. Plus, TestRMRestart is passing locally; there must be some race condition here that is not related to this patch. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1403) Separate out configuration loading from QueueManager in the Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826078#comment-13826078 ] Hadoop QA commented on YARN-1403: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614522/YARN-1403-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2481//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2481//console This message is automatically generated. Separate out configuration loading from QueueManager in the Fair Scheduler -- Key: YARN-1403 URL: https://issues.apache.org/jira/browse/YARN-1403 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1403-1.patch, YARN-1403-2.patch, YARN-1403.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-674: - Attachment: YARN-674.10.patch +1 for the latest patch, save for the findbugs issue. Trying to fix it myself. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.10.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826193#comment-13826193 ] Zhijie Shen commented on YARN-955: -- +1, LGTM [YARN-321] Implementation of ApplicationHistoryProtocol --- Key: YARN-955 URL: https://issues.apache.org/jira/browse/YARN-955 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-955-1.patch, YARN-955-2.patch, YARN-955-3.patch, YARN-955-4.patch, YARN-955-5.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826196#comment-13826196 ] Hadoop QA commented on YARN-674: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614545/YARN-674.10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2482//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2482//console This message is automatically generated. 
Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.10.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826208#comment-13826208 ] Vinod Kumar Vavilapalli commented on YARN-1210: --- +1, looks good. Checking this in. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch, YARN-1210.7.patch When the RM recovers, it can wait for existing AMs to contact the RM back and then kill them forcefully before even starting a new AM. Worst case, the RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. Meanwhile, new apps will proceed as usual while existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with the RM can continue to run, and those that don't are guaranteed to be killed before a new attempt is started. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1266) inheriting Application client and History Protocol from base protocol and implement PB service and clients.
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826220#comment-13826220 ] Zhijie Shen commented on YARN-1266: --- 1. Let's mark getApplicationReport/getApplications stable, though they are moved to the base protocol. What do you think? 2. In ApplicationBaseProtocol, please do not mention the history server. 3. Where's ApplicationBaseProtocolPBClientImpl? 4. Should you modify ApplicationClientProtocolPBClientImpl as well? 5. ApplicationHistoryProtocolPBClientImpl should extend ApplicationBaseProtocolPBClientImpl. inheriting Application client and History Protocol from base protocol and implement PB service and clients. --- Key: YARN-1266 URL: https://issues.apache.org/jira/browse/YARN-1266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1266-1.patch, YARN-1266-2.patch, YARN-1266-3.patch, YARN-1266-4.patch Adding ApplicationHistoryProtocolPBService to make web apps work and changing yarn to run AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1318) Promote AdminService to an Always-On service and merge in RMHAProtocolService
[ https://issues.apache.org/jira/browse/YARN-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826230#comment-13826230 ] Vinod Kumar Vavilapalli commented on YARN-1318: --- Apologies, was busy all of last week and was off from work for the latter part of it. The patch doesn't seem to apply anymore. Did a patch-file 'review' (not a fan of these), but quick comments: - Throw an exception instead of logging when an admin refresh*s on a standby? - RMContext was originally supposed to be a read-only interface. We did have to add a few setters to resolve circular dependencies, but I think it was a mistake to add the setters to the interface. With this patch, it becomes worse. Can we at least try to keep the interface read-only and add all the setters in the implementation only? Promote AdminService to an Always-On service and merge in RMHAProtocolService - Key: YARN-1318 URL: https://issues.apache.org/jira/browse/YARN-1318 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Labels: ha Attachments: yarn-1318-0.patch, yarn-1318-1.patch, yarn-1318-2.patch, yarn-1318-2.patch, yarn-1318-3.patch Per discussion in YARN-1068, we want AdminService to handle HA-admin operations in addition to the regular non-HA admin operations. To facilitate this, we need to move AdminService to be an Always-On service. -- This message was sent by Atlassian JIRA (v6.1#6144)
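The read-only-interface suggestion in the comment above can be sketched minimally. The interface and field names below are illustrative, not the real RMContext signatures: the interface exposes only getters, while the setters live solely on the implementation class, so any code that is handed the interface cannot mutate the context.

```java
// Read-only view: getters only, as the comment suggests for RMContext.
interface ReadOnlyRMContext {
    String getHAServiceState();
}

// Implementation: the only place where mutation is possible.
class RMContextSketch implements ReadOnlyRMContext {
    private String haServiceState;

    @Override
    public String getHAServiceState() { return haServiceState; }

    // Setter confined to the impl; not visible through the interface.
    void setHAServiceState(String state) { this.haServiceState = state; }
}

public class RMContextDemo {
    public static void main(String[] args) {
        RMContextSketch impl = new RMContextSketch();
        impl.setHAServiceState("active");
        ReadOnlyRMContext ctx = impl;                 // callers get the read-only view
        System.out.println(ctx.getHAServiceState());  // prints active
        // ctx.setHAServiceState("standby");          // would not compile
    }
}
```

This keeps the circular-dependency workaround (the impl still has setters) without leaking mutability into every consumer of the interface.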
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826255#comment-13826255 ] Hudson commented on YARN-1210: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4757 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4757/]) YARN-1210. Changed RM to start new app-attempts on RM restart only after ensuring that previous AM exited or after expiry time. Contributed by Omkar Vinit Joshi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1543310) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/records/NodeStatus.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/records/impl/pb/NodeStatusPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdater.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptState.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Fix For: 2.3.0 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch, YARN-1210.7.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins
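The restart policy described above (launch a new attempt only once the previous AM is known to have exited, or after the liveness expiry window, worst case 10 minutes) reduces to a small predicate. This is an illustrative sketch, not the actual YARN state machine or its class names.

```java
// Hypothetical sketch of the attempt-restart decision in this issue.
public class AttemptRestartPolicy {
    private final long expiryMillis;

    public AttemptRestartPolicy(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /**
     * @param previousAmExited    whether the old AM was observed to exit
     * @param millisSinceRecovery time elapsed since the RM recovered
     * @return whether a new app attempt may be started now
     */
    public boolean canStartNewAttempt(boolean previousAmExited, long millisSinceRecovery) {
        return previousAmExited || millisSinceRecovery >= expiryMillis;
    }

    public static void main(String[] args) {
        // 10-minute expiry, matching the worst case described above
        AttemptRestartPolicy policy = new AttemptRestartPolicy(10 * 60 * 1000L);
        System.out.println(policy.canStartNewAttempt(false, 0));               // must keep waiting
        System.out.println(policy.canStartNewAttempt(true, 0));                // AM exited: go
        System.out.println(policy.canStartNewAttempt(false, 11 * 60 * 1000L)); // expiry elapsed: go
    }
}
```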
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826256#comment-13826256 ] Hudson commented on YARN-674: - SUCCESS: Integrated in Hadoop-trunk-Commit #4757 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4757/]) YARN-674. Fixed ResourceManager to renew DelegationTokens on submission asynchronously to work around potential slowness in state-store. Contributed by Omkar Vinit Joshi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1543312) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java Slow or failing DelegationToken renewals on submission itself make RM 
unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Fix For: 2.3.0 Attachments: YARN-674.1.patch, YARN-674.10.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or a down NameNode will make it look like RM is unavailable, as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
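The fix's core idea, renewing tokens asynchronously so a slow NameNode cannot pin the submission RPC handler, can be sketched with a background executor. The `TokenRenewer` interface, pool size, and method names below are hypothetical illustrations, not the real `DelegationTokenRenewer` API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: submission hands renewal to a background pool and returns
// immediately, so a blocked renewal no longer ties up an RPC handler.
public class AsyncRenewalSketch {
    interface TokenRenewer {
        void renew(String token) throws Exception; // may block on a slow NameNode
    }

    private final ExecutorService renewerPool = Executors.newFixedThreadPool(4);
    private final TokenRenewer renewer;

    public AsyncRenewalSketch(TokenRenewer renewer) {
        this.renewer = renewer;
    }

    /** Returns immediately; renewal proceeds off the submission thread. */
    public Future<?> submitApplication(String appId, String token) {
        return renewerPool.submit(() -> {
            renewer.renew(token);
            // on success the app would proceed to scheduling;
            // on failure it would be rejected asynchronously
            return null;
        });
    }

    public void shutdown() {
        renewerPool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // simulate a sluggish NameNode with a short sleep
        AsyncRenewalSketch rm = new AsyncRenewalSketch(t -> TimeUnit.MILLISECONDS.sleep(50));
        Future<?> renewal = rm.submitApplication("app_0001", "hdfs-token");
        renewal.get(); // the demo waits; real submission code would not
        rm.shutdown();
    }
}
```

The design trade-off, as the JIRA description implies, is that submission acknowledgment no longer proves the token is renewable; failures must be surfaced to the app later.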