[jira] [Commented] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226107#comment-14226107 ] Tsuyoshi OZAWA commented on YARN-2404: -- Thanks for your review, Jian! Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Fix For: 2.7.0 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch, YARN-2404.8.patch We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()
[ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223993#comment-14223993 ] Tsuyoshi OZAWA commented on YARN-2025: -- Thanks for your point, [~rohithsharma]. I'll take a look. Possible NPE in schedulers#addApplicationAttempt() -- Key: YARN-2025 URL: https://issues.apache.org/jira/browse/YARN-2025 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2025.1.patch In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we don't check whether {{application}} is null. This can cause an NPE in the following sequence: addApplication() - doneApplication() (e.g. AppKilledTransition) - addApplicationAttempt(). {code} SchedulerApplication application = applications.get(applicationAttemptId.getApplicationId()); String user = application.getUser(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
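A hedged sketch of the null guard discussed in YARN-2025 above; it extends the snippet quoted in the description and assumes the surrounding scheduler code ({{applications}} map and {{LOG}}), so it is illustrative rather than the committed fix:
{code}
// Hedged sketch only: guard against the application having already been
// removed (e.g. doneApplication() via AppKilledTransition racing with
// addApplicationAttempt()) before dereferencing it.
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  LOG.warn("Application " + applicationAttemptId.getApplicationId()
      + " is no longer known to the scheduler; ignoring attempt "
      + applicationAttemptId);
  return;
}
String user = application.getUser();
{code}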
[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2404: - Attachment: YARN-2404.8.patch [~jianhe], good catch. Updated a patch. Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch, YARN-2404.8.patch We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2404: - Attachment: YARN-2404.7.patch Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222333#comment-14222333 ] Tsuyoshi OZAWA commented on YARN-2404: -- Jian, thanks for your review! Updated a patch to address your comments: * Removing unused Credentials, ApplicationAttemptId, and always-true-assertion. * Changed get/setAppAttemptTokens to accept/return Credentials instead of ByteBuffer. Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222338#comment-14222338 ] Tsuyoshi OZAWA commented on YARN-2517: -- [~zjshen], thanks for your comment. {code} // Async call - Type (1) void asyncCall(Input, CallBackHandler); {code} If we choose Type (1), TimelineClient needs to manage all of the callback handler objects passed to it, which can get complex. I think it's better to add {{registerAsyncCallbackHandler(CallBackHandler)}} to TimelineClient separately, or to pass the callback via the constructor, for simplicity. Or do we have use cases that need to switch the CallBackHandler for each method call? {quote} Maybe compromise now is to add putEntitiesAsync to TimelineClient. In the future, let's see if we want to have a separate TimelineClientAsync that contains a bunch of async APIs. {quote} If we plan to migrate from putEntitiesAsync to TimelineClientAsync, I think we should add TimelineClientAsync from the beginning. I think it's enough to add *Async methods to the current TimelineClient for now, since users then don't need to create specific clients for async calls. [~mitdesai] thanks for your help. Looking forward to your opinion. We're discussing the design based on [Vinod's suggestion|https://issues.apache.org/jira/browse/YARN-2517?focusedCommentId=14128819&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14128819]. Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
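A hedged sketch contrasting the two callback styles debated in the YARN-2517 comment above; the interface and handler names are illustrative only, not the actual TimelineClient API:
{code}
// Sketch of the two styles, assuming an Object-typed entity for brevity.
interface CallBackHandler {
  void onSuccess(Object response);
  void onError(Throwable cause);
}

interface PerCallHandlerStyle {           // Type (1): handler passed per call
  void putEntitiesAsync(Object entity, CallBackHandler handler);
}

interface RegisteredHandlerStyle {        // alternative: register once up front
  void registerAsyncCallbackHandler(CallBackHandler handler);
  void putEntitiesAsync(Object entity);
}
{code}
With Type (1) the client must track every handler it has been handed; with the registered-handler style the client keeps a single handler reference, which is the simplification argued for in the comment.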
[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2404: - Attachment: YARN-2404.5.patch Refreshed a patch. I found that TestRMRestart#testAppRecoveredInOrderOnRMRestart fails after the refactoring since recoverApplication loads data from RMStateStore#RMState#appState, which is created as an instance of HashMap. We should make it a TreeMap to preserve the restore order by key, so I fixed that in this patch. [~jianhe], could you take a look? Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
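A self-contained illustration of the ordering point in the YARN-2404 update above: a HashMap does not preserve key order, while a TreeMap iterates in sorted key order, which is what the recovery path relies on. Class and key names here are made up for the example:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RestoreOrderDemo {
  public static void main(String[] args) {
    Map<String, String> hashed = new HashMap<String, String>();
    Map<String, String> sorted = new TreeMap<String, String>();
    for (String appId : new String[] {"app_003", "app_001", "app_002"}) {
      hashed.put(appId, "state");
      sorted.put(appId, "state");
    }
    System.out.println("HashMap order (arbitrary): " + hashed.keySet());
    System.out.println("TreeMap order (sorted):    " + sorted.keySet());
  }
}
{code}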
[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2404: - Attachment: YARN-2404.6.patch Fixed warnings by findbugs. Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2517: - Attachment: YARN-2517.2.patch Sorry for the delay. Attached a Future-based implementation for simplicity. I think this design is one of the better ways to go. [~vinodkv], [~zjshen], should we add read APIs in another JIRA? And do you have any opinions about the Future-based design? Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
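A hedged sketch of what a Future-based putEntitiesAsync could look like, assuming a single-threaded executor inside the client; this is illustrative and not the contents of the YARN-2517.2 patch:
{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureBasedClientSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Caller gets a Future back and decides when (or whether) to block on it.
  Future<String> putEntitiesAsync(final String entity) {
    return executor.submit(new Callable<String>() {
      @Override
      public String call() throws Exception {
        // The synchronous putEntities(entity) call would run here.
        return "response-for-" + entity;
      }
    });
  }
}
{code}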
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219554#comment-14219554 ] Tsuyoshi OZAWA commented on YARN-2517: -- [~mitdesai] I think you're one of the users of TimelineClient. If you have any feedback about the interface, please let me know. Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
[ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217448#comment-14217448 ] Tsuyoshi OZAWA commented on YARN-2865: -- [~rohithsharma], thanks for taking this issue. +1 for adding the Private and Unstable annotations to the methods defined in RMActiveServiceContext, as Karthik mentioned. The other points look good to me. Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Critical Attachments: YARN-2865.patch, YARN-2865.patch YARN-2588 handles the exception thrown while transitioningToActive and resets activeServices. But it misses clearing the RMContext apps/nodes details, ClusterMetrics, and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2800) Remove MemoryNodeLabelsStore and add a way to enable/disable node labels feature
[ https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217569#comment-14217569 ] Tsuyoshi OZAWA commented on YARN-2800: -- Thanks for your comment, Vinod, and thanks for the patch, Wangda. +1 for removing MemoryNodeLabelsStore. My comments: * MemoryRMNodeLabelsManager for tests does nothing in the new patch. How about renaming MemoryRMNodeLabelsManager to NullRMNodeLabelsManager for consistency with RMStateStore? * Maybe not related to this JIRA, but it would be better to add a test for RM restart with the node labels manager to avoid regressions. Remove MemoryNodeLabelsStore and add a way to enable/disable node labels feature Key: YARN-2800 URL: https://issues.apache.org/jira/browse/YARN-2800 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch, YARN-2800-20141118-1.patch, YARN-2800-20141118-2.patch In the past, we had a MemoryNodeLabelStore, mostly for users to try this feature without configuring where to store node labels on the file system. It seems convenient for users to try, but it actually causes a bad user experience. Users may add/remove labels and edit capacity-scheduler.xml. After RM restart, the labels will be gone (we store them in memory), and the RM cannot start if some queue uses labels that don't exist in the cluster. As discussed, we should have an explicit way to let users specify whether they want this feature or not. If node labels are disabled, any operation trying to modify/use node labels will throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2607) TestDistributedShell fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2607: - Attachment: test.log [~leftnoteasy] yeah, it failed on trunk. Attaching a log. TestDistributedShell fails in trunk --- Key: YARN-2607 URL: https://issues.apache.org/jira/browse/YARN-2607 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Wangda Tan Fix For: 2.6.0 Attachments: YARN-2607-1.patch, YARN-2607-2.patch, YARN-2607-3.patch, test.log From https://builds.apache.org/job/Hadoop-Yarn-trunk/691/console : {code} testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 35.641 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308) {code} On Linux, I got the following locally: {code} testDSAttemptFailuresValidityIntervalFailed(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 64.715 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalFailed(TestDistributedShell.java:384) testDSAttemptFailuresValidityIntervalSucess(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 115.842 sec ERROR! java.lang.Exception: test timed out after 9 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:680) at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:661) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalSucess(TestDistributedShell.java:342) testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 35.633 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode
[ https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204006#comment-14204006 ] Tsuyoshi OZAWA commented on YARN-2830: -- [~acmurthy] Let me review the latest patch. Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode -- Key: YARN-2830 URL: https://issues.apache.org/jira/browse/YARN-2830 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Blocker Attachments: YARN-2830-v1.patch, YARN-2830-v2.patch, YARN-2830-v3.patch, YARN-2830-v4.patch YARN-2229 modified the private unstable api for constructing. Tez uses this api (shouldn't, but does) for use with Tez Local Mode. This causes a NoSuchMethod error when using Tez compiled against pre-2.6. Instead I propose we add the backwards compatible api since overflow is not a problem in tez local mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2607) TestDistributedShell fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204073#comment-14204073 ] Tsuyoshi OZAWA commented on YARN-2607: -- TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression still fails on my local because of timeout. TestDistributedShell fails in trunk --- Key: YARN-2607 URL: https://issues.apache.org/jira/browse/YARN-2607 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Wangda Tan Fix For: 2.6.0 Attachments: YARN-2607-1.patch, YARN-2607-2.patch, YARN-2607-3.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/691/console : {code} testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 35.641 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308) {code} On Linux, I got the following locally: {code} testDSAttemptFailuresValidityIntervalFailed(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 64.715 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalFailed(TestDistributedShell.java:384) testDSAttemptFailuresValidityIntervalSucess(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 115.842 sec ERROR! java.lang.Exception: test timed out after 9 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:680) at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:661) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalSucess(TestDistributedShell.java:342) testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 35.633 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode
[ https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204074#comment-14204074 ] Tsuyoshi OZAWA commented on YARN-2830: -- +1, the failure of TestApplicationClientProtocolOnHA is not related to the patch and succeeded on my local. Vinod, thanks for pointing the link. Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode -- Key: YARN-2830 URL: https://issues.apache.org/jira/browse/YARN-2830 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Blocker Attachments: YARN-2830-v1.patch, YARN-2830-v2.patch, YARN-2830-v3.patch, YARN-2830-v4.patch YARN-2229 modified the private unstable api for constructing. Tez uses this api (shouldn't, but does) for use with Tez Local Mode. This causes a NoSuchMethod error when using Tez compiled against pre-2.6. Instead I propose we add the backwards compatible api since overflow is not a problem in tez local mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode
[ https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202580#comment-14202580 ] Tsuyoshi OZAWA commented on YARN-2830: -- Hi [~jeagles], thanks for the report and the contribution. I checked the source code of Tez Local Mode: {code} ContainerId cId = ContainerId.newInstance(applicationAttemptId, 1); {code} I think it's more straightforward to fix this in Tez as follows than to add an int-typed newInstance method: {code} ContainerId cId = ContainerId.newInstance(applicationAttemptId, 1L); {code} What do you think? cc: [~bikassaha] Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode -- Key: YARN-2830 URL: https://issues.apache.org/jira/browse/YARN-2830 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Blocker Attachments: YARN-2830-v1.patch YARN-2229 modified the private unstable api for constructing. Tez uses this api (shouldn't, but does) for use with Tez Local Mode. This causes a NoSuchMethod error when using Tez compiled against pre-2.6. Instead I propose we add the backwards compatible api since overflow is not a problem in tez local mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode
[ https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202599#comment-14202599 ] Tsuyoshi OZAWA commented on YARN-2830: -- [~jeagles] Makes sense. Do you mind marking the new method as Deprecated? That would help users avoid using this method by mistake. Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode -- Key: YARN-2830 URL: https://issues.apache.org/jira/browse/YARN-2830 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Blocker Attachments: YARN-2830-v1.patch YARN-2229 modified the private unstable api for constructing. Tez uses this api (shouldn't, but does) for use with Tez Local Mode. This causes a NoSuchMethod error when using Tez compiled against pre-2.6. Instead I propose we add the backwards compatible api since overflow is not a problem in tez local mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
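A hedged sketch of the deprecated backwards-compatible overload suggested in the YARN-2830 comment above; the class and method names are stand-ins, not the exact signatures committed to ContainerId:
{code}
// Sketch only: keep the old int-based factory as a deprecated shim that
// delegates to the long-based one added by YARN-2229.
class ContainerIdSketch {
  static ContainerIdSketch newContainerId(Object appAttemptId, long containerId) {
    return new ContainerIdSketch();
  }

  @Deprecated
  static ContainerIdSketch newInstance(Object appAttemptId, int containerId) {
    return newContainerId(appAttemptId, containerId); // int widens to long
  }
}
{code}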
[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode
[ https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203076#comment-14203076 ] Tsuyoshi OZAWA commented on YARN-2830: -- +1 for Sid's approach. [~jeagles], I'll review a patch after fixing the build failure. Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode -- Key: YARN-2830 URL: https://issues.apache.org/jira/browse/YARN-2830 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles Priority: Blocker Attachments: YARN-2830-v1.patch, YARN-2830-v2.patch, YARN-2830-v3.patch YARN-2229 modified the private unstable api for constructing. Tez uses this api (shouldn't, but does) for use with Tez Local Mode. This causes a NoSuchMethod error when using Tez compiled against pre-2.6. Instead I propose we add the backwards compatible api since overflow is not a problem in tez local mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194705#comment-14194705 ] Tsuyoshi OZAWA commented on YARN-2801: -- [~gururaj] Great. Feel free to ask us if you have any problems or questions. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: New Feature Components: documentation Reporter: Gururaj Shetty Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled
[ https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194719#comment-14194719 ] Tsuyoshi OZAWA commented on YARN-2800: -- [~leftnoteasy], thanks for taking this JIRA. IMHO, dumping the warning every time users run the commands is too verbose, since users may be using MemoryRMNodeLabelsManager experimentally. I'd prefer to log it once at startup and show which kind of RMNodeLabelsManager is in use on the Web UI, as we do for the RMStateStore. The patch on YARN-1326 can help you add the type to the Web UI. Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled -- Key: YARN-2800 URL: https://issues.apache.org/jira/browse/YARN-2800 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch Even though we have documented this, it would be better to explicitly print a message on both the RM and RMAdminCLI side saying that node labels being added will be lost across RM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled
[ https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195130#comment-14195130 ] Tsuyoshi OZAWA commented on YARN-2800: -- [~leftnoteasy], thanks for your comments. {quote} + + this message is based on the yarn-site.xml settings + + in the machine you run \yarn rmadmin ...\, if you + + already edited the field in yarn-site.xml of the node + + running RM, please ignore this message.; {quote} I think printing the message based on the client-side configuration can confuse users - it can be different from the RM-side configuration. Not every user has a copy of the RM-side configuration, and some users don't know its contents. {quote} But if user configured mem-based node labels manager, user may add labels to queue configurations, when RM will be failed to launch (specifically, CS cannot initialize) if a queue use a label but not existed in node labels manager {quote} Let me clarify this case - do you mean the RM will fail to allocate containers on labeled nodes after RM restart, since the RM uses MemoryRMNodeLabelsManager and forgets the node-to-labels mapping? In that case, I think we should raise a warning to the submitter of YARN apps, e.g. the application cannot be submitted for now since no node has the required label after restart. That's more straightforward because users can then notice the mistake in their label configuration. So I think the better way is to log the warning once at startup and add the information to the Web UI for consistency. What do you think? Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled -- Key: YARN-2800 URL: https://issues.apache.org/jira/browse/YARN-2800 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch Even though we have documented this, it would be better to explicitly print a message on both the RM and RMAdminCLI side saying that node labels being added will be lost across RM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled
[ https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195220#comment-14195220 ] Tsuyoshi OZAWA commented on YARN-2800: -- [~leftnoteasy], thanks for your clarification. Essentially, the label configuration is part of the RM's state. IMHO, we should ideally move this essential configuration into the RMStateStore to prevent the mismatch. I think ZK can handle it, since the frequency of label updates is not that high and the number of labels is not that large. cc: [~jianhe], [~kkambatl], what do you think? Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled -- Key: YARN-2800 URL: https://issues.apache.org/jira/browse/YARN-2800 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch Even though we have documented this, it would be better to explicitly print a message on both the RM and RMAdminCLI side saying that node labels being added will be lost across RM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2794) Fix log msgs about distributing system-credentials
[ https://issues.apache.org/jira/browse/YARN-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195258#comment-14195258 ] Tsuyoshi OZAWA commented on YARN-2794: -- [~jianhe], thanks for taking this JIRA. Shouldn't we use a ConcurrentHashMap? IIUC, making the variable volatile is not enough for synchronization in this case. Please correct me if I'm wrong. Fix log msgs about distributing system-credentials --- Key: YARN-2794 URL: https://issues.apache.org/jira/browse/YARN-2794 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2794.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2794) Fix log msgs about distributing system-credentials
[ https://issues.apache.org/jira/browse/YARN-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195273#comment-14195273 ] Tsuyoshi OZAWA commented on YARN-2794: -- [~jianhe], oops, I got it. systemCredentials is only updated via setSystemCredentials, so your solution is sufficient. One minor nit: TestLogAggregationService#testAddNewTokenSentFromRMForLogAggregation calls {{this.context.getSystemCredentialsForApps().put(application1, credentials);}}. We should use a ConcurrentHashMap for that test case. Fix log msgs about distributing system-credentials --- Key: YARN-2794 URL: https://issues.apache.org/jira/browse/YARN-2794 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2794.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
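A self-contained sketch of the pattern discussed in the two YARN-2794 comments above: a volatile reference that is only ever replaced wholesale (never mutated in place) is safe for readers, whereas tests that call put() directly on the shared map need a concurrent map instead. Class and method names are illustrative, not the NMContext code:
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SystemCredentialsHolder {
  // Readers see either the old or the new snapshot, never a partial update.
  private volatile Map<String, byte[]> systemCredentials =
      Collections.emptyMap();

  // Writer path: swap in a whole new immutable map.
  void setSystemCredentialsForApps(Map<String, byte[]> latest) {
    this.systemCredentials =
        Collections.unmodifiableMap(new HashMap<String, byte[]>(latest));
  }

  Map<String, byte[]> getSystemCredentialsForApps() {
    return systemCredentials;
  }
}
{code}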
[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled
[ https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195407#comment-14195407 ] Tsuyoshi OZAWA commented on YARN-2800: -- [~leftnoteasy], if we assume labels are configuration that can be updated frequently, ZK is not a good option, as you mentioned. In this case, I think NodeLabelsManager, whose backend can be leveldb or rocksdb, should be loosely coupled with the RM, like the TimelineServer, for the stability of the RM. One option is making NodeLabelsManager a NodeLabelsServer. That means the RM should work correctly even if the NodeLabelsManager is temporarily unavailable, and update operations should only affect the NodeLabelsManager (they don't affect the RM). For example, the RM pulls the label information from the NodeLabelsServer periodically, treats the label information as a hint, and schedules based on it. Even without the information, the RM should still schedule apps. I think this weak-consistency approach is suitable for high-volume updates. Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled -- Key: YARN-2800 URL: https://issues.apache.org/jira/browse/YARN-2800 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch Even though we have documented this, it would be better to explicitly print a message on both the RM and RMAdminCLI side saying that node labels being added will be lost across RM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189720#comment-14189720 ] Tsuyoshi OZAWA commented on YARN-2712: -- [~adhoot] [~kkambatl] [~jianhe] do you have additional comments? Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch, YARN-2712.2.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2712) TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189835#comment-14189835 ] Tsuyoshi OZAWA commented on YARN-2712: -- Thanks, Anubhav and Karthik, for the reviews. TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.7.0 Attachments: YARN-2712.1.patch, YARN-2712.2.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2765) Add leveldb-based implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189934#comment-14189934 ] Tsuyoshi OZAWA commented on YARN-2765: -- Currently we can assume that LeveldbRMStateStore is accessed from a single process; failure detection itself is done by EmbeddedElector, which depends on ZooKeeper, and LeveldbRMStateStore has no support for fencing. That means we need to launch ZooKeeper anyway, so using ZKRMStateStore is the natural choice in this case. Please correct me if I'm wrong. On another front, if we use RocksDB as the backend db of the timeline server, we don't need to use leveldb, and switching the dependency would be a good decision. Add leveldb-based implementation for RMStateStore - Key: YARN-2765 URL: https://issues.apache.org/jira/browse/YARN-2765 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2765.patch, YARN-2765v2.patch It would be nice to have a leveldb option to the resourcemanager recovery store. Leveldb would provide some benefits over the existing filesystem store such as better support for atomic operations, fewer I/O ops per state update, and far fewer total files on the filesystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1813: - Attachment: YARN-1813.5.patch Refreshed a patch. Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Andrew Wang Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1813: - Affects Version/s: 2.4.1 2.5.1 Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0, 2.4.1, 2.5.1 Reporter: Andrew Wang Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2737) Misleading msg in LogCLI when app is not successfully submitted
[ https://issues.apache.org/jira/browse/YARN-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186548#comment-14186548 ] Tsuyoshi OZAWA commented on YARN-2737: -- YARN-1813 addresses an issue with handling AccessControlException correctly. Misleading msg in LogCLI when app is not successfully submitted Key: YARN-2737 URL: https://issues.apache.org/jira/browse/YARN-2737 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Jian He Assignee: Rohith {{LogCLiHelpers#logDirNotExist}} prints msg {{Log aggregation has not completed or is not enabled.}} if the app log file doesn't exist. This is misleading because if the application is not submitted successfully. Clearly, we won't have logs for this application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1813: - Attachment: YARN-1813.6.patch Thanks for your review, Rohith! Updated: 1. Changed a log message to Permission denied. : /path/to/dir 2. Removed needless change in AggregatedLogsBlock. 3. Updated log message in {{logDirNotExist}}. Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0, 2.4.1, 2.5.1 Reporter: Andrew Wang Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch, YARN-1813.6.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2737) Misleading msg in LogCLI when app is not successfully submitted
[ https://issues.apache.org/jira/browse/YARN-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187253#comment-14187253 ] Tsuyoshi OZAWA commented on YARN-2737: -- [~jianhe], [~rohithsharma] reviewed a patch on YARN-1813. It includes the comment about this issue. Could you take a look? Misleading msg in LogCLI when app is not successfully submitted Key: YARN-2737 URL: https://issues.apache.org/jira/browse/YARN-2737 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Jian He Assignee: Rohith {{LogCLiHelpers#logDirNotExist}} prints msg {{Log aggregation has not completed or is not enabled.}} if the app log file doesn't exist. This is misleading because if the application is not submitted successfully. Clearly, we won't have logs for this application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2742) FairSchedulerConfiguration fails to parse if there is extra space between value and unit
[ https://issues.apache.org/jira/browse/YARN-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187264#comment-14187264 ] Tsuyoshi OZAWA commented on YARN-2742: -- [~ywskycn], thanks for the contribution. How about adding a test case that includes a trailing space, like 1024 mb, 4 core ? FairSchedulerConfiguration fails to parse if there is extra space between value and unit Key: YARN-2742 URL: https://issues.apache.org/jira/browse/YARN-2742 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Wei Yan Priority: Minor Attachments: YARN-2742-1.patch FairSchedulerConfiguration is very strict about the number of space characters between the value and the unit: 0 or 1 space. For example, for values like the following: {noformat} <maxResources>4096  mb, 2 vcores</maxResources> {noformat} (note 2 spaces) The line above fails to parse: {noformat} 2014-10-24 22:56:40,802 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Missing resource: mb at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.findResource(FairSchedulerConfiguration.java:247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.parseResourceConfigValue(FairSchedulerConfiguration.java:231) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:347) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:381) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:293) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService$1.run(AllocationFileLoaderService.java:117) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
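A self-contained sketch of a whitespace-tolerant parse for values like the ones in the YARN-2742 report above; the real FairSchedulerConfiguration uses its own regex, so this only illustrates the extra-space and trailing-space inputs worth covering in tests:
{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResourceValueParseDemo {
  // \s* tolerates zero or more spaces between the number and its unit.
  private static final Pattern MB =
      Pattern.compile("(\\d+)\\s*mb", Pattern.CASE_INSENSITIVE);

  public static void main(String[] args) {
    for (String value : new String[] {"4096 mb, 2 vcores",
                                      "4096  mb, 2 vcores",
                                      "1024 mb, 4 core "}) {
      Matcher m = MB.matcher(value);
      System.out.println("[" + value + "] -> memory="
          + (m.find() ? m.group(1) : "missing"));
    }
  }
}
{code}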
[jira] [Commented] (YARN-2742) FairSchedulerConfiguration fails to parse if there is extra space between value and unit
[ https://issues.apache.org/jira/browse/YARN-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187428#comment-14187428 ] Tsuyoshi OZAWA commented on YARN-2742: -- LGTM. FairSchedulerConfiguration fails to parse if there is extra space between value and unit Key: YARN-2742 URL: https://issues.apache.org/jira/browse/YARN-2742 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Wei Yan Priority: Minor Attachments: YARN-2742-1.patch, YARN-2742-2.patch FairSchedulerConfiguration is very strict about the number of space characters between the value and the unit: 0 or 1 space. For example, for values like the following: {noformat} <maxResources>4096  mb, 2 vcores</maxResources> {noformat} (note 2 spaces) The line above fails to parse: {noformat} 2014-10-24 22:56:40,802 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Missing resource: mb at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.findResource(FairSchedulerConfiguration.java:247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.parseResourceConfigValue(FairSchedulerConfiguration.java:231) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:347) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:381) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:293) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService$1.run(AllocationFileLoaderService.java:117) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2765) Add leveldb-based implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187955#comment-14187955 ] Tsuyoshi OZAWA commented on YARN-2765: -- [~jlowe], great work! It looks good to me overall, including the error handling and resource management. Minor nits: how about adding helper methods like getKeyPrefix/getNodePath for building the key prefix and node path? ZKRMStateStore does this as well. {code} String keyPrefix = RM_APP_ROOT + "/" + appId + "/"; ... String appKey = RM_APP_ROOT + "/" + appId; {code} I found that the patch includes lots of hard-coded "/" separators. I think it's better to have a private field SEPARATOR = "/". Add leveldb-based implementation for RMStateStore - Key: YARN-2765 URL: https://issues.apache.org/jira/browse/YARN-2765 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2765.patch It would be nice to have a leveldb option to the resourcemanager recovery store. Leveldb would provide some benefits over the existing filesystem store such as better support for atomic operations, fewer I/O ops per state update, and far fewer total files on the filesystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
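A hedged sketch of the helper methods suggested in the YARN-2765 review comment above; the key layout and constant values are illustrative, not the actual LeveldbRMStateStore code:
{code}
// Sketch only: centralize the separator and key construction instead of
// repeating hard-coded "/" concatenations throughout the store.
class LeveldbKeySketch {
  private static final String SEPARATOR = "/";
  private static final String RM_APP_ROOT = "RMAppRoot";

  static String getApplicationNodeKey(String appId) {
    return RM_APP_ROOT + SEPARATOR + appId;
  }

  static String getApplicationKeyPrefix(String appId) {
    return getApplicationNodeKey(appId) + SEPARATOR;
  }
}
{code}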
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186402#comment-14186402 ] Tsuyoshi OZAWA commented on YARN-2725: -- [~jianhe], do you mind taking a look? Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2725.1.patch YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests about the case of retry requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184603#comment-14184603 ] Tsuyoshi OZAWA commented on YARN-2712: -- [~kkambatl], could you take a look? Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch, YARN-2712.2.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184679#comment-14184679 ] Tsuyoshi OZAWA commented on YARN-2712: -- [~adhoot] [~kkambatl], oops, I misread the previous review comment from Karthik. Thanks for your review, Anubhav. Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch, YARN-2712.2.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA reassigned YARN-2712: Assignee: Tsuyoshi OZAWA Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2712: - Attachment: YARN-2712.1.patch Attaching a first patch including the following changes: 1. Adding tests about FSQueue ({{checkFSQueue}}). 2. Moving headroom tests into check*Queue. 3. Renamed asserteMetrics to assertMetrics. 4. Calling {{updateRootQueueMetrics}} explicitly in {{FairScheduler#update}}, because I found an unexpected behavior with rootQueue while writing the code - {{updateRootQueueMetrics}} isn't called until an RMNode is registered, updated, removed, and added. Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2725: - Attachment: YARN-2725.1.patch Attaching a first patch. Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Attachments: YARN-2725.1.patch YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests about the case of retry requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182760#comment-14182760 ] Tsuyoshi OZAWA commented on YARN-2712: -- The test failure does not look related to the patch. [~kkambatl], [~jianhe], do you mind taking a look, please? Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2712: - Attachment: YARN-2712.2.patch Thanks for your review, Karthik. All lines you pointed can be removed. Updated. Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch, YARN-2712.2.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183845#comment-14183845 ] Tsuyoshi OZAWA commented on YARN-2712: -- The javadoc warning and the test failures of TestMetricsSystemImpl and TestWebDelegationToken look intermittent. Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2712.1.patch, YARN-2712.2.patch TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-810: Issue Type: Improvement (was: Bug) Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quote: {noformat} echo 1000 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... 
{noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about introducing a variable to YarnConfiguration: bq.
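To make the proposal concrete, here is a rough sketch of how a hard ceiling could be derived from a container's vcore request and written into its cgroup. The formula, the ratio handling, and the paths are assumptions for illustration only, not the actual YARN-810 implementation.
{code}
// Rough sketch: derive cpu.cfs_quota_us from a container's vcore ask.
// Assumption: with a pcore:vcore ratio of 1:4, a 1-vcore container is
// ceiled at a quarter of one physical core.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

public class CfsCeilingSketch {

  private static final long CFS_PERIOD_US = 100000; // a common CFS period default

  /** Quota in microseconds per period for a container requesting {@code vcores}. */
  static long quotaForContainer(int vcores, int vcoresPerPcore) {
    return CFS_PERIOD_US * vcores / vcoresPerPcore;
  }

  /** Write the ceiling into the container's cgroup (paths are illustrative). */
  static void applyCeiling(Path containerCgroup, long quotaUs) throws IOException {
    Files.write(containerCgroup.resolve("cpu.cfs_period_us"),
        Collections.singletonList(Long.toString(CFS_PERIOD_US)));
    Files.write(containerCgroup.resolve("cpu.cfs_quota_us"),
        Collections.singletonList(Long.toString(quotaUs)));
  }

  public static void main(String[] args) {
    long quota = quotaForContainer(1, 4);
    System.out.println("cpu.cfs_quota_us = " + quota + " of " + CFS_PERIOD_US + "us");
    // e.g. applyCeiling(Path.of("/cgroup/cpu/hadoop-yarn/container_XXXX"), quota);
  }
}
{code}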
[jira] [Updated] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2398: - Attachment: (was: TestResourceTrackerOnHA-output.txt) TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Bug Reporter: Jason Lowe TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179797#comment-14179797 ] Tsuyoshi OZAWA commented on YARN-2398: -- Rohith, Wangda, yeah, thanks for pointing that out. The log I attached doesn't look related to the issue Jason mentioned. Removing it. TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Bug Reporter: Jason Lowe TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2725) Adding retry requests about ZKRMStateStore
Tsuyoshi OZAWA created YARN-2725: Summary: Adding retry requests about ZKRMStateStore Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests about the case of retry requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception
[ https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179842#comment-14179842 ] Tsuyoshi OZAWA commented on YARN-2721: -- Good job, Jian. Created YARN-2725 for adding tests to cover these cases. Race condition: ZKRMStateStore retry logic may throw NodeExist exception - Key: YARN-2721 URL: https://issues.apache.org/jira/browse/YARN-2721 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2721.1.patch Blindly retrying operations in zookeeper will not work for non-idempotent operations (like create znode). The reason is that the client can do a create znode, but the response may not be returned because the server can die or timeout. In case of retrying the create znode, it will throw a NODE_EXISTS exception from the earlier create from the same session. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
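As a rough sketch of the behavior such tests would exercise (assumptions only, not the YARN-2721 patch itself): a retried, non-idempotent create can tolerate NodeExistsException when an earlier attempt may already have succeeded on the server before the response was lost.
{code}
// Minimal sketch of a retry-safe create: if a previous attempt may already
// have been applied (connection loss before the response arrived), a
// NodeExistsException on retry is treated as success instead of failing
// the state-store operation.
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;

public class RetrySafeCreateSketch {

  private static final int MAX_RETRIES = 3;

  static void createWithRetries(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    List<ACL> acl = ZooDefs.Ids.OPEN_ACL_UNSAFE;
    boolean mayHaveSucceeded = false; // a previous attempt's response was lost
    for (int attempt = 0; ; attempt++) {
      try {
        zk.create(path, data, acl, CreateMode.PERSISTENT);
        return;
      } catch (KeeperException.NodeExistsException e) {
        if (mayHaveSucceeded) {
          return; // our earlier create most likely went through
        }
        throw e; // a genuine conflict created by someone else
      } catch (KeeperException.ConnectionLossException e) {
        if (attempt >= MAX_RETRIES) {
          throw e;
        }
        mayHaveSucceeded = true; // the create may have been applied before the response was lost
      }
    }
  }
}
{code}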
[jira] [Updated] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2725: - Summary: Adding test cases of retrying requests about ZKRMStateStore (was: Adding retry requests about ZKRMStateStore) Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests about the case of retry requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2710: - Attachment: TestResourceTrackerOnHA-output.2.txt I could reproduced same issue about TestResourceTrackerOnHA - it's intermittent failure, and it happens rarely. Attaching log on my local. RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: TestResourceTrackerOnHA-output.2.txt, org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failure like, it can be happened in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
Tsuyoshi OZAWA created YARN-2712: Summary: Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2712: - Issue Type: Sub-task (was: Test) Parent: YARN-556 Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about FairScheduler partially. We should support them. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176288#comment-14176288 ] Tsuyoshi OZAWA commented on YARN-2710: -- [~leftnoteasy], it passes on my local too. I checked the log you attached - it failed since EOFException occured. EOFException can happen with different protobuf format mixes. Could you retry the test after {{mvn clean}}? It sometimes resolves the problem. RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failure like, it can be happened in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176320#comment-14176320 ] Tsuyoshi OZAWA commented on YARN-1879: -- Thanks Jian, Karthik, Vinod, Xuan and Anubhav for reviews and comments! Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Fix For: 2.6.0 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176421#comment-14176421 ] Tsuyoshi OZAWA commented on YARN-2710: -- BTW, YARN-2398 is addressing intermittent failure of TestResourceTrackerOnHA. RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failure like, it can be happened in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-2710. -- Resolution: Duplicate RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failure like, it can be happened in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176424#comment-14176424 ] Tsuyoshi OZAWA commented on YARN-2710: -- Closing this issue as dup of YARN-2398. RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failure like, it can be happened in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174916#comment-14174916 ] Tsuyoshi OZAWA commented on YARN-1879: -- The failures of TestFairScheduler and TestResourceTrackerOnHA is not related to the patch: * TestFairScheduler fails intermittently - it passes on my local. * TestResourceTrackerOnHA fails on trunk. [~jianhe], could you take a look? Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.29.patch Thanks for your review, Jian. I see, it's better because it's more similar to the real code path. I also noticed that the v28 patch includes needless changes which were introduced in older versions of the patch. Attaching a new patch (v29) to remove them and simplify the test code. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175678#comment-14175678 ] Tsuyoshi OZAWA commented on YARN-1879: -- Thanks for your comment, Wangda. I confirmed the failure is not related to the patch - all results of failures are same: {quote} java.lang.IllegalArgumentException: Illegal capacity of -1.0 for label=x in queue=root.default {quote} I also confirmed that the tests passes on my local. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.26.patch Rebased on trunk. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2398: - Attachment: TestResourceTrackerOnHA-output.txt Reproduced the issue locally. Attaching the log. TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Bug Reporter: Jason Lowe Attachments: TestResourceTrackerOnHA-output.txt TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173728#comment-14173728 ] Tsuyoshi OZAWA commented on YARN-1879: -- The test failure is not related to the patch and has been filed as YARN-2398 - it still fails without the patch. [~jianhe], [~kkambatl] could you review the latest patch? Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.27.patch Updated to reuse ugi object before and after restart. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.28.patch Fixed the test failure of TestContainerResourceUsage and TestClientToAMTokens. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312-branch-2.8.patch [~jianhe] [~jlowe] Thanks for your comment. Attaching a patch for branch-2. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-branch-2.8.patch, YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch, YARN-2312.7.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.25.patch Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173346#comment-14173346 ] Tsuyoshi OZAWA commented on YARN-1879: -- Talked with Jian offline. {quote} In this case, the token is expired for the application after finishing the AM's container, and I think we don't need to handle it. {quote} I'd like to confirm whether finishApplicationMaster() can be issued after AM containers exit. There is no such case, but finishApplicationMaster() can be issued after the RM removes the AM's entry in the following case: 1. RM1 saves the app in RMStateStore and then crashes. 2. FinishApplicationMasterResponse#isRegistered still returns false. 3. The AM still retries the 2nd RM. Thanks very much for clarifying, Jian. Attached an updated patch which includes a test for retried finishApplicationMaster and a test for retried registerApplicationMaster before and after RM restart. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
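A self-contained sketch of the AM-side flow in that scenario follows; the interfaces are simplified stand-ins for ApplicationMasterProtocol, not the real YARN classes. The AM keeps retrying finishApplicationMaster until the RM that eventually answers - possibly a new RM after failover - confirms the unregistration, which is why the call needs to be retry-safe.
{code}
// Simplified stand-ins only, not the actual ApplicationMasterProtocol API.
public class FinishApplicationMasterRetrySketch {

  interface FinishResponse {
    /** true once the RM has fully processed the unregister request. */
    boolean isUnregistered();
  }

  interface AmRmClient {
    FinishResponse finishApplicationMaster(String diagnostics) throws Exception;
  }

  /** Keep calling finishApplicationMaster until the (possibly new) RM confirms it. */
  static void unregisterWithRetries(AmRmClient client, String diagnostics)
      throws Exception {
    while (true) {
      FinishResponse response = client.finishApplicationMaster(diagnostics);
      if (response.isUnregistered()) {
        return; // duplicates after failover must be tolerated by the RM side
      }
      Thread.sleep(100); // back off and retry; after failover this reaches the new RM
    }
  }
}
{code}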
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168671#comment-14168671 ] Tsuyoshi OZAWA commented on YARN-1879: -- The test failure of TestAMRestart does not look related to the patch since it passes locally. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.24.patch Updated: * Renaming TestApplicationMasterServiceProtocolOnHA#initiate to initialize(). * About registerApplicationMaster(), added a test case for a duplicated request. {quote} We should probably move the following check in finishApplicationMaster call upfront so that ApplicationDoesNotExistInCacheException is not thrown for already completed apps. {quote} In this case, the token is expired for the application after finishing the AM's container, and I think we don't need to handle it. [~jianhe], what do you think? Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.7.patch Thanks Jian, you're right. Updating to use Long.parseLong instead of Integer.parseInt in YarnChild.java. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch, YARN-2312.7.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163248#comment-14163248 ] Tsuyoshi OZAWA commented on YARN-2312: -- s/YARN-6115/MAPREDUCE-6115/ Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163468#comment-14163468 ] Tsuyoshi OZAWA commented on YARN-1879: -- Thanks Karthik and Jian for your review! {quote} We should probably move the following check in finishApplicationMaster call upfront so that ApplicationDoesNotExistInCacheException is not thrown for already completed apps. {quote} Good catch. OK, I'll move the check before upfront. {quote} 2. We should also probably add a test that makes duplicate requests to the same/different RM and verify the behavior is as expected. {quote} Maybe TestWorkPreservingRMRestart is good place to add tests. I'll try it. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
Tsuyoshi OZAWA created YARN-2666: Summary: TestFairScheduler.testContinuousScheduling fails Intermittently Key: YARN-2666 URL: https://issues.apache.org/jira/browse/YARN-2666 Project: Hadoop YARN Issue Type: Test Components: scheduler Reporter: Tsuyoshi OZAWA Assignee: Wei Yan The test fails on trunk. {code} Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) Time elapsed: 0.582 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2252) Intermittent failure of TestFairScheduler.testContinuousScheduling
[ https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164570#comment-14164570 ] Tsuyoshi OZAWA commented on YARN-2252: -- [~kkambatl] [~ywskycn] Thank you, opened YARN-2666 to address the problem. Intermittent failure of TestFairScheduler.testContinuousScheduling -- Key: YARN-2252 URL: https://issues.apache.org/jira/browse/YARN-2252 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: trunk-win Reporter: Ratandeep Ratti Labels: hadoop2, scheduler, yarn Fix For: 2.6.0 Attachments: YARN-2252-1.patch, yarn-2252-2.patch This test-case is failing sporadically on my machine. I think I have a plausible explanation for this. It seems that when the Scheduler is being asked for resources, the resource requests that are being constructed have no preference for the hosts (nodes). The two mock hosts constructed, both have a memory of 8192 mb. The containers(resources) being requested each require a memory of 1024mb, hence a single node can execute both the resource requests for the application. In the end of the test-case it is being asserted that the containers (resource requests) be executed on different nodes, but since we haven't specified any preferences for nodes when requesting the resources, the scheduler (at times) executes both the containers (requests) on the same node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161724#comment-14161724 ] Tsuyoshi OZAWA commented on YARN-1458: -- Sure, thanks for the reply :-) FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor gets blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
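To make the livelock mechanism concrete, the following is a minimal, self-contained sketch of the ratio-search loop that the update thread runs while holding the FairScheduler lock. It is not the actual ComputeFairShares code; the class and method names are illustrative, and it assumes each schedulable's share is its weight times a weight-to-resource ratio, capped at its demand. Under that assumption, a zero weight keeps the summed share at zero, so the search never converges.
{code}
// Illustrative sketch only -- NOT the ComputeFairShares implementation.
public class ZeroWeightLivelockSketch {

  // Share for one schedulable at a given weight-to-resource ratio:
  // weight * ratio, capped at its demand. With weight == 0 this is always 0.
  static long shareFor(double weight, double ratio, long demand) {
    return Math.min((long) (weight * ratio), demand);
  }

  static long sumShares(double[] weights, long[] demands, double ratio) {
    long sum = 0;
    for (int i = 0; i < weights.length; i++) {
      sum += shareFor(weights[i], ratio, demands[i]);
    }
    return sum;
  }

  // Keeps doubling the ratio until the summed shares cover the total resource.
  // If every weight is 0, the sum stays 0 and this loop never terminates,
  // which is the spinning behavior visible in the jstack output above.
  static double findRatio(double[] weights, long[] demands, long totalResource) {
    double ratio = 1.0;
    while (sumShares(weights, demands, ratio) < totalResource) {
      ratio *= 2.0;
    }
    return ratio;
  }

  public static void main(String[] args) {
    System.out.println("ratio with nonzero weights: "
        + findRatio(new double[] {1.0, 1.0}, new long[] {1024, 1024}, 2048));
    // With zero weights the call below would never return:
    // findRatio(new double[] {0.0, 0.0}, new long[] {1024, 1024}, 8192);
  }
}
{code}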
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.6.patch Thanks, Jason and Jian, for the review. Updated: * Removed an unnecessary change in TestTaskAttemptListenerImpl.java - sorry about that; this change was included by mistake. * Defined 0xffL as CONTAINER_ID_BITMASK and exposed it. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
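To illustrate why {{getId}} is being deprecated, here is a rough sketch of the 64-bit container-id layout introduced by YARN-2229: the upper bits carry the RM restart epoch and the lower bits the per-application sequence number. This is not the ContainerId implementation; the 40-bit mask width and the helper names are assumptions for illustration only (the patch exposes the real mask as CONTAINER_ID_BITMASK).
{code}
// Illustrative sketch only -- NOT the ContainerId implementation.
public class ContainerIdSketch {
  // Hypothetical 40-bit mask for the sequence-number portion.
  static final long SEQUENCE_MASK = 0xffffffffffL;
  static final int EPOCH_SHIFT = 40;

  // Full 64-bit id: epoch in the upper bits, sequence number in the lower bits.
  static long fullContainerId(long epoch, long sequenceNumber) {
    return (epoch << EPOCH_SHIFT) | (sequenceNumber & SEQUENCE_MASK);
  }

  public static void main(String[] args) {
    long id = fullContainerId(2, 17);
    // The deprecated getId() amounts to truncating to the low bits, losing the
    // epoch, so ids from different RM generations can collide when compared.
    long truncated = id & SEQUENCE_MASK;
    System.out.println("getContainerId()-style: " + id
        + ", getId()-style: " + truncated);
  }
}
{code}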
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.23.patch The failure of TestClientToAMTokens looks unrelated - it passed on my local machine. Let me submit the same patch again. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2252) Intermittent failure of TestFairScheduler.testContinuousScheduling
[ https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163118#comment-14163118 ] Tsuyoshi OZAWA commented on YARN-2252: -- Hi, this test failure occurs on trunk - you can find it [here|https://issues.apache.org/jira/browse/YARN-2312?focusedCommentId=14161902page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14161902]. The log is as follows:
{code}
Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) Time elapsed: 0.582 sec FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372)
{code}
Can I reopen this issue? Intermittent failure of TestFairScheduler.testContinuousScheduling -- Key: YARN-2252 URL: https://issues.apache.org/jira/browse/YARN-2252 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: trunk-win Reporter: Ratandeep Ratti Labels: hadoop2, scheduler, yarn Fix For: 2.6.0 Attachments: YARN-2252-1.patch, yarn-2252-2.patch This test case is failing sporadically on my machine. I think I have a plausible explanation for this. It seems that when the scheduler is asked for resources, the resource requests being constructed have no preference for the hosts (nodes). The two mock hosts constructed both have a memory of 8192 MB. The containers (resources) being requested each require 1024 MB of memory, hence a single node can satisfy both resource requests for the application. At the end of the test case it is asserted that the containers (resource requests) run on different nodes, but since we haven't specified any node preferences when requesting the resources, the scheduler (at times) places both containers (requests) on the same node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
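For context on the flakiness described above, the sketch below contrasts a request at {{ResourceRequest.ANY}} (no node preference, so the scheduler may place both containers on the same 8192 MB node) with a node-local request. The host name is hypothetical and this is not the test code itself, just an illustration of the two request shapes.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

// Illustrative sketch only -- not taken from TestFairScheduler.
public class LocalityRequestSketch {
  public static void main(String[] args) {
    Resource oneGb = Resource.newInstance(1024, 1);
    Priority prio = Priority.newInstance(1);

    // No locality preference: either mock node may satisfy both containers,
    // which is why the "different nodes" assertion is racy.
    ResourceRequest anyNode =
        ResourceRequest.newInstance(prio, ResourceRequest.ANY, oneGb, 2);

    // Node-local preference for a hypothetical host; one such request per node
    // would make the placement deterministic.
    ResourceRequest nodeLocal =
        ResourceRequest.newInstance(prio, "host1.example.com", oneGb, 1);

    System.out.println(anyNode + "\n" + nodeLocal);
  }
}
{code}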
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163120#comment-14163120 ] Tsuyoshi OZAWA commented on YARN-2312: -- The test failures are not related - the failure of TestFairScheduler is reported in YARN-2252 and the failure of TestPipeApplication is reported in YARN-6115. [~jianhe], [~jlowe], could you review the latest patch? Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.22.patch Made registerApplicationMaster idempotent and marked registerApplicationMaster/finishApplicationMaster with the Idempotent annotation. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161241#comment-14161241 ] Tsuyoshi OZAWA commented on YARN-1879: -- For now, I have no idea how to reconstruct the same response after failover. Currently the latest patch only returns an empty response. This is one discussion point of this design. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161499#comment-14161499 ] Tsuyoshi OZAWA commented on YARN-1879: -- Thanks for your comments, Jian and Karthik. {quote} from RM’s perspective, these are just new requests, as the new RM doesn’t have any cache for previous requests from client. {quote} I confirmed that this is true. Neither {{finishApplicationMaster}} nor {{registerApplicationMaster}} touches the data in ZK directly, so the RM can handle retried requests transparently in the following cases: 1. When EmbeddedElector chooses a different RM as the leader before and after the failover, ZK doesn't have the RMAppAttempt/RMApp data, so the RM recognizes a retried request as a new request. E.g. there are an active RM (RM1) and a standby RM (RM2), and leadership fails over from RM1 to RM2. 2. Even when EmbeddedElector chooses the same RM as the leader before and after the failover, the RM goes into standby state, stops all services before the failover, and reloads the RMAppAttempt/RMApp data. In this case, the RM also recognizes a retried request as a new request. E.g. there are an active RM (RM1) and a standby RM (RM2), and leadership fails over from RM1 back to RM1. I think there is no problem with marking these methods as Idempotent. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.23.patch Marked registerApplicationMaster and finishApplicationMaster with Idempotent annotations. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
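A sketch of what marking the two methods amounts to is shown below. The method signatures mirror ApplicationMasterProtocol and the {{@Idempotent}} annotation is the standard org.apache.hadoop.io.retry one used by the RPC retry policies, but this is an illustrative interface, not the patch itself: annotating a method declares that replaying the call against a freshly failed-over RM is safe, which is exactly the argument made in the comment above.
{code}
import java.io.IOException;

import org.apache.hadoop.io.retry.Idempotent;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterResponse;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Illustrative interface only -- not the committed ApplicationMasterProtocol change.
public interface AmProtocolIdempotencySketch {

  // Safe to replay: the new RM simply treats the retried call as a new registration.
  @Idempotent
  RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException, IOException;

  // Safe to replay for the same reason; the response may be reconstructed as empty.
  @Idempotent
  FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) throws YarnException, IOException;
}
{code}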
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159806#comment-14159806 ] Tsuyoshi OZAWA commented on YARN-1879: -- [~jianhe], [~xgong], thanks for your suggestions. I agree with the suggestion about {{registerApplicationMaster}}. How about {{finishApplicationMaster}}? We cannot distinguish an RPC failure from an RPC success after the change. What do you think? Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159832#comment-14159832 ] Tsuyoshi OZAWA commented on YARN-1879: -- Sounds good. Then we can simply remove the retry cache; I'm removing it in the next patch. [~kkambatl], please let me know if you have any opinions about the design change. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.5.patch Fixed the findbugs warning. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159425#comment-14159425 ] Tsuyoshi OZAWA commented on YARN-2312: -- The test failure of TestPipeApplication is not related to this JIRA and is filed as MAPREDUCE-6120. [~jlowe], could you take a look, please? Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157751#comment-14157751 ] Tsuyoshi OZAWA commented on YARN-1879: -- Thanks for your comments, Karthik. I'm almost done addressing them. {quote} Are there cases when we don't want RetryCache enabled? IMO, we should always use the RetryCache (no harm). If we decide on having a config, the default should be true. {quote} Basically, we can enable the RetryCache without harm, and I have implemented it that way. One concern is tests - on my local machine, lots of tests fail for the following reason: {code} org.apache.hadoop.metrics2.MetricsException: Metrics source RetryCache.AppMasterServiceRetryCache already exists! {code} I'll try to fix them in the next patch by adding {{DefaultMetricsSystem.setMiniClusterMode(true);}} to the tests. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
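A minimal sketch of that test-side fix, assuming a hypothetical test class name: in mini-cluster mode DefaultMetricsSystem tolerates re-registration of a metrics source name within the same JVM, which avoids the "already exists" failure quoted above when several tests create the retry-cache metrics source.
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.junit.BeforeClass;

// Hypothetical test class for illustration; only the setMiniClusterMode call
// is the point here.
public class RetryCacheMetricsTestSketch {

  @BeforeClass
  public static void allowDuplicateMetricsSources() {
    // Prevents "Metrics source ... already exists!" when multiple tests in one
    // JVM register a source with the same name.
    DefaultMetricsSystem.setMiniClusterMode(true);
  }
}
{code}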
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.19.patch Updated: * Changed to enable the RetryCache by default. I think having a configuration option is better than always using the RetryCache because of memory consumption. * Set DEFAULT_RM_RETRY_CACHE_EXPIRY_MS to 10 * 60 * 1000 instead of 60. * Changed TestApplicationMasterServiceRetryCache to keep lines shorter than 80 chars. * Fixed some tests to call DefaultMetricsSystem.setMiniClusterMode(true). * Renamed TestApplicationMasterService#*WithRetryCache to *WithoutRetryCache and changed the tests to set RM_RETRY_CACHE_ENABLED to false. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
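For illustration only, a sketch of how a test might disable the proposed retry cache. The literal property keys are assumptions derived from the constant names above and from the key shown in a later comment (RM_PREFIX + "retry-cache.*"); they belong to this patch, not to released YarnConfiguration.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative sketch; property keys are assumptions based on the patch.
public class RetryCacheConfigSketch {

  public static YarnConfiguration withoutRetryCache() {
    YarnConfiguration conf = new YarnConfiguration();
    // Assumed key for RM_RETRY_CACHE_ENABLED; the patch enables it by default.
    conf.setBoolean("yarn.resourcemanager.retry-cache.enabled", false);
    // Assumed key for RM_RETRY_CACHE_EXPIRY_MS; patch default is 10 minutes.
    conf.setLong("yarn.resourcemanager.retry-cache.expiry-ms", 10 * 60 * 1000L);
    return conf;
  }
}
{code}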
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.20.patch Fixed TestAMRMClientOnRMRestart. The failure of TestAMRestart looks unrelated. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.21.patch Thank you for the review, Karthik. Updated the patch: * Rebased the patch on trunk. * Updated yarn-default.xml. * Fixed the value of RM_RETRY_CACHE_EXPIRY_MS from v20 as follows:
{code}
-  public static final String RM_RETRY_CACHE_EXPIRY_MS =
-      RM_PREFIX + ".retry-cache" + ".expiry-ms";
+  public static final String RM_RETRY_CACHE_EXPIRY_MS =
+      RM_PREFIX + "retry-cache.expiry-ms";
{code}
Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158948#comment-14158948 ] Tsuyoshi OZAWA commented on YARN-1879: -- The failure of TestRMWebServicesDelegationTokens looks unrelated and intermittent - the test passed on my local machine. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.4.patch Refreshed the patch. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: (was: YARN-2562.5.patch) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: YARN-2562.5.patch ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: YARN-2562.5-2.patch ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5-2.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
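As an illustration of the readability problem described in YARN-2562, here is a sketch of a formatter that keeps the familiar container-id shape and only adds an epoch marker once the RM has restarted. It mirrors the direction of the attached patches but is not the committed {{ContainerId#toString}} code; the field widths and the "e<epoch>_" marker are assumptions for illustration.
{code}
// Illustrative sketch only -- NOT the ContainerId#toString implementation.
public class ContainerIdToStringSketch {

  static String format(long clusterTimestamp, int appId, int attemptId,
      long epoch, long sequenceNumber) {
    StringBuilder sb = new StringBuilder("container_");
    if (epoch > 0) {
      // Epoch marker only appears after an RM restart, so epoch-0 ids keep
      // their pre-YARN-2182 look.
      sb.append('e').append(epoch).append('_');
    }
    sb.append(clusterTimestamp).append('_')
      .append(String.format("%04d", appId)).append('_')
      .append(String.format("%02d", attemptId)).append('_')
      .append(String.format("%06d", sequenceNumber));
    return sb.toString();
  }

  public static void main(String[] args) {
    // e.g. container_1410901177871_0001_01_000005      (epoch 0)
    //      container_e17_1410901177871_0001_01_000005  (epoch 17)
    System.out.println(format(1410901177871L, 1, 1, 0, 5));
    System.out.println(format(1410901177871L, 1, 1, 17, 5));
  }
}
{code}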