[jira] [Resolved] (YARN-6586) YARN to facilitate HTTPS in AM web server
[ https://issues.apache.org/jira/browse/YARN-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter resolved YARN-6586.
---------------------------------
    Resolution: Fixed
 Fix Version/s: 3.3.0

All subtasks are now complete. Thanks for the reviews, especially [~haibochen], and for the help with the dependency issues, especially [~eyang].

> YARN to facilitate HTTPS in AM web server
> -----------------------------------------
>
>                 Key: YARN-6586
>                 URL: https://issues.apache.org/jira/browse/YARN-6586
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Haibo Chen
>            Assignee: Robert Kanter
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: Design Document v1.pdf, Design Document v2.pdf, YARN-6586.poc.patch
>
>
> The MR AM today does not support HTTPS in its web server, so the traffic between the RMWebProxy and the MR AM is in clear text.
> MR cannot easily achieve this mainly because MR AMs are untrusted by YARN. A potential solution purely within MR, similar to what Spark has implemented, is to allow users to provide their own keystore file when they enable HTTPS for an MR job; the file is then uploaded to the distributed cache and localized for the MR AM container. The configuration users need to do for this is complex.
> More importantly, in typical deployments web browsers go through the RMWebProxy to indirectly access the MR AM web server. To support MR AM HTTPS, the RMWebProxy would therefore need to trust the user-provided keystore, which is problematic.
> Alternatively, we can add an endpoint in the NM web server that acts as a proxy between the AM web server and the RMWebProxy. The RMWebProxy, when configured to do so, will send requests over HTTPS to the NM on which the AM is running, and the NM can then communicate with the local AM web server over HTTP. This adds one hop between the RMWebProxy and the AM, but both MR and Spark can use such a solution.
[jira] [Created] (YARN-8922) Fix test-container-executor
Robert Kanter created YARN-8922: --- Summary: Fix test-container-executor Key: YARN-8922 URL: https://issues.apache.org/jira/browse/YARN-8922 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.3.0 Reporter: Robert Kanter Assignee: Robert Kanter YARN-8448 attempted to fix the {{test-container-executor}} C test to be able to run as root. The test claims that it should be possible to run as root; in fact, there are some tests that only run if you use root. One of the fixes was to change the permissions of the test's config dir to 0777 from 0755. The problem was that the directory was owned by root, but then other users would need to write files/directories under it, which would fail with 0755. YARN-8448 fixed this by making it 0777. However, this breaks running cetest because it expects the directory to be 0755, and it's run afterwards. The proper fix for all this is to leave the directory at 0755, but to make sure it's owned by the "nodemanager" user. Confusingly, in {{test-container-executor}}, that appears to be the {{username}} and not the {{yarn_username}} (i.e. {{username}} is the user running the NM while {{yarn_username}} is just some user running a Yarn app). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8857) Upgrade BouncyCastle
Robert Kanter created YARN-8857: --- Summary: Upgrade BouncyCastle Key: YARN-8857 URL: https://issues.apache.org/jira/browse/YARN-8857 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.2.0 Reporter: Robert Kanter Assignee: Robert Kanter

As part of my work on YARN-6586, I noticed that we're using a very old version of BouncyCastle:
{code:xml}
<dependency>
  <groupId>org.bouncycastle</groupId>
  <artifactId>bcprov-jdk16</artifactId>
  <version>1.46</version>
  <scope>test</scope>
</dependency>
{code}
The *-jdk16 artifacts have been discontinued and are not recommended (see [http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html]). In particular, the newest *-jdk16 release, 1.46, is from 2011! [https://mvnrepository.com/artifact/org.bouncycastle/bcprov-jdk16] The currently maintained and recommended artifacts are *-jdk15on: [https://www.bouncycastle.org/latest_releases.html] They're currently on version 1.60, released only a few months ago. We should update BouncyCastle to the *-jdk15on artifacts and the 1.60 release. It's currently a test-only artifact, so there should be no backwards-compatibility issues with updating this. It's also needed for YARN-6586, where we'll actually be shipping it.
[jira] [Created] (YARN-8582) Documentation for AM HTTPS Support
Robert Kanter created YARN-8582: --- Summary: Documentation for AM HTTPS Support Key: YARN-8582 URL: https://issues.apache.org/jira/browse/YARN-8582 Project: Hadoop YARN Issue Type: Sub-task Components: docs Reporter: Robert Kanter Assignee: Robert Kanter Documentation for YARN-6586. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8449) RM HA for AM HTTPS Support
Robert Kanter created YARN-8449: --- Summary: RM HA for AM HTTPS Support Key: YARN-8449 URL: https://issues.apache.org/jira/browse/YARN-8449 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Kanter Assignee: Robert Kanter -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8448) AM HTTPS Support
Robert Kanter created YARN-8448: --- Summary: AM HTTPS Support Key: YARN-8448 URL: https://issues.apache.org/jira/browse/YARN-8448 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Kanter Assignee: Robert Kanter -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats
Robert Kanter created YARN-8310: --- Summary: Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats Key: YARN-8310 URL: https://issues.apache.org/jira/browse/YARN-8310 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter In some recent upgrade testing, we saw this error causing the NodeManager to fail to startup afterwards: {noformat} org.apache.hadoop.service.ServiceStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895) Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89) at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108) at org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860) at org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824) at org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016) at org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011) at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) at org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686) at org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254) at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177) at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) ... 5 more {noformat} The NodeManager fails because it's trying to read a {{ContainerTokenIdentifier}} in the "old" format before we changed them to protobufs (YARN-668). This is very similar to YARN-5594 where we ran into a similar problem with the ResourceManager and RM Delegation Tokens. 
To provide a better experience, we should make the code able to read the old format if it's unable to read it using the new format. We didn't run into any errors with the other two types of tokens that YARN-668 incompatibly changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix those while we're at it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
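A rough sketch of the fallback idea (the shape is assumed, not the committed patch; {{readFieldsInOldFormat}} stands in for a hypothetical reader of the pre-protobuf layout):
{code:java}
// Try the current protobuf layout first; only if parsing fails, re-read the
// same bytes using the old pre-YARN-668 Writable-style layout.
void decodeIdentifier(byte[] identifierBytes) throws IOException {
  try {
    proto = YarnSecurityTokenProtos.ContainerTokenIdentifierProto
        .parseFrom(identifierBytes);
  } catch (InvalidProtocolBufferException e) {
    // The token was persisted before the upgrade: fall back to the old format.
    readFieldsInOldFormat(
        new DataInputStream(new ByteArrayInputStream(identifierBytes)));
  }
}
{code}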
[jira] [Created] (YARN-8051) TestRMEmbeddedElector#testCallbackSynchronization is flakey
Robert Kanter created YARN-8051: --- Summary: TestRMEmbeddedElector#testCallbackSynchronization is flakey Key: YARN-8051 URL: https://issues.apache.org/jira/browse/YARN-8051 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 3.2.0 Reporter: Robert Kanter Assignee: Robert Kanter We've seen some rare flakey failures in {{TestRMEmbeddedElector#testCallbackSynchronization}}: {noformat} org.mockito.exceptions.verification.WantedButNotInvoked: Wanted but not invoked: adminService.transitionToStandby(); -> at org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215) Actually, there were zero interactions with this mock. at org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215) at org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:146) at org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:109) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7645) TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers is flakey with FairScheduler
Robert Kanter created YARN-7645: --- Summary: TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers is flakey with FairScheduler Key: YARN-7645 URL: https://issues.apache.org/jira/browse/YARN-7645 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Reporter: Robert Kanter Assignee: Robert Kanter We've noticed some flakiness in {{TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers}} when using {{FairScheduler}}: {noformat} java.lang.AssertionError: Attempt state is not correct (timeout). expected: but was: at org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.amRestartTests(TestContainerResourceUsage.java:275) at org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers(TestContainerResourceUsage.java:254) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7458) TestContainerManagerSecurity is still flakey
Robert Kanter created YARN-7458: --- Summary: TestContainerManagerSecurity is still flakey Key: YARN-7458 URL: https://issues.apache.org/jira/browse/YARN-7458 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0-beta1, 2.9.0 Reporter: Robert Kanter Assignee: Robert Kanter YARN-6150 made this less flakey, but we're still seeing an occasional issue here: {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:420) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:356) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:167) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7389) Make TestResourceManager Scheduler agnostic
Robert Kanter created YARN-7389: --- Summary: Make TestResourceManager Scheduler agnostic Key: YARN-7389 URL: https://issues.apache.org/jira/browse/YARN-7389 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.9.0, 3.0.0 Reporter: Robert Kanter Assignee: Robert Kanter Many of the tests in {{TestResourceManager}} override the scheduler to always be {{CapacityScheduler}}. However, these tests should be made scheduler agnostic (they are testing the RM, not the scheduler). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7385) TestFairScheduler#testUpdateDemand and TestFSLeafQueue#testUpdateDemand are failing with NPE
Robert Kanter created YARN-7385: --- Summary: TestFairScheduler#testUpdateDemand and TestFSLeafQueue#testUpdateDemand are failing with NPE Key: YARN-7385 URL: https://issues.apache.org/jira/browse/YARN-7385 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.9.0, 3.0.0 Reporter: Robert Kanter Assignee: Robert Kanter {{TestFairScheduler#testUpdateDemand}} and {{TestFSLeafQueue#testUpdateDemand}} are failing with NPE: {noformat} java.lang.NullPointerException: null at org.apache.hadoop.yarn.util.resource.Resources.addTo(Resources.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.incUsedResource(FSQueue.java:494) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.addApp(FSLeafQueue.java:92) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testUpdateDemand(TestFairScheduler.java:5264) Standard Output84 ms {noformat} {noformat} java.lang.NullPointerException: null at org.apache.hadoop.yarn.util.resource.Resources.addTo(Resources.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.incUsedResource(FSQueue.java:494) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.addApp(FSLeafQueue.java:92) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSLeafQueue.testUpdateDemand(TestFSLeafQueue.java:92) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7382) NoSuchElementException in FairScheduler after failover causes RM crash
Robert Kanter created YARN-7382: --- Summary: NoSuchElementException in FairScheduler after failover causes RM crash Key: YARN-7382 URL: https://issues.apache.org/jira/browse/YARN-7382 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.9.0, 3.0.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker While running an MR job (e.g. sleep) and an RM failover occurs, once the maps gets to 100%, the now active RM will crash due to: {noformat} 2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1508361403235_0001_01_02 Container Transitioned from RUNNING to COMPLETED 2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1508361403235_0001 CONTAINERID=container_1508361403235_0001_01_02 RESOURCE= 2017-10-18 15:02:05,349 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:371) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:901) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1326) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:371) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1019) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:887) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1104) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:128) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748) 2017-10-18 15:02:05,360 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye.. {noformat} This leaves the cluster with no RMs! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7341) TestRouterWebServiceUtil#testMergeMetrics is flakey
Robert Kanter created YARN-7341: --- Summary: TestRouterWebServiceUtil#testMergeMetrics is flakey Key: YARN-7341 URL: https://issues.apache.org/jira/browse/YARN-7341 Project: Hadoop YARN Issue Type: Bug Components: federation Affects Versions: 3.0.0-beta1, 2.9.0 Reporter: Robert Kanter Assignee: Robert Kanter {{TestRouterWebServiceUtil#testMergeMetrics}} is flakey. It sometimes fails with something like: {noformat} Running org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil testMergeMetrics(org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil) Time elapsed: 0.005 sec <<< FAILURE! java.lang.AssertionError: expected:<1092> but was:<584> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil.testMergeMetrics(TestRouterWebServiceUtil.java:473) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7310) TestAMRMProxy#testAMRMProxyE2E fails with FairScheduler
Robert Kanter created YARN-7310: --- Summary: TestAMRMProxy#testAMRMProxyE2E fails with FairScheduler Key: YARN-7310 URL: https://issues.apache.org/jira/browse/YARN-7310 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Robert Kanter Assignee: Robert Kanter {{TestAMRMProxy#testAMRMProxyE2E}} fails with FairScheduler: {noformat} [ERROR] testAMRMProxyE2E(org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy) Time elapsed: 29.047 s <<< FAILURE! java.lang.AssertionError: expected:<2> but was:<1> at org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.testAMRMProxyE2E(TestAMRMProxy.java:124) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7309) TestClientRMService#testUpdateApplicationPriorityRequest and TestClientRMService#testUpdatePriorityAndKillAppWithZeroClusterResource test functionality not supported by FairScheduler
Robert Kanter created YARN-7309: --- Summary: TestClientRMService#testUpdateApplicationPriorityRequest and TestClientRMService#testUpdatePriorityAndKillAppWithZeroClusterResource test functionality not supported by FairScheduler Key: YARN-7309 URL: https://issues.apache.org/jira/browse/YARN-7309 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Robert Kanter Assignee: Robert Kanter {{TestClientRMService#testUpdateApplicationPriorityRequest}} and {{TestClientRMService#testUpdatePriorityAndKillAppWithZeroClusterResource}} test functionality (i.e. Application Priorities) not supported by FairScheduler. We should skip these two tests when using FairScheduler or they'll fail. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
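One lightweight way to do that (illustrative only; {{rm}} is assumed to be the test's {{MockRM}} instance) is a JUnit assumption at the top of each affected test, which marks the test as skipped rather than failed:
{code:java}
// Skip the application-priority tests when the test run uses FairScheduler,
// which doesn't support priorities.
Assume.assumeFalse("Application priorities are not supported by FairScheduler",
    rm.getResourceScheduler() instanceof FairScheduler);
{code}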
[jira] [Created] (YARN-7308) TestApplicationACLs fails with FairScheduler
Robert Kanter created YARN-7308: --- Summary: TestApplicationACLs fails with FairScheduler Key: YARN-7308 URL: https://issues.apache.org/jira/browse/YARN-7308 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Robert Kanter Assignee: Robert Kanter {{TestApplicationACLs}} fails when using FairScheduler: {noformat} Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 98.389 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs testApplicationACLs(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs) Time elapsed: 94.563 sec <<< FAILURE! java.lang.AssertionError: App State is not correct (timeout). expected: but was: at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:284) at org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs.verifyInvalidQueueWithAcl(TestApplicationACLs.java:422) at org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs.testApplicationACLs(TestApplicationACLs.java:186) {noformat} There's a bunch of messages like this in the output: {noformat} 2017-10-09 17:00:54,572 INFO [main] resourcemanager.MockRM (MockRM.java:waitForState(277)) - App : application_1507593559080_0006 State is : ACCEPTED Waiting for state : FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7138) Fix incompatible API change for YarnScheduler involved by YARN-5521
[ https://issues.apache.org/jira/browse/YARN-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-7138. - Resolution: Won't Fix Ok. I've created YARN-7301. > Fix incompatible API change for YarnScheduler involved by YARN-5521 > --- > > Key: YARN-7138 > URL: https://issues.apache.org/jira/browse/YARN-7138 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Junping Du >Priority: Critical > > From JACC report for 2.8.2 against 2.7.4, it indicates that we have > incompatible changes happen in YarnScheduler: > {noformat} > hadoop-yarn-server-resourcemanager-2.7.4.jar, YarnScheduler.class > package org.apache.hadoop.yarn.server.resourcemanager.scheduler > YarnScheduler.allocate ( ApplicationAttemptId p1, List p2, > List p3, List p4, List p5 ) [abstract] : > Allocation > {noformat} > The root cause is YARN-5221. We should change it back or workaround this by > adding back original API (mark as deprecated if not used any more). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7301) Create stable Scheduler API
Robert Kanter created YARN-7301: --- Summary: Create stable Scheduler API Key: YARN-7301 URL: https://issues.apache.org/jira/browse/YARN-7301 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Robert Kanter

Currently, it's not practical for a user to create their own scheduler. Besides it being a large undertaking, the API is a mess. A few of the problems:
# We make incompatible changes to {{YarnScheduler}} sometimes (see YARN-7138).
# Many methods in {{YarnScheduler}} are marked as {{\@Public}} {{\@Stable}}, but the class itself has no annotations, which defaults to {{\@Private}}.
# We often cast a {{YarnScheduler}} to an {{AbstractYarnScheduler}}, which means that custom schedulers must also subclass {{AbstractYarnScheduler}} or they'll get a {{ClassCastException}}. However, {{AbstractYarnScheduler}} is {{\@Private}} {{\@Unstable}}.

It could be useful to provide a proper usable API for custom schedulers.
[jira] [Created] (YARN-7262) Add a hierarchy into the ZKRMStateStore for delegation token znodes to prevent jute buffer overflow
Robert Kanter created YARN-7262: --- Summary: Add a hierarchy into the ZKRMStateStore for delegation token znodes to prevent jute buffer overflow Key: YARN-7262 URL: https://issues.apache.org/jira/browse/YARN-7262 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter

We've seen users run into a problem where the RM is storing so many delegation tokens in the {{ZKRMStateStore}} that the _listing_ of those znodes is larger than the jute buffer limit. This is fine during normal operation, but becomes a problem on a failover, because the RM will try to read in all of the token znodes (i.e. call {{getChildren}} on the parent znode). This is particularly bad because everything appears to be okay, but then if a failover occurs you end up with no active RMs.

There was a similar problem with the YARN application data that was fixed in YARN-2962 by adding a (configurable) hierarchy of znodes so the RM could pull subchildren without overflowing the jute buffer (though it's off by default). We should add a hierarchy similar to that of YARN-2962, but for the delegation token znodes.
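A minimal sketch of the idea, mirroring the YARN-2962 approach for application znodes (the method name, znode prefix, and split rule are illustrative, not the actual patch):
{code:java}
// Map a delegation token's sequence number to a two-level znode path so that
// no single parent znode accumulates a child listing larger than the jute
// buffer. With splitIndex = 4, sequence number 123456789 would be stored at
// <root>/RMDelegationToken_12345/6789 instead of in one flat child list.
static String tokenZNodePath(String rootPath, long sequenceNumber, int splitIndex) {
  String seq = String.valueOf(sequenceNumber);
  if (splitIndex <= 0 || seq.length() <= splitIndex) {
    return rootPath + "/RMDelegationToken_" + seq;           // flat layout (today)
  }
  int split = seq.length() - splitIndex;
  return rootPath + "/RMDelegationToken_" + seq.substring(0, split)
      + "/" + seq.substring(split);                          // hierarchical layout
}
{code}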
[jira] [Created] (YARN-7162) Remove XML excludes file format
Robert Kanter created YARN-7162: --- Summary: Remove XML excludes file format Key: YARN-7162 URL: https://issues.apache.org/jira/browse/YARN-7162 Project: Hadoop YARN Issue Type: Bug Components: graceful Affects Versions: 2.9.0, 3.0.0-beta1 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker YARN-5536 aims to replace the XML format for the excludes file with a JSON format. However, it looks like we won't have time for that for Hadoop 3 Beta 1. The concern is that if we release it as-is, we'll now have to support the XML format as-is for all of Hadoop 3.x, which we're either planning on removing, or rewriting using a pluggable framework. [This comment in YARN-5536|https://issues.apache.org/jira/browse/YARN-5536?focusedCommentId=16126194&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16126194] proposed two quick solutions to prevent this compat issue. In this JIRA, we're going to remove the XML format. If we later want to add it back in, YARN-5536 can add it back, rewriting it to be in the pluggable framework. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7146) Many RM unit tests failing with FairScheduler
Robert Kanter created YARN-7146: --- Summary: Many RM unit tests failing with FairScheduler Key: YARN-7146 URL: https://issues.apache.org/jira/browse/YARN-7146 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0-beta1 Reporter: Robert Kanter Assignee: Robert Kanter

Many of the RM unit tests are failing when using the FairScheduler. Here is a list of affected test classes:
{noformat}
TestYarnClient
TestApplicationCleanup
TestApplicationMasterLauncher
TestDecommissioningNodesWatcher
TestKillApplicationWithRMHA
TestNodeBlacklistingOnAMFailures
TestRM
TestRMAdminService
TestRMRestart
TestResourceTrackerService
TestWorkPreservingRMRestart
TestAMRMRPCNodeUpdates
TestAMRMRPCResponseId
TestAMRestart
TestApplicationLifetimeMonitor
TestNodesListManager
TestRMContainerImpl
TestAbstractYarnScheduler
TestSchedulerUtils
TestFairOrderingPolicy
TestAMRMTokens
TestDelegationTokenRenewer
{noformat}
Most of the test methods in these classes are failing, though some do succeed. There are two main categories of issues:
# The test submits an application to the {{MockRM}} and waits for it to enter a specific state, which it never does, and the test times out. We need to call {{update()}} on the scheduler.
# The test throws a {{ClassCastException}} casting {{FSQueueMetrics}} to {{CSQueueMetrics}}. This is because {{QueueMetrics}} metrics are static, and a previous test using FairScheduler initialized it while the current test is using CapacityScheduler. We need to reset the metrics.
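A rough sketch of the two kinds of fixes, assuming {{FairScheduler#update()}} and {{QueueMetrics#clearQueueMetrics()}} are accessible from the tests (an outline of the approach, not the final patch):
{code:java}
public abstract class SchedulerAgnosticTestBase {

  // (2) QueueMetrics is held in static state, so clear it between tests;
  //     otherwise a CapacityScheduler test can see the FSQueueMetrics left
  //     behind by an earlier FairScheduler test and hit a ClassCastException.
  @Before
  public void resetQueueMetrics() {
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }

  // (1) FairScheduler only recomputes its internal state on the update
  //     thread, so force an update before waiting on an app state change.
  protected void waitForAccepted(MockRM rm, RMApp app) throws Exception {
    ResourceScheduler scheduler = rm.getResourceScheduler();
    if (scheduler instanceof FairScheduler) {
      ((FairScheduler) scheduler).update();
    }
    rm.waitForState(app.getApplicationId(), RMAppState.ACCEPTED);
  }
}
{code}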
[jira] [Created] (YARN-7094) Document that server-side graceful decom is currently not recommended
Robert Kanter created YARN-7094: --- Summary: Document that server-side graceful decom is currently not recommended Key: YARN-7094 URL: https://issues.apache.org/jira/browse/YARN-7094 Project: Hadoop YARN Issue Type: Sub-task Components: graceful Affects Versions: 3.0.0-beta1 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Server-side NM graceful decom currently does not work correctly when an RM failover occurs because we don't persist the info in the state store (see YARN-5464). Given time constraints for Hadoop 3 beta 1, we've decided to document this limitation and recommend client-side NM graceful decom in the meantime if you need this functionality (see [this comment|https://issues.apache.org/jira/browse/YARN-5464?focusedCommentId=16126119&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16126119]). Once YARN-5464 is done, we can undo this doc change. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7020) TestAMRMProxy#testAMRMProxyTokenRenewal is flakey
Robert Kanter created YARN-7020: --- Summary: TestAMRMProxy#testAMRMProxyTokenRenewal is flakey Key: YARN-7020 URL: https://issues.apache.org/jira/browse/YARN-7020 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-beta1 Reporter: Robert Kanter Assignee: Robert Kanter {{TestAMRMProxy#testAMRMProxyTokenRenewal}} is flakey. It infrequently fails with: {noformat} testAMRMProxyTokenRenewal(org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy) Time elapsed: 19.036 sec <<< ERROR! org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1502837054903_0001_01 doesn't exist in ApplicationMasterService cache. at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:355) at org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224) at org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135) at org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:279) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1490) at org.apache.hadoop.ipc.Client.call(Client.java:1436) at org.apache.hadoop.ipc.Client.call(Client.java:1346) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy90.allocate(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy91.allocate(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.testAMRMProxyTokenRenewal(TestAMRMProxy.java:190) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: 
yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6993) getResourceCalculatorPlugin for the default should not intercept throwable
Robert Kanter created YARN-6993: --- Summary: getResourceCalculatorPlugin for the default should not intercept throwable Key: YARN-6993 URL: https://issues.apache.org/jira/browse/YARN-6993 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha1, 2.8.0 Reporter: Robert Kanter

YARN-3917 made it so that when {{getResourceCalculatorPlugin}} tries to load the default calculator and something bad happens, it catches {{Throwable}} and simply logs a warning. This should be {{Exception}} - we don't want to eat things like an {{OutOfMemoryError}}.
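The change itself is small; an illustrative before/after of the default-plugin path (sketch only, not the exact code; {{LOG}} is the class logger):
{code:java}
try {
  return new ResourceCalculatorPlugin();
} catch (Exception e) {   // previously: catch (Throwable t)
  // Log and fall back to "no plugin", but let Errors such as OutOfMemoryError
  // propagate instead of being swallowed here.
  LOG.warn("Failed to instantiate the default ResourceCalculatorPlugin", e);
  return null;
}
{code}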
[jira] [Created] (YARN-6974) Make CuratorBasedElectorService the default
Robert Kanter created YARN-6974: --- Summary: Make CuratorBasedElectorService the default Key: YARN-6974 URL: https://issues.apache.org/jira/browse/YARN-6974 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 3.0.0-beta1 Reporter: Robert Kanter YARN-4438 (and cleanup in YARN-5709) added the {{CuratorBasedElectorService}}, which does leader election via Curator. The intention was to leave it off by default to allow time for it to bake, and eventually make it the default and remove the {{ActiveStandbyElectorBasedElectorService}}. We should do that. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6643) TestRMFailover fails rarely due to port conflict
Robert Kanter created YARN-6643: --- Summary: TestRMFailover fails rarely due to port conflict Key: YARN-6643 URL: https://issues.apache.org/jira/browse/YARN-6643 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.9.0, 3.0.0-alpha3 Reporter: Robert Kanter Assignee: Robert Kanter We've seen various tests in {{TestRMFailover}} fail very rarely with a message like "org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: ResourceManager failed to start. Final state is STOPPED". After some digging, it turns out that it's due to a port conflict with the embedded ZooKeeper in the tests. The embedded ZooKeeper uses {{ServerSocketUtil#getPort}} to choose a free port, but the RMs are configured to 1 + and 2 + (e.g. the default port for the RM is 8032, so you'd use 18032 and 28032). When I was able to reproduce this, I saw that ZooKeeper was using port 18033, which is 1 + 8033, the default RM Admin port. It results in an error like this, causing the RM to be unable to start, and hence the original error message in the test failure: {noformat} 2017-05-24 01:16:52,735 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:171) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:158) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1147) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$2.run(MiniYARNCluster.java:310) Caused by: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:720) at org.apache.hadoop.ipc.Server.bind(Server.java:482) at org.apache.hadoop.ipc.Server$Listener.(Server.java:688) at org.apache.hadoop.ipc.Server.(Server.java:2376) at org.apache.hadoop.ipc.RPC$Server.(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:535) at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 9 more Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.apache.hadoop.ipc.Server.bind(Server.java:465) ... 17 more 2017-05-24 01:16:52,736 DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: ResourceManager entered state STOPPED {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (YARN-6642) TestRMFailover fails rarely due to port conflict
Robert Kanter created YARN-6642: --- Summary: TestRMFailover fails rarely due to port conflict Key: YARN-6642 URL: https://issues.apache.org/jira/browse/YARN-6642 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.9.0, 3.0.0-alpha3 Reporter: Robert Kanter Assignee: Robert Kanter We've seen various tests in {{TestRMFailover}} fail very rarely with a message like "org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: ResourceManager failed to start. Final state is STOPPED". After some digging, it turns out that it's due to a port conflict with the embedded ZooKeeper in the tests. The embedded ZooKeeper uses {{ServerSocketUtil#getPort}} to choose a free port, but the RMs are configured to 1 + and 2 + (e.g. the default port for the RM is 8032, so you'd use 18032 and 28032). When I was able to reproduce this, I saw that ZooKeeper was using port 18033, which is 1 + 8033, the default RM Admin port. It results in an error like this, causing the RM to be unable to start, and hence the original error message in the test failure: {noformat} 2017-05-24 01:16:52,735 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:171) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:158) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1147) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$2.run(MiniYARNCluster.java:310) Caused by: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:720) at org.apache.hadoop.ipc.Server.bind(Server.java:482) at org.apache.hadoop.ipc.Server$Listener.(Server.java:688) at org.apache.hadoop.ipc.Server.(Server.java:2376) at org.apache.hadoop.ipc.RPC$Server.(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:535) at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 9 more Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.apache.hadoop.ipc.Server.bind(Server.java:465) ... 17 more 2017-05-24 01:16:52,736 DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: ResourceManager entered state STOPPED {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (YARN-6602) Impersonation does not work if standby RM is contacted first
Robert Kanter created YARN-6602: --- Summary: Impersonation does not work if standby RM is contacted first Key: YARN-6602 URL: https://issues.apache.org/jira/browse/YARN-6602 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 3.0.0-alpha3 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker

When RM HA is enabled, impersonation does not work correctly if the YARN client connects to the standby RM first. When this happens, the impersonation is "lost" and the client does things as the impersonating user instead of the impersonated user. We saw this with the OOZIE-1770 Oozie on YARN feature. I need to investigate this some more, but it appears to be related to delegation tokens. When this issue occurs, the tokens have the owner as "oozie" instead of the actual user. On a hunch, we found a workaround: explicitly adding a correct RM HA delegation token fixes the problem:
{code:java}
org.apache.hadoop.yarn.api.records.Token token =
    yarnClient.getRMDelegationToken(ClientRMProxy.getRMDelegationTokenService(conf));
org.apache.hadoop.security.token.Token token2 = new org.apache.hadoop.security.token.Token(
    token.getIdentifier().array(), token.getPassword().array(),
    new Text(token.getKind()), new Text(token.getService()));
UserGroupInformation.getCurrentUser().addToken(token2);
{code}
[jira] [Resolved] (YARN-5894) fixed license warning caused by de.ruedigermoeller:fst:jar:2.24
[ https://issues.apache.org/jira/browse/YARN-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-5894. - Resolution: Fixed Fix Version/s: 2.8.1 2.9.0 Sure thing - committed to branch-2, branch-2.8, and branch-2.8.1! > fixed license warning caused by de.ruedigermoeller:fst:jar:2.24 > --- > > Key: YARN-5894 > URL: https://issues.apache.org/jira/browse/YARN-5894 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0-alpha1 >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Blocker > Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3 > > Attachments: YARN-5894.00.patch, YARN-5894.01.patch > > > The artifact de.ruedigermoeller:fst:jar:2.24, that ApplicationHistoryService > depends on, shows its license being LGPL 2.1 in our license checking. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6527) Provide a better out-of-the-box experience for SLS
Robert Kanter created YARN-6527: --- Summary: Provide a better out-of-the-box experience for SLS Key: YARN-6527 URL: https://issues.apache.org/jira/browse/YARN-6527 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler-load-simulator Affects Versions: 3.0.0-alpha3 Reporter: Robert Kanter The example provided with SLS appears to be broken - I didn't see any jobs running. On top of that, it seems like getting SLS to run properly requires a lot of hadoop site configs, scheduler configs, etc. I was only able to get something running after [~yufeigu] provided a lot of config files. We should provide a better out-of-the-box experience for SLS. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6359) TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition
Robert Kanter created YARN-6359: --- Summary: TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition Key: YARN-6359 URL: https://issues.apache.org/jira/browse/YARN-6359 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.9.0, 3.0.0-alpha3 Reporter: Robert Kanter Assignee: Robert Kanter We've seen (very rarely) a test failure in {{TestRM#testApplicationKillAtAcceptedState}} {noformat} java.lang.AssertionError: expected:<1> but was:<0> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testApplicationKillAtAcceptedState(TestRM.java:645) {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6356) Allow different values of yarn.log-aggregation.retain-seconds for succeeded and failed jobs
Robert Kanter created YARN-6356: --- Summary: Allow different values of yarn.log-aggregation.retain-seconds for succeeded and failed jobs Key: YARN-6356 URL: https://issues.apache.org/jira/browse/YARN-6356 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Reporter: Robert Kanter It would be useful to have a value of {{yarn.log-aggregation.retain-seconds}} for succeeded jobs and a different value for failed/killed jobs. For jobs that succeeded, you typically don't care about the logs, so a shorter retention time is fine (and saves space/blocks in HDFS). For jobs that failed or were killed, the logs are much more important, and it's likely to want to keep them around for longer so you have time to look at them. For instance, you could set it to keep logs for succeeded jobs for 1 day and logs for failed/killed jobs for 1 week. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6051) Create CS test for YARN-6050
Robert Kanter created YARN-6051: --- Summary: Create CS test for YARN-6050 Key: YARN-6051 URL: https://issues.apache.org/jira/browse/YARN-6051 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Kanter Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6050) AMs can't be scheduled on racks or nodes
Robert Kanter created YARN-6050: --- Summary: AMs can't be scheduled on racks or nodes Key: YARN-6050 URL: https://issues.apache.org/jira/browse/YARN-6050 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.9.0, 3.0.0-alpha2 Reporter: Robert Kanter Assignee: Robert Kanter

YARN itself supports rack/node aware scheduling for AMs; however, there are currently two problems:
# To specify hard or soft rack/node requests, you have to specify more than one {{ResourceRequest}}. For example, if you want to schedule an AM only on "rackA", you have to create two {{ResourceRequest}}s, like this:
{code}
ResourceRequest.newInstance(PRIORITY, ANY, CAPABILITY, NUM_CONTAINERS, false);
ResourceRequest.newInstance(PRIORITY, "rackA", CAPABILITY, NUM_CONTAINERS, true);
{code}
The problem is that the YARN API doesn't actually allow you to specify more than one {{ResourceRequest}} in the {{ApplicationSubmissionContext}}. The current behavior is to build one either from {{getResource}} or directly from {{getAMContainerResourceRequest}}, depending on whether {{getAMContainerResourceRequest}} is null or not. We'll need to add a third method, say {{getAMContainerResourceRequests}}, which works with a list of {{ResourceRequest}}s so that clients can specify multiple resource requests (see the sketch below).
# There are some places where things are hardcoded to overwrite what the client specifies. These are pretty straightforward to fix.
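Following up on the first point, here is a sketch of how the plural API could look from the client side. {{appContext}} stands for the {{ApplicationSubmissionContext}} being built, and {{setAMContainerResourceRequests}} is the proposed plural setter, not an existing method:
{code:java}
// Hard rack placement for the AM: the ANY request with relaxLocality=false
// plus the rack-level request, handed to the submission context as a list.
List<ResourceRequest> amRequests = Arrays.asList(
    ResourceRequest.newInstance(PRIORITY, ANY, CAPABILITY, NUM_CONTAINERS, false),
    ResourceRequest.newInstance(PRIORITY, "rackA", CAPABILITY, NUM_CONTAINERS, true));
appContext.setAMContainerResourceRequests(amRequests);
{code}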
[jira] [Created] (YARN-5837) NPE when getting node status of a decommissioned node after an RM restart
Robert Kanter created YARN-5837: --- Summary: NPE when getting node status of a decommissioned node after an RM restart Key: YARN-5837 URL: https://issues.apache.org/jira/browse/YARN-5837 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha1, 2.7.3 Reporter: Robert Kanter Assignee: Robert Kanter If you decommission a node, the {{yarn node}} command shows it like this: {noformat} >> bin/yarn node -list -all 2016-11-04 08:54:37,169 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032 Total Nodes:1 Node-Id Node-State Node-Http-Address Number-of-Running-Containers 192.168.1.69:57560 DECOMMISSIONED 192.168.1.69:8042 0 {noformat} And a full report like this: {noformat} >> bin/yarn node -status 192.168.1.69:57560 2016-11-04 08:55:08,928 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032 Node Report : Node-Id : 192.168.1.69:57560 Rack : /default-rack Node-State : DECOMMISSIONED Node-Http-Address : 192.168.1.69:8042 Last-Health-Update : Fri 04/Nov/16 08:53:58:802PDT Health-Report : Containers : 0 Memory-Used : 0MB Memory-Capacity : 8192MB CPU-Used : 0 vcores CPU-Capacity : 8 vcores Node-Labels : Resource Utilization by Node : Resource Utilization by Containers : PMem:0 MB, VMem:0 MB, VCores:0.0 {noformat} If you then restart the ResourceManager, you get this report: {noformat} >> bin/yarn node -list -all 2016-11-04 08:57:18,512 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032 Total Nodes:4 Node-Id Node-State Node-Http-Address Number-of-Running-Containers 192.168.1.69:-1 DECOMMISSIONED 192.168.1.69:-1 0 {noformat} And when you try to get the full report on the now "-1" node, you get an NPE: {noformat} >> bin/yarn node -status 192.168.1.69:-1 2016-11-04 08:57:57,385 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032 Exception in thread "main" java.lang.NullPointerException at org.apache.hadoop.yarn.client.cli.NodeCLI.printNodeStatus(NodeCLI.java:296) at org.apache.hadoop.yarn.client.cli.NodeCLI.run(NodeCLI.java:116) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) at org.apache.hadoop.yarn.client.cli.NodeCLI.main(NodeCLI.java:63) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
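Until the node's state is reconstructed properly on recovery, a defensive stopgap in the CLI would be to null-check the report fields that a restarted RM no longer has; this is an illustrative sketch only, and the real fix may instead repopulate the node entry during recovery:
{code:java}
// Guard fields that can be null for a node the RM only knows about from its
// exclude list after a restart (e.g. the capability/resource report).
Resource capability = nodeReport.getCapability();
String memory = (capability == null) ? "unknown" : capability.getMemory() + " MB";
{code}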
[jira] [Resolved] (YARN-5750) YARN-4126 broke Oozie on unsecure cluster
[ https://issues.apache.org/jira/browse/YARN-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-5750. - Resolution: Duplicate YARN-4126 has been reverted from branch-2 and 2.8. It's now only in 3, where it's okay to break this. > YARN-4126 broke Oozie on unsecure cluster > - > > Key: YARN-5750 > URL: https://issues.apache.org/jira/browse/YARN-5750 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: Peter Cseh > > Oozie is using a DummyRenewer on unsecure clusters and can't submit workflows > on an unsecure cluster after YARN-4126. > {noformat} > org.apache.oozie.action.ActionExecutorException: JA009: > org.apache.hadoop.yarn.exceptions.YarnException: java.io.IOException: > Delegation Token can be issued only with kerberos authentication > at > org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1092) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:335) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:515) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:663) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2423) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2419) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1790) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2419) > Caused by: java.io.IOException: Delegation Token can be issued only with > kerberos authentication > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1065) > ... 
10 more > at > org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:457) > at > org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:437) > at > org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1128) > at > org.apache.oozie.action.hadoop.TestJavaActionExecutor.submitAction(TestJavaActionExecutor.java:343) > at > org.apache.oozie.action.hadoop.TestJavaActionExecutor.submitAction(TestJavaActionExecutor.java:363) > at > org.apache.oozie.action.hadoop.TestJavaActionExecutor.testKill(TestJavaActionExecutor.java:602) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at junit.framework.TestCase.runTest(TestCase.java:168) > at junit.framework.TestCase.runBare(TestCase.java:134) > at junit.framework.TestResult$1.protect(TestResult.java:110) > at junit.framework.TestResult.runProtected(TestResult.java:128) > at junit.framework.TestResult.run(TestResult.java:113) > at junit.framework.TestCase.run(TestCase.java:124) > at junit.framework.TestSuite.runTest(TestSuite.java:232) > at junit.framework.TestSuite.run(TestSuite.java:227) > at > org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:24) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: > org.apache.hadoop.yarn.exceptions.YarnException: java.io.IOException: > Delegation Token can be issued only with kerberos authentication > at > org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1092) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientPro
[jira] [Resolved] (YARN-3220) Create a Service in the RM to concatenate aggregated logs
[ https://issues.apache.org/jira/browse/YARN-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-3220. - Resolution: Won't Fix Closing this as "Won't Fix" given we have MAPREDUCE-6415. > Create a Service in the RM to concatenate aggregated logs > - > > Key: YARN-3220 > URL: https://issues.apache.org/jira/browse/YARN-3220 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > > Create an {{RMAggregatedLogsConcatenationService}} in the RM that will > concatenate the aggregated log files written by the NM (which are in the new > {{ConcatableAggregatedLogFormat}} format) when an application finishes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-3728) Add an rmadmin command to compact concatenated aggregated logs
[ https://issues.apache.org/jira/browse/YARN-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-3728. - Resolution: Won't Fix Closing this as "Won't Fix" given we have MAPREDUCE-6415. > Add an rmadmin command to compact concatenated aggregated logs > -- > > Key: YARN-3728 > URL: https://issues.apache.org/jira/browse/YARN-3728 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > > Create an {{rmadmin}} command to compact any concatenated aggregated log > files it finds in the aggregated logs directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-3729) Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat
[ https://issues.apache.org/jira/browse/YARN-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-3729. - Resolution: Won't Fix Closing this as "Won't Fix" given we have MAPREDUCE-6415. > Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat > > > Key: YARN-3729 > URL: https://issues.apache.org/jira/browse/YARN-3729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > > When serving logs, the {{yarn}} CLI needs to be able to read the > ConcatenatableAggregatedLogFormat or the AggregatedLogFormat transparently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-3219) Modify the NM to write logs using the ConcatenatableAggregatedLogFormat
[ https://issues.apache.org/jira/browse/YARN-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-3219. - Resolution: Won't Fix Closing this as "Won't Fix" given we have MAPREDUCE-6415. > Modify the NM to write logs using the ConcatenatableAggregatedLogFormat > --- > > Key: YARN-3219 > URL: https://issues.apache.org/jira/browse/YARN-3219 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > > The NodeManager should use the {{ConcatenatableAggregatedLogFormat}} from > YARN-3218 instead of the {{AggregatedLogFormat}} for writing aggregated log > files to HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-2942. - Resolution: Won't Fix Target Version/s: (was: 2.8.0) Closing this as "Won't Fix" given we have MAPREDUCE-6415. > Aggregated Log Files should be combined > --- > > Key: YARN-2942 > URL: https://issues.apache.org/jira/browse/YARN-2942 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.6.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: CombinedAggregatedLogsProposal_v3.pdf, > CombinedAggregatedLogsProposal_v6.pdf, CombinedAggregatedLogsProposal_v7.pdf, > CompactedAggregatedLogsProposal_v1.pdf, > CompactedAggregatedLogsProposal_v2.pdf, > ConcatableAggregatedLogsProposal_v4.pdf, > ConcatableAggregatedLogsProposal_v5.pdf, > ConcatableAggregatedLogsProposal_v8.pdf, YARN-2942-preliminary.001.patch, > YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, > YARN-2942.003.patch > > > Turning on log aggregation allows users to easily store container logs in > HDFS and subsequently view them in the YARN web UIs from a central place. > Currently, there is a separate log file for each Node Manager. This can be a > problem for HDFS if you have a cluster with many nodes as you’ll slowly start > accumulating many (possibly small) files per YARN application. The current > “solution” for this problem is to configure YARN (actually the JHS) to > automatically delete these files after some amount of time. > We should improve this by compacting the per-node aggregated log files into > one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-3218) Implement ConcatenatableAggregatedLogFormat Reader and Writer
[ https://issues.apache.org/jira/browse/YARN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-3218. - Resolution: Won't Fix Closing this as "Won't Fix" given we have MAPREDUCE-6415. > Implement ConcatenatableAggregatedLogFormat Reader and Writer > - > > Key: YARN-3218 > URL: https://issues.apache.org/jira/browse/YARN-3218 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-3218.001.patch, YARN-3218.002.patch > > > We need to create a Reader and Writer for the > {{ConcatenatableAggregatedLogFormat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-5566) client-side NM graceful decom doesn't trigger when jobs finish
Robert Kanter created YARN-5566: --- Summary: client-side NM graceful decom doesn't trigger when jobs finish Key: YARN-5566 URL: https://issues.apache.org/jira/browse/YARN-5566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter I was testing the client-side NM graceful decommission and noticed that it was always waiting for the timeout, even if all jobs running on that node (or even the cluster) had already finished. For example: # JobA is running with at least one container on NodeA # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> NodeA enters DECOMMISSIONING state # JobA finishes at 6:00am and there are no other jobs running on NodeA # User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA NodeA should have decommissioned at 6:00am. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-5515) Compatibility Docs should clarify the policy for what takes precedence when a conflict is found
Robert Kanter created YARN-5515: --- Summary: Compatibility Docs should clarify the policy for what takes precedence when a conflict is found Key: YARN-5515 URL: https://issues.apache.org/jira/browse/YARN-5515 Project: Hadoop YARN Issue Type: Task Components: documentation Affects Versions: 2.7.2 Reporter: Robert Kanter The Compatibility Docs (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/Compatibility.html#Java_API) list the policies for Private, Public, not annotated, etc. classes and members, but they don't say what happens when there's a conflict. We should obviously try to avoid this situation, but it would be good to explicitly state what takes precedence. As an example, until YARN-3225 made it consistent, {{RefreshNodesRequest}} looked like this: {code:java} @Private @Stable public abstract class RefreshNodesRequest { @Public @Stable public static RefreshNodesRequest newInstance() { RefreshNodesRequest request = Records.newRecord(RefreshNodesRequest.class); return request; } } {code} Note that the class is marked {{\@Private}}, but the method is marked {{\@Public}}. In this example, I'd say that the class level should have priority. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-5514) Clarify DecommissionType.FORCEFUL comment
Robert Kanter created YARN-5514: --- Summary: Clarify DecommissionType.FORCEFUL comment Key: YARN-5514 URL: https://issues.apache.org/jira/browse/YARN-5514 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.8.0 Reporter: Robert Kanter The comment for {{org.apache.hadoop.yarn.api.records.DecommissionType.FORCEFUL}} is a little unclear. It says: {code} /** Forceful decommissioning of nodes which are already in progress **/ {code} It's not exactly clear what the nodes are in the process of. It should say something like "Forceful decommissioning of nodes which are already in the process of decommissioning". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
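A sketch of the clarified Javadoc being asked for; only the {{FORCEFUL}} constant is shown and the surrounding enum is abbreviated for illustration.

{code:java}
public enum DecommissionType {
  /** Forceful decommissioning of nodes whose decommissioning is already in progress. */
  FORCEFUL
}
{code}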
[jira] [Created] (YARN-5465) Server-Side NM Graceful Decommissioning subsequent call behavior
Robert Kanter created YARN-5465: --- Summary: Server-Side NM Graceful Decommissioning subsequent call behavior Key: YARN-5465 URL: https://issues.apache.org/jira/browse/YARN-5465 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Kanter The Server-Side NM Graceful Decommissioning feature added by YARN-4676 has the following behavior when subsequent calls are made: # Start a long-running job that has containers running on nodeA # Add nodeA to the exclude file # Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully decommissioning nodeA # Wait 30 seconds # Add nodeB to the exclude file # Run {{-refreshNodes -g 30 -server}} (30sec) # After 30 seconds, both nodeA and nodeB shut down In a nutshell, issuing a subsequent call to gracefully decommission nodes updates the timeout for any currently decommissioning nodes. This makes it impossible to gracefully decommission different sets of nodes with different timeouts, though it does make it easy to update the timeout of currently decommissioning nodes. An alternative behavior would be this: # {color:grey}Start a long-running job that has containers running on nodeA{color} # {color:grey}Add nodeA to the exclude file{color} # {color:grey}Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully decommissioning nodeA{color} # {color:grey}Wait 30 seconds{color} # {color:grey}Add nodeB to the exclude file{color} # {color:grey}Run {{-refreshNodes -g 30 -server}} (30sec){color} # After 30 seconds, nodeB shuts down # After 60 more seconds, nodeA shuts down This keeps the nodes affected by each call to gracefully decommission nodes independent. You can now have different sets of decommissioning nodes with different timeouts. However, to update the timeout of a currently decommissioning node, you'd have to first recommission it, and then decommission it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA
Robert Kanter created YARN-5464: --- Summary: Server-Side NM Graceful Decommissioning with RM HA Key: YARN-5464 URL: https://issues.apache.org/jira/browse/YARN-5464 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Kanter Assignee: Robert Kanter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-5434) Add -client|server argument for graceful decom
Robert Kanter created YARN-5434: --- Summary: Add -client|server argument for graceful decom Key: YARN-5434 URL: https://issues.apache.org/jira/browse/YARN-5434 Project: Hadoop YARN Issue Type: Sub-task Components: graceful Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter We should add {{-client|server}} argument to allow the user to specify if they want to use the client-side graceful decom tracking, or the server-side tracking (YARN-4676). Even though the server-side tracking won't go into 2.8, we should add the arguments to 2.8 for compatibility between 2.8 and 2.9, when it's added. In 2.8, using {{-server}} would just throw an Exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-4366) Fix Lint Warnings in YARN Common
[ https://issues.apache.org/jira/browse/YARN-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-4366. - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.9.0 Thanks [~templedf]. Committed to trunk and branch-2! > Fix Lint Warnings in YARN Common > > > Key: YARN-4366 > URL: https://issues.apache.org/jira/browse/YARN-4366 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton > Fix For: 2.9.0 > > Attachments: YARN-4366.001.patch > > > {noformat} > [WARNING] > /Users/daniel/NetBeansProjects/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Router.java:[100,45] > non-varargs call of varargs method with inexact argument type for last > parameter; > cast to java.lang.Class for a varargs call > cast to java.lang.Class[] for a non-varargs call and to suppress this > warning > [WARNING] > /Users/daniel/NetBeansProjects/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factory/providers/RpcFactoryProvider.java:[62,46] > non-varargs call of varargs method with inexact argument type for last > parameter; > cast to java.lang.Class for a varargs call > cast to java.lang.Class[] for a non-varargs call and to suppress this > warning > [WARNING] > /Users/daniel/NetBeansProjects/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factory/providers/RpcFactoryProvider.java:[64,34] > non-varargs call of varargs method with inexact argument type for last > parameter; > cast to java.lang.Object for a varargs call > cast to java.lang.Object[] for a non-varargs call and to suppress this > warning > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
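For context, a self-contained illustration of the lint warning fixed above; the {{count}} method and class name are made up for the example and are not from the YARN code.

{code:java}
public class VarargsLintSketch {
  // A varargs method, analogous to the Class<?>... parameters flagged in Router
  // and RpcFactoryProvider.
  static int count(Class<?>... types) {
    return types == null ? 0 : types.length;
  }

  public static void main(String[] args) {
    // "Inexact" call: a bare null matches both Class<?> (one element) and Class<?>[]
    // (the whole array), so javac emits the warning quoted above.
    System.out.println(count(null));
    // The casts recommended by the warning make the intent explicit and silence it:
    System.out.println(count((Class<?>) null));   // varargs call with one null element -> 1
    System.out.println(count((Class<?>[]) null)); // non-varargs call with a null array -> 0
  }
}
{code}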
[jira] [Created] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs
Robert Kanter created YARN-4946: --- Summary: RM should write out Aggregated Log Completion file flag next to logs Key: YARN-4946 URL: https://issues.apache.org/jira/browse/YARN-4946 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Haibo Chen MAPREDUCE-6415 added a tool that combines the aggregated log files for each Yarn App into a HAR file. When run, it seeds the list by looking at the aggregated logs directory, and then filters out ineligible apps. One of the criteria involves checking with the RM that an Application's log aggregation status is not still running and has not failed. When the RM "forgets" about an older completed Application (e.g. RM failover, enough time has passed, etc.), the tool won't find the Application in the RM and will just assume that its log aggregation succeeded, even if it actually failed or is still running. We can solve this problem by doing the following: # When the RM sees that an Application has successfully finished aggregating its logs, it will write a flag file next to that Application's log files # The tool no longer talks to the RM at all. When looking at the FileSystem, it now uses that flag file to determine if it should process those log files. If the file is there, it archives; otherwise, it does not. # As part of the archiving process, it will delete the flag file # (If you don't run the tool, the flag file will eventually be cleaned up by the JHS when it cleans up the aggregated logs because it's in the same directory) This improvement has several advantages: # The edge case about "forgotten" Applications is fixed # The tool no longer has to talk to the RM; it only has to consult HDFS. This is simpler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
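A minimal sketch of what the tool-side check could look like under this proposal, using the Hadoop FileSystem API. The flag file name ({{_aggregation_succeeded}}) is a placeholder invented for the example; the issue does not specify the actual name.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AggregationFlagSketch {
  // Placeholder flag file name; the real name is not defined by this issue.
  private static final String FLAG_FILE = "_aggregation_succeeded";

  // Step 2 of the proposal: the tool decides whether to archive an app's logs purely
  // by checking for the RM-written flag file, without talking to the RM.
  static boolean shouldArchive(Configuration conf, Path appLogDir) throws IOException {
    FileSystem fs = appLogDir.getFileSystem(conf);
    return fs.exists(new Path(appLogDir, FLAG_FILE));
  }

  // Step 3 of the proposal: after archiving, the flag file is deleted.
  static void clearFlag(Configuration conf, Path appLogDir) throws IOException {
    FileSystem fs = appLogDir.getFileSystem(conf);
    fs.delete(new Path(appLogDir, FLAG_FILE), false);
  }
}
{code}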
[jira] [Resolved] (YARN-2736) Job.getHistoryUrl returns empty string
[ https://issues.apache.org/jira/browse/YARN-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-2736. - Resolution: Fixed > Job.getHistoryUrl returns empty string > -- > > Key: YARN-2736 > URL: https://issues.apache.org/jira/browse/YARN-2736 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 2.5.1 >Reporter: Kannan Rajah >Priority: Critical > > getHistoryUrl() method in Job class is returning empty string. Example code: > job = Job.getInstance(conf); > job.setJobName("MapReduceApp"); > job.setJarByClass(MapReduceApp.class); > job.setMapperClass(Mapper1.class); > job.setCombinerClass(Reducer1.class); > job.setReducerClass(Reducer1.class); > job.setMapOutputKeyClass(Text.class); > job.setMapOutputValueClass(IntWritable.class); > job.setOutputKeyClass(Text.class); > job.setOutputValueClass(IntWritable.class); > job.setNumReduceTasks(1); > job.setOutputFormatClass(TextOutputFormat.class); > job.setInputFormatClass(TextInputFormat.class); > FileInputFormat.addInputPath(job, inputPath); > FileOutputFormat.setOutputPath(job, outputPath); > job.waitForCompletion(true); > job.getHistoryUrl(); > It is always returning empty string. Looks like getHistoryUrl() support was > removed in YARN-321. > getTrackingURL() returns correct url though. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4408) NodeManager still reports negative running containers
Robert Kanter created YARN-4408: --- Summary: NodeManager still reports negative running containers Key: YARN-4408 URL: https://issues.apache.org/jira/browse/YARN-4408 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Robert Kanter Assignee: Robert Kanter YARN-1697 fixed a problem where the NodeManager metrics could report a negative number of running containers. However, it missed a rare case where this can still happen. YARN-1697 added a flag to indicate if the container was actually launched ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to {{DONE}} and from {{EXITED_WITH_FAILURE}} to {{DONE}}, so that the gauge is only decremented if the container actually ran and the gauge was incremented. However, this flag is not checked when transitioning from {{EXITED_WITH_SUCCESS}} to {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
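To make the reasoning concrete, a small self-contained sketch of the guard described above; the class, field, and method names are stand-ins, not the actual {{ContainerImpl}} or NodeManager metrics internals.

{code:java}
public class RunningContainersGaugeSketch {
  private int runningContainers;   // stands in for the NM's running-containers gauge
  private boolean wasLaunched;     // the YARN-1697 flag: set only on LOCALIZED -> RUNNING

  void onLaunched() {
    wasLaunched = true;
    runningContainers++;
  }

  // Every *_TO_DONE transition must apply the same check; the bug is that the
  // EXITED_WITH_SUCCESS -> DONE path decrements without consulting the flag.
  void onDone() {
    if (wasLaunched) {
      runningContainers--;         // only undo an increment that actually happened
    }
  }

  int getRunningContainers() {
    return runningContainers;
  }
}
{code}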
[jira] [Created] (YARN-4086) Allow Aggregated Log readers to handle HAR files
Robert Kanter created YARN-4086: --- Summary: Allow Aggregated Log readers to handle HAR files Key: YARN-4086 URL: https://issues.apache.org/jira/browse/YARN-4086 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter This is for the YARN changes for MAPREDUCE-6415. It allows the yarn CLI and web UIs to read aggregated logs from HAR files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4019) Add JvmPauseMonitor to ResourceManager and NodeManager
Robert Kanter created YARN-4019: --- Summary: Add JvmPauseMonitor to ResourceManager and NodeManager Key: YARN-4019 URL: https://issues.apache.org/jira/browse/YARN-4019 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the ResourceManager and NodeManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
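As a rough sketch, wiring the monitor into a daemon's lifecycle could look like the snippet below. This assumes the original HADOOP-9618 style API (a {{Configuration}} constructor plus {{start()}}/{{stop()}}); the exact constructor and lifecycle hooks differ between Hadoop versions, so treat this as illustrative only.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.JvmPauseMonitor;

public class PauseMonitorWiringSketch {
  private JvmPauseMonitor pauseMonitor;

  // Called from the daemon's serviceStart()-equivalent; assumes the
  // JvmPauseMonitor(Configuration) constructor from HADOOP-9618.
  void startPauseMonitor(Configuration conf) {
    pauseMonitor = new JvmPauseMonitor(conf);
    pauseMonitor.start();
  }

  // Called from the daemon's serviceStop()-equivalent.
  void stopPauseMonitor() {
    if (pauseMonitor != null) {
      pauseMonitor.stop();
    }
  }
}
{code}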
[jira] [Created] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell
Robert Kanter created YARN-3950: --- Summary: Add unique SHELL_ID environment variable to DistributedShell Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
Robert Kanter created YARN-3812: --- Summary: TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347 Key: YARN-3812 URL: https://issues.apache.org/jira/browse/YARN-3812 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 3.0.0 Reporter: Robert Kanter {{TestRollingLevelDBTimelineStore}} is failing with the below errors in trunk. I did a git bisect and found that it was due to HADOOP-11347, which changed something with umasks in {{FsPermission}}. {noformat} Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 1.533 sec <<< ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.085 sec <<< ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.07 sec <<< ERROR! 
java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testGetEntitiesWithPrimaryFilters(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.061 sec <<< ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWith
[jira] [Created] (YARN-3729) Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat
Robert Kanter created YARN-3729: --- Summary: Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat Key: YARN-3729 URL: https://issues.apache.org/jira/browse/YARN-3729 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter When serving logs, the {{yarn}} CLI needs to be able to read the ConcatenatableAggregatedLogFormat or the AggregatedLogFormat transparently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3728) Add an rmadmin command to compact concatenated aggregated logs
Robert Kanter created YARN-3728: --- Summary: Add an rmadmin command to compact concatenated aggregated logs Key: YARN-3728 URL: https://issues.apache.org/jira/browse/YARN-3728 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Create an {{rmadmin}} command to compact any concatenated aggregated log files it finds in the aggregated logs directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3580) [JDK 8] TestClientRMService.testGetLabelsToNodes fails
Robert Kanter created YARN-3580: --- Summary: [JDK 8] TestClientRMService.testGetLabelsToNodes fails Key: YARN-3580 URL: https://issues.apache.org/jira/browse/YARN-3580 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.8.0 Environment: JDK 8 Reporter: Robert Kanter Assignee: Robert Kanter When using JDK 8, {{TestClientRMService.testGetLabelsToNodes}} fails: {noformat} java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetLabelsToNodes(TestClientRMService.java:1499) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3400) [JDK 8] Build Failure due to unreported exceptions in RPCUtil
Robert Kanter created YARN-3400: --- Summary: [JDK 8] Build Failure due to unreported exceptions in RPCUtil Key: YARN-3400 URL: https://issues.apache.org/jira/browse/YARN-3400 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Kanter When I try compiling Hadoop with JDK 8 like this {noformat} mvn clean package -Pdist -Dtar -DskipTests -Djavac.version=1.8 {noformat} I get this error: {noformat} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hadoop-yarn-common: Compilation failure: Compilation failure: [ERROR] /Users/rkanter/dev/hadoop-common2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java:[101,11] unreported exception java.lang.Throwable; must be caught or declared to be thrown [ERROR] /Users/rkanter/dev/hadoop-common2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java:[104,11] unreported exception java.lang.Throwable; must be caught or declared to be thrown [ERROR] /Users/rkanter/dev/hadoop-common2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java:[107,11] unreported exception java.lang.Throwable; must be caught or declared to be thrown {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3220) JHS should display Combined Aggregated Logs when available
Robert Kanter created YARN-3220: --- Summary: JHS should display Combined Aggregated Logs when available Key: YARN-3220 URL: https://issues.apache.org/jira/browse/YARN-3220 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter The JHS should read the Combined Aggregated Log files created by YARN-3219 when the user asks it for logs. When unavailable, it should fallback to the regular Aggregated Log files (the current behavior). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3219) Use CombinedAggregatedLogFormat Writer to combine aggregated log files
Robert Kanter created YARN-3219: --- Summary: Use CombinedAggregatedLogFormat Writer to combine aggregated log files Key: YARN-3219 URL: https://issues.apache.org/jira/browse/YARN-3219 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter The NodeManager should use the {{CombinedAggregatedLogFormat}} from YARN-3218 to append its aggregated log to the per-app log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3218) Implement CombinedAggregatedLogFormat Reader and Writer
Robert Kanter created YARN-3218: --- Summary: Implement CombinedAggregatedLogFormat Reader and Writer Key: YARN-3218 URL: https://issues.apache.org/jira/browse/YARN-3218 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter We need to create a Reader and Writer for the CombinedAggregatedLogFormat -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3183) Some classes define hashcode() but not equals()
Robert Kanter created YARN-3183: --- Summary: Some classes define hashcode() but not equals() Key: YARN-3183 URL: https://issues.apache.org/jira/browse/YARN-3183 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor These files all define {{hashCode}}, but don't define {{equals}}: {noformat} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationAttemptFinishEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationAttemptStartEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationFinishEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationStartEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingContainerFinishEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingContainerStartEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/AppAttemptFinishedEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/AppAttemptRegisteredEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationCreatedEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationFinishedEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ContainerCreatedEvent.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ContainerFinishedEvent.java {noformat} This one unnecessarily defines {{equals}}: {noformat} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceRetentionSet.java {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
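For background, the usual expectation is that {{hashCode}} and {{equals}} are overridden together so that objects considered equal also hash equally; defining only one of them is a common FindBugs-style red flag. The class below is a generic illustration, not one of the listed event classes.

{code:java}
import java.util.Objects;

final class EventKeySketch {
  private final String appId;
  private final long timestamp;

  EventKeySketch(String appId, long timestamp) {
    this.appId = appId;
    this.timestamp = timestamp;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof EventKeySketch)) {
      return false;
    }
    EventKeySketch other = (EventKeySketch) o;
    return timestamp == other.timestamp && Objects.equals(appId, other.appId);
  }

  @Override
  public int hashCode() {
    // Must be consistent with equals(): equal objects produce equal hash codes.
    return Objects.hash(appId, timestamp);
  }
}
{code}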
[jira] [Created] (YARN-2942) Aggregated Log Files should be compacted
Robert Kanter created YARN-2942: --- Summary: Aggregated Log Files should be compacted Key: YARN-2942 URL: https://issues.apache.org/jira/browse/YARN-2942 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Turning on log aggregation allows users to easily store container logs in HDFS and subsequently view them in the YARN web UIs from a central place. Currently, there is a separate log file for each Node Manager. This can be a problem for HDFS if you have a cluster with many nodes as you’ll slowly start accumulating many (possibly small) files per YARN application. The current “solution” for this problem is to configure YARN (actually the JHS) to automatically delete these files after some amount of time. We should improve this by compacting the per-node aggregated log files into one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2766) [JDK 8] TestApplicationHistoryClientService fails
Robert Kanter created YARN-2766: --- Summary: [JDK 8] TestApplicationHistoryClientService fails Key: YARN-2766 URL: https://issues.apache.org/jira/browse/YARN-2766 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter {{TestApplicationHistoryClientService.testContainers}} and {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail because the test assertions are assuming a returned Collection is in a certain order. The collection comes from a HashMap, so the order is not guaranteed, plus, according to [this page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], there are situations where the iteration order of a HashMap will be different between Java 7 and 8. We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
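One common way to make such assertions independent of iteration order is to compare the returned elements as a set (or sort them first) rather than indexing into the collection. A generic sketch, not the actual fix that was committed:

{code:java}
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class OrderIndependentAssertSketch {
  // Compares two collections of IDs without assuming any particular iteration order.
  static void assertSameIds(List<String> expected, List<String> actual) {
    assertEquals(new HashSet<String>(expected), new HashSet<String>(actual));
  }

  public static void main(String[] args) {
    // Passes on both Java 7 and Java 8 even though the order differs.
    assertSameIds(
        Arrays.asList("attempt_1", "attempt_2"),
        Arrays.asList("attempt_2", "attempt_1"));
  }
}
{code}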
[jira] [Created] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup
Robert Kanter created YARN-2241: --- Summary: Show nicer messages when ZNodes already exist in ZKRMStateStore on startup Key: YARN-2241 URL: https://issues.apache.org/jira/browse/YARN-2241 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor When using the ZKRMStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected, as these nodes already exist from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
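A minimal sketch of the kind of handling being asked for, assuming direct use of the ZooKeeper client API; the real ZKRMStateStore goes through its own retry wrappers, so this is illustrative only.

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ZnodeCreateSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ZnodeCreateSketch.class);

  // On RM restart the root znode (e.g. /rmstore) usually already exists; that is
  // expected, so log a short message instead of letting the stack trace surface.
  static void createIfMissing(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      LOG.info("ZNode " + path + " already exists, reusing it");
    }
  }
}
{code}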
[jira] [Created] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
Robert Kanter created YARN-2204: --- Summary: TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler Key: YARN-2204 URL: https://issues.apache.org/jira/browse/YARN-2204 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2199) FairScheduler: Allow max-AM-share to be specified in the root queue
Robert Kanter created YARN-2199: --- Summary: FairScheduler: Allow max-AM-share to be specified in the root queue Key: YARN-2199 URL: https://issues.apache.org/jira/browse/YARN-2199 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter If users want to specify the max-AM-share, they have to do it for each leaf queue individually. It would be convenient if they could also specify it in the root queue so they'd only have to specify it once to apply to all queues. It could still be overridden in a specific leaf queue though. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2187) FairScheduler should have a way of disabling the max AM share check for launching new AMs
Robert Kanter created YARN-2187: --- Summary: FairScheduler should have a way of disabling the max AM share check for launching new AMs Key: YARN-2187 URL: https://issues.apache.org/jira/browse/YARN-2187 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Say you have a small cluster with 8gb of memory and 5 queues. With equal shares, each queue gets 8gb / 5 = 1.6gb, but an AM requires 2gb to start, so no AMs can be started. We should have a way of disabling this check to prevent this problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2015) HTTPS doesn't work properly for daemons (RM, JHS, NM)
[ https://issues.apache.org/jira/browse/YARN-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-2015. - Resolution: Invalid Nevermind, this appears to be fixed by YARN-1553 > HTTPS doesn't work properly for daemons (RM, JHS, NM) > - > > Key: YARN-2015 > URL: https://issues.apache.org/jira/browse/YARN-2015 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0, 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Blocker > > Enabling SSL in the site files and setting up a certificate, keystore, etc > doesn't actually enable HTTPS. The RM, NMs, and JHS will use their https > port, but use only HTTP on them instead of only HTTPS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2015) HTTPS doesn't work properly for daemons (RM, JHS, NM)
Robert Kanter created YARN-2015: --- Summary: HTTPS doesn't work properly for daemons (RM, JHS, NM) Key: YARN-2015 URL: https://issues.apache.org/jira/browse/YARN-2015 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Enabling SSL in the site files and setting up a certificate, keystore, etc doesn't actually enable HTTPS. The RM, NMs, and JHS will use their https port, but use only HTTP on them instead of only HTTPS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1846) TestRM.testNMTokenSentForNormalContainer assumes CapacityScheduler
Robert Kanter created YARN-1846: --- Summary: TestRM.testNMTokenSentForNormalContainer assumes CapacityScheduler Key: YARN-1846 URL: https://issues.apache.org/jira/browse/YARN-1846 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Assignee: Robert Kanter TestRM.testNMTokenSentForNormalContainer assumes the CapacityScheduler is being used and tries to do: {code:java} CapacityScheduler cs = (CapacityScheduler) rm.getResourceScheduler(); {code} This throws a {{ClassCastException}} if you're not using the CapacityScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
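One way to keep such a cast valid regardless of the cluster's default scheduler is to pin the scheduler in the test configuration before starting the RM; this is an illustrative option, not necessarily the fix that was committed.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

public class ForceCapacitySchedulerSketch {
  // Returns a configuration that explicitly selects the CapacityScheduler, so the
  // (CapacityScheduler) cast in the test cannot fail with a ClassCastException.
  static YarnConfiguration capacitySchedulerConf() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setClass(YarnConfiguration.RM_SCHEDULER,
        CapacityScheduler.class, ResourceScheduler.class);
    return conf;
  }
}
{code}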
[jira] [Resolved] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-1795. - Resolution: Duplicate Assignee: Robert Kanter (was: Karthik Kambatla) I tried the patch posted at YARN-1839 and it fixes the problem. Marking this as a duplicate of that. > After YARN-713, using FairScheduler can cause an InvalidToken Exception for > NMTokens > > > Key: YARN-1795 > URL: https://issues.apache.org/jira/browse/YARN-1795 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Blocker > Attachments: > org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog > > > Running the Oozie unit tests against a Hadoop build with YARN-713 causes many > of the tests to be flakey. Doing some digging, I found that they were > failing because some of the MR jobs were failing; I found this in the syslog > of the failed jobs: > {noformat} > 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics > report from attempt_1394064846476_0013_m_00_0: Container launch failed > for container_1394064846476_0013_01_03 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for 192.168.1.77:50759 >at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) >at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:196) >at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) >at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) >at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) >at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) >at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >at java.lang.Thread.run(Thread.java:744) > {noformat} > I did some debugging and found that the NMTokenCache has a different port > number than what's being looked up. For example, the NMTokenCache had one > token with address 192.168.1.77:58217 but > ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. > The 58213 address comes from ContainerLauncherImpl's constructor. So when the > Container is being launched it somehow has a different port than when the > token was created. > Any ideas why the port numbers wouldn't match? > Update: This also happens in an actual cluster, not just Oozie's unit tests -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1822) Revisit AM link being broken for work preserving restart
[ https://issues.apache.org/jira/browse/YARN-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter resolved YARN-1822. - Resolution: Invalid YARN-1811 is being done differently, and this is no longer needed > Revisit AM link being broken for work preserving restart > > > Key: YARN-1822 > URL: https://issues.apache.org/jira/browse/YARN-1822 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Robert Kanter > > We should revisit the issue in YARN-1811 as it may require changes once we > have work-preserving restarts. > Currently, the AmIpFilter is given the active RM at AM > initialization/startup, so when the RM fails over and the AM is restarted, > this gets recalculated properly. However, with work-preserving restart, this > will now point to the inactive RM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1822) Revisit AM link being broken for RM restart
Robert Kanter created YARN-1822: --- Summary: Revisit AM link being broken for RM restart Key: YARN-1822 URL: https://issues.apache.org/jira/browse/YARN-1822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Robert Kanter We should revisit the issue in YARN-1811 as it may require changes once we have work-preserving restarts. Currently, the AmIpFilter is given the active RM at AM initialization/startup, so when the RM fails over and the AM is restarted, this gets recalculated properly. However, with work-preserving restart, this will now point to the inactive RM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1811) Error 500 when clicking the "Application Master" link in the RM UI while a job is running with RM HA
Robert Kanter created YARN-1811: --- Summary: Error 500 when clicking the "Application Master" link in the RM UI while a job is running with RM HA Key: YARN-1811 URL: https://issues.apache.org/jira/browse/YARN-1811 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Robert Kanter Assignee: Robert Kanter When using RM HA, if you click on the "Application Master" link in the RM web UI while the job is running, you get an Error 500: {noformat} HTTP ERROR 500 Problem accessing /proxy/application_1381788742937_0003/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.(Socket.java:425) at java.net.Socket.(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:185) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:334) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1077) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at 
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.j
[jira] [Created] (YARN-1795) Oozie tests are flakey after YARN-713
Robert Kanter created YARN-1795: --- Summary: Oozie tests are flakey after YARN-713 Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs: {noformat} 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1731) ResourceManager should record killed ApplicationMasters for History
Robert Kanter created YARN-1731: --- Summary: ResourceManager should record killed ApplicationMasters for History Key: YARN-1731 URL: https://issues.apache.org/jira/browse/YARN-1731 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-1731.patch Yarn changes required for MAPREDUCE-5641 to make the RM record when an AM is killed so the JHS (or something else) can know about it. See MAPREDUCE-5641 for the design I'm trying to follow. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1245) org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestart times out
Robert Kanter created YARN-1245: --- Summary: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestart times out Key: YARN-1245 URL: https://issues.apache.org/jira/browse/YARN-1245 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Kanter Priority: Trivial Attachments: YARN-1245.patch We've been seeing org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestart time out. We should increase the timeout. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira