[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583064#comment-14583064 ] Hadoop QA commented on YARN-3768:

| (/) *{color:green}+1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 57s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 55s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 35s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. |
| | | | 40m 2s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12739173/YARN-3768.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 83e8110 |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8241/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8241/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8241/console |

This message was automatically generated.

> Index out of range exception with environment variables without values
> --
>
> Key: YARN-3768
> URL: https://issues.apache.org/jira/browse/YARN-3768
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.5.0
> Reporter: Joe Ferner
> Assignee: zhihai xu
> Attachments: YARN-3768.000.patch
>
> Looking at line 80 of org.apache.hadoop.yarn.util.Apps, an index out of range
> exception occurs if an environment variable is encountered without a value.
> I believe this occurs because Java will not return trailing empty strings from the
> split method. Similar to
> http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
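The split behavior the reporter describes can be demonstrated in isolation. The sketch below is illustrative (the class name SplitDemo is invented for the demo): by default Java's String.split drops trailing empty strings, so indexing parts[1] for a valueless variable fails, while a limit of -1 preserves the trailing empty string.

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Default split drops trailing empty strings: "FOO=" yields a
        // single-element array, so parts[1] would throw
        // ArrayIndexOutOfBoundsException, matching the reported bug.
        String[] parts = "FOO=".split("=");
        System.out.println(parts.length);        // 1

        // Passing a limit of -1 keeps trailing empty strings, so the
        // value slot exists (as an empty string) and indexing is safe.
        String[] safe = "FOO=".split("=", -1);
        System.out.println(safe.length);         // 2
        System.out.println("[" + safe[1] + "]"); // []
    }
}
```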
[jira] [Assigned] (YARN-3750) yarn.log.server.url is not documented in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt reassigned YARN-3750: -- Assignee: Bibin A Chundatt > yarn.log.server.url is not documented in yarn-default.xml > - > > Key: YARN-3750 > URL: https://issues.apache.org/jira/browse/YARN-3750 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 2.7.0 >Reporter: Dmitry Sivachenko >Assignee: Bibin A Chundatt >Priority: Minor > > From > http://mail-archives.apache.org/mod_mbox/hadoop-user/201505.mbox/%3cd18c9931.52700%25xg...@hortonworks.com%3e > I learned about yarn.log.server.url setting. > But it is not mentioned in yarn-default.xml file. > I propose to add this variable there with some short description. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583093#comment-14583093 ] Naganarasimha G R commented on YARN-3644:

Hi [~raju.bairishetti], IIUC the intention of this jira is only to make the NM wait for the RM indefinitely, and hence we don't want to set {{yarn.resourcemanager.connect.max-wait.ms}} to a FOREVER retry policy, which might affect other clients connecting to the RM, right? If so, I feel the overall approach is fine except for the cosmetic comments below:
# {{NM_SHUTSDWON_ON_RM_CONNECTION_FAILURES}} typo, SHUTSDWON => SHUTDOWN
# If you agree with the earlier point, then {{DEFAULT_NM_SHUTSDOWN_ON_RM_CONNECTION_FAILURES}} => {{DEFAULT_NM_SHUTDOWN_ON_RM_CONNECTION_FAILURES}}
# The configuration could be {{yarn.nodemanager.shutdown.on.connection.failures}} => {{yarn.nodemanager.shutdown.on.RM.connection.failures}}. Correct the name and description in yarn-default.xml as well.
# The testcase introduces a new {{MyNodeStatusUpdater6}} whose only change is to get the new ResourceTracker for the test case. This is becoming more and more duplicated code, as most of the other overloaded NodeStatusUpdater classes do the same, so can we bring in a common NodeStatusUpdater class which accepts a ResourceTracker as a constructor parameter? (Refactoring the other classes can be taken up in another jira if required.)
# {{MyResourceTracker8}} could extend {{MyResourceTracker5}} and just override the required methods. Would also appreciate it if some documentation were added above these classes, so that it will be helpful for reuse in future.

> Node manager shuts down if unable to connect with RM
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Srikanth Sundarrajan
> Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.patch
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
> } catch (ConnectException e) {
>   // catch and throw the exception if tried MAX wait time to connect RM
>   dispatcher.getEventHandler().handle(
>       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>   throw new YarnRuntimeException(e);
> {code}
>
> In large clusters, if the RM is down for maintenance for a longer period, all the
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side
> effects, where non-connection failures are retried infinitely by all
> YarnClients (via RMProxy).
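As a rough illustration of the trade-off discussed in this jira (bounded retries that end in shutdown versus waiting indefinitely), here is a minimal self-contained sketch. It is not YARN's actual RMProxy code; the class, method, and parameter names are invented. A non-positive maxAttempts means retry forever, which is the behavior the NM wants for the RM connection without changing the retry policy shared by other clients.

```java
import java.util.concurrent.TimeUnit;

public class RetryForeverSketch {
    // Stand-in for the NM's registration call; not a YARN interface.
    interface Connector {
        boolean tryConnect();
    }

    // Retries with capped exponential backoff. Returns the attempt number
    // on success, or -1 when a positive maxAttempts is exhausted (the
    // caller can then decide whether to shut down, as the NM does today).
    static int connectWithRetry(Connector c, long baseMs, long capMs,
                                int maxAttempts) throws InterruptedException {
        long backoff = baseMs;
        for (int attempt = 1; ; attempt++) {
            if (c.tryConnect()) {
                return attempt;
            }
            if (maxAttempts > 0 && attempt >= maxAttempts) {
                return -1; // bounded mode: give up, let the caller react
            }
            TimeUnit.MILLISECONDS.sleep(backoff);
            backoff = Math.min(backoff * 2, capMs); // cap the backoff
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated RM that comes back on the third attempt.
        Connector flaky = () -> ++calls[0] >= 3;
        System.out.println(connectWithRetry(flaky, 1L, 8L, -1)); // 3
    }
}
```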
[jira] [Assigned] (YARN-3743) Allow admin specify labels from RM with node labels provider
[ https://issues.apache.org/jira/browse/YARN-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu reassigned YARN-3743: - Assignee: Dian Fu > Allow admin specify labels from RM with node labels provider > > > Key: YARN-3743 > URL: https://issues.apache.org/jira/browse/YARN-3743 > Project: Hadoop YARN > Issue Type: Task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3743.1.patch > > > As discussed in YARN-3557, providing a node label configuration mechanism > similar to YARN-2495 at RM side would ease the use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583116#comment-14583116 ] Sidharta Seethana commented on YARN-2194:

[~ywskycn], you'll need to change {{PrivilegedOperationExecutor}} as well:
{code}
if (noneArgsOnly == false) {
  //We have already appended at least one tasks file.
  finalOpArg.append(",");
  finalOpArg.append(tasksFile);
} else {
  finalOpArg.append(tasksFile);
  noneArgsOnly = false;
}
{code}
The tests appear to pass in TestLinuxContainerExecutorWithMocks, but it is not clear why. One example in {{TestLinuxContainerExecutorWithMocks}} that should have caused a test failure:
{code}
StringUtils.join(",", dirsHandler.getLocalDirs()),
StringUtils.join(",", dirsHandler.getLogDirs()),
"cgroups=none"),
{code}
It appears to me that this construction is done in enough places that it would make sense to create a static constant for use as a separator when constructing an argument for the container-executor binary. A good candidate location for such a constant would be the {{PrivilegedOperation}} class. You could, in addition, also 'hide' the join functionality by adding a static function in the {{PrivilegedOperation}} class.

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, YARN-2194-4.patch
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the
> controller name leads to container launch failure.
> RHEL7 deprecates libcgroup and recommends the use of systemd. However,
> systemd has certain shortcomings as identified in this JIRA (see comments).
> This JIRA only fixes the failure, and doesn't try to use systemd.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
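The suggestion above, a shared separator constant plus a join helper in {{PrivilegedOperation}}, can be sketched as a standalone class. This is illustrative only, not the actual YARN class; the '%' separator is an assumption for the demo, chosen because, unlike the comma, it does not collide with the RHEL7 controller name "cpu,cpuacct" embedded in cgroup paths.

```java
import java.util.Arrays;
import java.util.List;

public class SeparatorSketch {
    // Single place to define the argument separator (assumed value '%'),
    // instead of hand-built comma joins scattered across callers.
    static final String ARG_SEPARATOR = "%";

    // The 'hidden' join helper the comment proposes: callers never
    // concatenate separators themselves.
    static String joinArgs(List<String> parts) {
        return String.join(ARG_SEPARATOR, parts);
    }

    public static void main(String[] args) {
        List<String> tasksFiles = Arrays.asList(
            "/sys/fs/cgroup/cpu,cpuacct/yarn/container_1/tasks",
            "/sys/fs/cgroup/memory/yarn/container_1/tasks");
        // A comma-joined value would be ambiguous here because the first
        // path itself contains a comma; the dedicated separator keeps
        // each path intact for the container-executor binary.
        System.out.println(joinArgs(tasksFiles));
    }
}
```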
[jira] [Updated] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3794: Hadoop Flags: Reviewed +1, Looks good to me, will commit it shortly. YARN-3790 exists to track the TestWorkPreservingRMRestart failure. > TestRMEmbeddedElector fails because of ambiguous LOG reference > -- > > Key: YARN-3794 > URL: https://issues.apache.org/jira/browse/YARN-3794 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.7.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: YARN-3794.01.patch > > > After YARN-2921, {{MockRM}} has also a {{LOG}} field. Therefore {{LOG}} in > the following code snippet is ambiguous. > {code} > protected AdminService createAdminService() { > return new AdminService(MockRMWithElector.this, getRMContext()) { > @Override > protected EmbeddedElectorService createEmbeddedElectorService() { > return new EmbeddedElectorService(getRMContext()) { > @Override > public void becomeActive() throws > ServiceFailedException { > try { > callbackCalled.set(true); > LOG.info("Callback called. Sleeping now"); > Thread.sleep(delayMs); > LOG.info("Sleep done"); > } catch (InterruptedException e) { > e.printStackTrace(); > } > super.becomeActive(); > } > }; > } > }; > } > {code} > Eclipse gives the following error: > {quote} > The field LOG is defined in an inherited type and an enclosing scope > {quote} > IMO, we should fix this as {{TestRMEmbeddedElector.LOG}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
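The ambiguity described in this issue, a simple name visible both through a superclass and from an enclosing class, can be reproduced in a standalone example (all names here are invented for the demo). Qualifying the reference with the enclosing class, as the issue proposes for {{TestRMEmbeddedElector.LOG}}, resolves it.

```java
public class ShadowDemo {
    static final String LOG = "outer"; // enclosing-scope field

    static class Base {
        static final String LOG = "base"; // inherited field
    }

    static class Inner extends Base {
        String which() {
            // An unqualified 'LOG' here is a compile error: it is visible
            // both from the superclass Base and from the enclosing
            // ShadowDemo scope ("defined in an inherited type and an
            // enclosing scope"). Qualifying it picks one explicitly.
            return ShadowDemo.LOG;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Inner().which()); // outer
    }
}
```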
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583140#comment-14583140 ] Hudson commented on YARN-3794:

FAILURE: Integrated in Hadoop-trunk-Commit #8009 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8009/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java
[jira] [Updated] (YARN-3792) Test case failures in TestDistributedShell after changes for subjira's of YARN-2928
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3792:

Description:
# Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044:
TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
# While testing locally, TestDistributedShell intermittently fails on the vmem-pmem ratio check, hence we need to increase the ratio.

was:
Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044:
TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression

> Test case failures in TestDistributedShell after changes for subjira's of YARN-2928
> --
>
> Key: YARN-3792
> URL: https://issues.apache.org/jira/browse/YARN-3792
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
>
> # Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044:
> TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
> TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
> TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
> # While testing locally, TestDistributedShell intermittently fails on the vmem-pmem ratio check, hence we need to increase the ratio.
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583188#comment-14583188 ] Chengbing Liu commented on YARN-3794:

Thanks [~devaraj.k] for committing!
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583189#comment-14583189 ] Raju Bairishetti commented on YARN-3644:

[~amareshwari] [~Naganarasimha] Thanks for the review and comments. [~Naganarasimha] Yes, this jira is only to make NM wait for RM.
[jira] [Created] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
Bibin A Chundatt created YARN-3798:

Summary: RM shutdown with NoNode exception while updating appAttempt on zk
Key: YARN-3798
URL: https://issues.apache.org/jira/browse/YARN-3798
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt

RM going down with NoNode exception during create of znode for appattempt.
*Please find the exception logs*
{code}
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
    at java.lang.Thread.run(Thread.java:745)
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937
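A common hardening for the failure mode in the trace above, an update that assumes the znode exists, is to fall back to creating the node when it is missing rather than letting NoNode propagate and take down the dispatcher. The sketch below uses a fake in-memory map standing in for ZooKeeper (an assumption; this is not the ZKRMStateStore or ZooKeeper API, and class and method names are invented):

```java
import java.util.HashMap;
import java.util.Map;

public class UpdateOrCreateSketch {
    // Fake in-memory "znode" store keyed by path; a stand-in for ZooKeeper.
    private final Map<String, byte[]> nodes = new HashMap<>();

    // Defensive update: if the node is missing (the NoNode case in the
    // trace above), create it instead of failing. Returns which action
    // was taken so callers and tests can observe the path.
    String updateOrCreate(String path, byte[] data) {
        String action = nodes.containsKey(path) ? "updated" : "created";
        nodes.put(path, data);
        return action;
    }

    byte[] get(String path) {
        return nodes.get(path);
    }

    public static void main(String[] args) {
        UpdateOrCreateSketch store = new UpdateOrCreateSketch();
        // First write finds no node and creates it; second one updates it.
        System.out.println(store.updateOrCreate("/rmstore/attempt_1", new byte[]{1})); // created
        System.out.println(store.updateOrCreate("/rmstore/attempt_1", new byte[]{2})); // updated
    }
}
```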
[jira] [Assigned] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3798:

Assignee: Varun Saxena

> RM shutdown with NoNode exception while updating appAttempt on zk
> --
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Environment: Suse 11 Sp3
> Reporter: Bibin A Chundatt
> Assignee: Varun Saxena
>
> RM going down with NoNode exception during create of znode for appattempt
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583237#comment-14583237 ] Hudson commented on YARN-3794:

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #226 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/226/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583246#comment-14583246 ] Hudson commented on YARN-3794:

FAILURE: Integrated in Hadoop-Yarn-trunk #956 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/956/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java
[jira] [Commented] (YARN-3705) forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state
[ https://issues.apache.org/jira/browse/YARN-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583378#comment-14583378 ] Masatake Iwasaki commented on YARN-3705: YARN-3790 is addressing the failure of TestWorkPreservingRMRestart. > forcemanual transitionToStandby in RM-HA automatic-failover mode should > change elector state > > > Key: YARN-3705 > URL: https://issues.apache.org/jira/browse/YARN-3705 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki > Attachments: YARN-3705.001.patch > > > Executing {{rmadmin -transitionToStandby --forcemanual}} in > automatic-failover.enabled mode makes ResouceManager standby while keeping > the state of ActiveStandbyElector. It should make elector to quit and rejoin > in order to enable other candidates to promote, otherwise forcemanual > transition should not be allowed in automatic-failover mode in order to avoid > confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583389#comment-14583389 ] Hudson commented on YARN-3794: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2172 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2172/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583404#comment-14583404 ] Wei Yan commented on YARN-2194: --- [~kasha], [~sidharta-s], thanks for the comments. Looking into it. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, > YARN-2194-4.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the use of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583413#comment-14583413 ] Naganarasimha G R commented on YARN-3798: - Hi [~bibinchundatt] & [~varun_saxena], I think we should retry again before failing the job. Thoughts? > RM shutdown with NoNode exception while updating appAttempt on zk > - > > Key: YARN-3798 > URL: https://issues.apache.org/jira/browse/YARN-3798 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Varun Saxena > > RM goes down with a NoNode exception during creation of the znode for an appattempt > *Please find the exception logs* > {code} > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2015-06-09 10:09:44,886 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 10:09:44,887 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed > out ZK retries. Giving up! > 2015-06-09 10:09:44,887 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error > updating appAttempt: appattempt_1433764310492_7152_01 > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanag
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583433#comment-14583433 ] Varun Saxena commented on YARN-3798: We do retry a configurable number of times.
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583438#comment-14583438 ] Varun Saxena commented on YARN-3798: Just to elaborate further, this issue occurs because ZooKeeper is in an inconsistent state after one of the ZooKeeper instances goes down. The application node doesn't exist because the ZooKeeper instance hasn't yet synced it. Probably on the first failure we can make a call to {{sync()}} to get consistent data from ZooKeeper. Or we can catch the exception and fail the job (after retries), because IMHO the RM should not go down. Thoughts?
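The "sync before retry" idea discussed above can be sketched as follows. This is a hedged illustration only: the {{Store}} interface, method names, and paths are hypothetical stand-ins, not the actual ZKRMStateStore or ZooKeeper client API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// On a transient NoNode-style failure, request a sync so a lagging replica can
// catch up, then retry; only after the configured retries are exhausted does
// the failure propagate (rather than taking the whole RM down).
public class RetryWithSync {

    interface Store {
        void update(String path) throws Exception; // may fail transiently
        void sync(String path);                    // ask for a consistent view of the path
    }

    static int updateWithRetries(Store store, String path, int maxRetries) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                store.update(path);
                return attempt;                     // how many attempts it took
            } catch (Exception e) {
                if (attempt >= maxRetries) throw e; // maxed out retries: give up
                store.sync(path);                   // resync before the next attempt
            }
        }
    }

    // Simulates a store whose first two updates fail, as a lagging replica might.
    static int demo() throws Exception {
        AtomicInteger calls = new AtomicInteger();
        Store flaky = new Store() {
            public void update(String path) throws Exception {
                if (calls.incrementAndGet() < 3) throw new Exception("NoNode");
            }
            public void sync(String path) { /* no-op in this sketch */ }
        };
        return updateWithRetries(flaky, "/rmstore/appattempt", 5);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("succeeded on attempt " + demo()); // succeeded on attempt 3
    }
}
```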
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583440#comment-14583440 ] Varun Saxena commented on YARN-3798: I meant "The application node doesn't exist because the new ZooKeeper instance the client connects to hasn't yet synced the application node."
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583442#comment-14583442 ] Hudson commented on YARN-3794: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2154 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2154/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java * hadoop-yarn-project/CHANGES.txt
[jira] [Created] (YARN-3799) [JDK8] Fix javadoc errors caused by incorrect or illegal tags in hadoop-yarn-common
Akira AJISAKA created YARN-3799: --- Summary: [JDK8] Fix javadoc errors caused by incorrect or illegal tags in hadoop-yarn-common Key: YARN-3799 URL: https://issues.apache.org/jira/browse/YARN-3799 Project: Hadoop YARN Issue Type: Bug Reporter: Akira AJISAKA {{mvn package -Pdist -DskipTests}} fails with JDK8 due to an illegal javadoc tag. {code} [ERROR] /home/centos/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java:829: error: @param name not found [ERROR] * @param nodelabels [ERROR] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
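The class of error quoted above comes from JDK8's stricter javadoc: an {{@param}} tag must name a parameter that actually exists on the method. A hypothetical before/after illustration (the method below is a stand-in, not the real CommonNodeLabelsManager signature):

```java
import java.util.Map;
import java.util.Set;

public class JavadocFix {

    // Broken form (javadoc 8 fails with "error: @param name not found"),
    // because the tag says "nodelabels" but no parameter has that name:
    //
    //   /** @param nodelabels map of node to labels */
    //   public int replaceLabels(Map<String, Set<String>> newNodeToLabels) { ... }

    /**
     * Replaces the labels on each node.
     *
     * @param newNodeToLabels map from node id to the labels to set; the tag
     *                        now matches the declared parameter name exactly
     * @return the number of nodes updated
     */
    public int replaceLabels(Map<String, Set<String>> newNodeToLabels) {
        return newNodeToLabels.size();
    }

    public static void main(String[] args) {
        int updated = new JavadocFix().replaceLabels(Map.of("node1", Set.of("gpu")));
        System.out.println(updated); // 1
    }
}
```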
[jira] [Assigned] (YARN-3799) [JDK8] Fix javadoc errors caused by incorrect or illegal tags in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA reassigned YARN-3799: --- Assignee: Akira AJISAKA
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583509#comment-14583509 ] Hudson commented on YARN-3794: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #215 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/215/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java * hadoop-yarn-project/CHANGES.txt
[jira] [Updated] (YARN-3799) [JDK8] Fix javadoc errors caused by incorrect or illegal tags
[ https://issues.apache.org/jira/browse/YARN-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-3799: Affects Version/s: 2.8.0
[jira] [Updated] (YARN-3799) [JDK8] Fix javadoc errors caused by incorrect or illegal tags
[ https://issues.apache.org/jira/browse/YARN-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-3799: Summary: [JDK8] Fix javadoc errors caused by incorrect or illegal tags (was: [JDK8] Fix javadoc errors caused by incorrect or illegal tags in hadoop-yarn-common)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583524#comment-14583524 ] Varun Vasudev commented on YARN-3591: - Sorry for the late response. In my opinion, there's little benefit to storing the bad local dirs in the state store. We can just pass the LocalDirHandlerService to LocalResourcesTrackerImpl when it's created, and incoming requests can be checked against the known error dirs in the isResourcePresent function. [~lavkesh], would that solve the problem? > Resource Localisation on a bad disk causes subsequent containers failure > - > > Key: YARN-3591 > URL: https://issues.apache.org/jira/browse/YARN-3591 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, > YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch > > > It happens when a resource is localised on a disk, and after localisation that > disk goes bad. The NM keeps paths for localised resources in memory. At the > time of a resource request, isResourcePresent(rsrc) will be called, which calls > file.exists() on the localised path. > In some cases when the disk has gone bad, inodes are still cached and > file.exists() returns true, but at read time the file will not open. > Note: file.exists() actually calls stat64 natively, which returns true because > it was able to find inode information from the OS. > A proposal is to call file.list() on the parent path of the resource, which > will call open() natively. If the disk is good, it should return an array of > paths of length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
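The proposal in the description can be sketched as below. This is an illustrative stand-in, not the actual LocalResourcesTrackerImpl code: rather than trusting {{File.exists()}} (which stats the path and can succeed off cached inodes on a failed disk), it lists the parent directory, which forces an open() on the underlying filesystem.

```java
import java.io.File;
import java.nio.file.Files;

public class ResourcePresenceCheck {

    static boolean isResourcePresent(File rsrc) {
        File parent = rsrc.getParentFile();
        if (parent == null) {
            return rsrc.exists(); // no parent to list; fall back to a plain stat
        }
        String[] entries = parent.list(); // null when the directory cannot be read
        if (entries == null) {
            return false;
        }
        // Listing succeeded, so the disk is readable; confirm the resource is there.
        for (String name : entries) {
            if (name.equals(rsrc.getName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("localized").toFile();
        File rsrc = new File(dir, "resource.jar");
        System.out.println(isResourcePresent(rsrc)); // false
        Files.createFile(rsrc.toPath());
        System.out.println(isResourcePresent(rsrc)); // true
    }
}
```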
[jira] [Commented] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583546#comment-14583546 ] Hudson commented on YARN-3794: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #224 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/224/]) YARN-3794. TestRMEmbeddedElector fails because of ambiguous LOG reference. (devaraj: rev d8dcfa98e3ca6a6fea414fd503589bb83b7a9c51) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java * hadoop-yarn-project/CHANGES.txt > TestRMEmbeddedElector fails because of ambiguous LOG reference > -- > > Key: YARN-3794 > URL: https://issues.apache.org/jira/browse/YARN-3794 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.7.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Fix For: 2.8.0 > > Attachments: YARN-3794.01.patch > > > After YARN-2921, {{MockRM}} has also a {{LOG}} field. Therefore {{LOG}} in > the following code snippet is ambiguous. > {code} > protected AdminService createAdminService() { > return new AdminService(MockRMWithElector.this, getRMContext()) { > @Override > protected EmbeddedElectorService createEmbeddedElectorService() { > return new EmbeddedElectorService(getRMContext()) { > @Override > public void becomeActive() throws > ServiceFailedException { > try { > callbackCalled.set(true); > LOG.info("Callback called. Sleeping now"); > Thread.sleep(delayMs); > LOG.info("Sleep done"); > } catch (InterruptedException e) { > e.printStackTrace(); > } > super.becomeActive(); > } > }; > } > }; > } > {code} > Eclipse gives the following error: > {quote} > The field LOG is defined in an inherited type and an enclosing scope > {quote} > IMO, we should fix this as {{TestRMEmbeddedElector.LOG}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583659#comment-14583659 ] Xuan Gong commented on YARN-3768: - Thanks for working on this, [~zxu]. Could you also add a test case which verifies that, given a bad environment, it will not throw the exception? Also, why do we need to keep trailing empty strings? > Index out of range exception with environment variables without values > -- > > Key: YARN-3768 > URL: https://issues.apache.org/jira/browse/YARN-3768 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.5.0 >Reporter: Joe Ferner >Assignee: zhihai xu > Attachments: YARN-3768.000.patch > > > Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range > exception occurs if an environment variable is encountered without a value. > I believe this occurs because java will not return empty strings from the > split method. Similar to this > http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
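The split behaviour behind the bug is easy to demonstrate: with the default limit of 0, {{String.split}} drops trailing empty strings, so an entry like FOO= yields a one-element array and indexing the value throws. A negative limit keeps the empty value. The strings below are illustrative; the real parsing lives in {{org.apache.hadoop.yarn.util.Apps}}:

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Default split (limit 0) removes trailing empty strings, so an
        // environment variable without a value produces a 1-element array
        // and parts[1] would throw ArrayIndexOutOfBoundsException.
        String[] parts = "FOO=".split("=");
        System.out.println(parts.length);        // 1

        // A negative limit keeps trailing empty strings, so the (empty)
        // value is preserved at index 1.
        String[] safe = "FOO=".split("=", -1);
        System.out.println(safe.length);         // 2

        // A limit of 2 also works, and additionally tolerates '=' inside
        // the variable's value.
        String[] kv = "BAR=a=b".split("=", 2);
        System.out.println(kv[1]);               // a=b
    }
}
```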
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583742#comment-14583742 ] Bibin A Chundatt commented on YARN-3789: [~rohithsharma] and [~devaraj.k], please review the submitted patch. As mentioned earlier, the checkstyle issue seems unrelated, and no test case addition is required since this is just a logging update. > Refactor logs for LeafQueue#activateApplications() to remove duplicate logging > -- > > Key: YARN-3789 > URL: https://issues.apache.org/jira/browse/YARN-3789 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, > 0003-YARN-3789.patch > > > Duplicate logging from resource manager > during am limit check for each application > {code} > 015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
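The duplication above can be removed by counting the skipped applications and emitting one summary line per activation pass. A sketch of that idea (not the actual LeafQueue code; the method and message text are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: collapse a per-application INFO line into a single summary per
// activateApplications() pass. The AM-limit check is a stand-in.
public class ActivationSketch {
    static List<String> activate(List<String> pending, int amLimit) {
        List<String> log = new ArrayList<>();
        int activated = 0;
        int skipped = 0;
        for (String app : pending) {
            if (activated >= amLimit) {  // stand-in for the amIfStarted check
                skipped++;               // count instead of logging every time
                continue;
            }
            log.add("activated " + app);
            activated++;
        }
        if (skipped > 0) {
            // One line replaces N identical "not starting application" lines.
            log.add("not starting " + skipped
                + " applications as amIfStarted exceeds amLimit");
        }
        return log;
    }
}
```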
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583839#comment-14583839 ] Li Lu commented on YARN-3051: - Hi [~varun_saxena], thanks for the update! Some of my quick thoughts for discussion... # I just realized in this JIRA we are creating the "backing storage read interface for ATS readers", but not the user facing ATS reader APIs. I believe these two topics are different: in this JIRA we're "wiring up" the storage systems, but in the ATS reader APIs, we need to deal with user requirements. That said, I think the main design goal here is to provide a small set of generic interfaces so that we can easily connect them to our writers. We may want to have some brief ideas of the potential user facing features (as [~zjshen] mentioned in a previous comment), but I'm not sure if we need to implement them before we make a concrete design for the storage read interface. # If my understanding in point 1 is right, then perhaps we do not need to worry too much about the huge list of nulls. Of course, at the code level we may want to do some cosmetic fixes, but since those interfaces are not user facing, making them more general may be more important, I think. # I still think that when doing the v2 interface design, it is fine, if not even beneficial, to start from scratch rather than thinking about the existing v1 design. If we're not implementing some v1 features as first-class in v2 storage implementations, maybe we can simply leave them out of the storage-level interfaces? (I assume we'll have an intermediate layer to do the wiring between our user facing reader APIs and the storage interfaces.) # bq. Now from backing storage implementation viewpoint, would it make more sense to let these query params be passed as strings or do datatype conversion ? I've got no strong preference on this.
Leaving them as a generic type (like string) gives the storage layer more freedom to interpret the data, but the readers need to ensure they understand the types by themselves. BTW, could you please briefly skim through the list of Jenkins warnings and see if they're critical? Thanks! > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583854#comment-14583854 ] Jian He commented on YARN-3017: --- IMHO, although this theoretically doesn't break compatibility, it may do so in practice for some existing 3rd-party tools. Also, if a cluster is rolling-upgraded from 2.6, we have the same containerId printed in two different formats, which makes the debugging process harder. I don't know why containerId was originally written to print only 2 digits, but one reason I can think of is that in reality we won't see a large number of attempt failures (especially since max-attempts is set to 2 by default). > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Assignee: Mohammad Shahid Khan >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch, > YARN-3017_3.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... 
> 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583878#comment-14583878 ] Li Lu commented on YARN-3051: - I verified locally that the pre-patch findbugs warnings no longer exist. > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583890#comment-14583890 ] Xuan Gong commented on YARN-3714: - [~iwasakims] Did you see this issue in a real cluster environment? As far as I know, when we start the RM and only set up yarn.resourcemanager.hostname.rm-id, we will set this for all the service addresses, including yarn.resourcemanager.webapp.address.rm-id and yarn.resourcemanager.webapp.https.address.rm-id. {code} // Set HA configuration should be done before login this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf)); if (this.rmContext.isHAEnabled()) { HAUtil.verifyAndSetConfiguration(this.conf); } {code} > AM proxy filter can not get proper default proxy address if RM-HA is enabled > > > Key: YARN-3714 > URL: https://issues.apache.org/jira/browse/YARN-3714 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3714.001.patch > > > Default proxy address could not be got without setting > {{yarn.resourcemanager.webapp.address._rm-id_}} and/or > {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583892#comment-14583892 ] Xuan Gong commented on YARN-3017: - +1 for jian's comment. We do not need to change this. > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Assignee: Mohammad Shahid Khan >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch, > YARN-3017_3.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583903#comment-14583903 ] Xuan Gong commented on YARN-3543: - [~rohithsharma] Could we avoid directly changing ApplicationReport.newInstance()? This will break other applications, such as Tez. We should add a compatible newInstance overload instead. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, > YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
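The compatible change Xuan asks for is the usual factory-overload pattern: keep the existing newInstance signature and add a new one carrying the extra field, with the old one delegating with a default. A simplified sketch (the class and fields here are illustrative stand-ins for ApplicationReport, not its real API):

```java
// Sketch of compatible API evolution: instead of changing an existing
// factory signature (which breaks compiled callers such as Tez), add an
// overload and keep the old one delegating with a default value.
public class Report {
    final String appId;
    final boolean unmanagedAM;   // the newly exposed piece of information

    private Report(String appId, boolean unmanagedAM) {
        this.appId = appId;
        this.unmanagedAM = unmanagedAM;
    }

    // Existing callers keep compiling and linking against this signature.
    public static Report newInstance(String appId) {
        return newInstance(appId, false);   // conservative default
    }

    // New overload carries the extra field.
    public static Report newInstance(String appId, boolean unmanagedAM) {
        return new Report(appId, unmanagedAM);
    }
}
```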
[jira] [Commented] (YARN-2497) Changes for fair scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583918#comment-14583918 ] Naganarasimha G R commented on YARN-2497: - Hi [~yufeldman], I would like to work on this jira if you have not yet started working, please inform can i take over this jira ? > Changes for fair scheduler to support allocate resource respect labels > -- > > Key: YARN-2497 > URL: https://issues.apache.org/jira/browse/YARN-2497 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Yuliya Feldman > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583925#comment-14583925 ] zhihai xu commented on YARN-3591: - Hi [~vvasudev], thanks for the suggestion. It looks like your suggestion is similar to [~lavkesh]'s original patch 0001-YARN-3591.patch. Compared to that patch, your suggestion may sometimes miss a disk failure: LocalDirsHandlerService only calls {{checkDirs}} every 2 minutes by default, so if the disk failure happens right after {{checkDirs}} is called and before {{isResourcePresent}} is called, your suggestion won't detect the disk failure but [~lavkesh]'s original patch can. So it looks like [~lavkesh]'s original patch is better than your suggestion. That is my understanding; please correct me if I am wrong. > Resource Localisation on a bad disk causes subsequent containers failure > - > > Key: YARN-3591 > URL: https://issues.apache.org/jira/browse/YARN-3591 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, > YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch > > > It happens when a resource is localised on the disk, after localising that > disk has gone bad. NM keeps paths for localised resources in memory. At the > time of resource request isResourcePresent(rsrc) will be called which calls > file.exists() on the localised path. > In some cases when disk has gone bad, inodes are stilled cached and > file.exists() returns true. But at the time of reading, file will not open. > Note: file.exists() actually calls stat64 natively which returns true because > it was able to find inode information from the OS. > A proposal is to call file.list() on the parent path of the resource, which > will call open() natively. 
If the disk is good it should return an array of > paths with length at-least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
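The proposal in the description can be sketched as follows. The failure mode itself (a cached inode making file.exists() return true on a dead disk) cannot be shown in a unit example, but the shape of the parent-listing check can; the method name mirrors the comment, the body is an illustrative stand-in for the NM code:

```java
import java.io.File;

public class ResourceCheckSketch {
    // Sketch of the proposed check: consider a localized resource present
    // only if its parent directory can actually be listed (list() performs
    // a real open() on the directory) and the listing is non-empty.
    static boolean isResourcePresent(File rsrc) {
        if (!rsrc.exists()) {            // fast path: stat() says it's gone
            return false;
        }
        File parent = rsrc.getParentFile();
        if (parent == null) {
            return true;
        }
        String[] entries = parent.list(); // forces an open() on the directory
        // On a healthy disk, a directory containing rsrc lists at least one
        // entry; null (or empty) indicates an I/O problem with the disk.
        return entries != null && entries.length > 0;
    }
}
```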
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584034#comment-14584034 ] Zhijie Shen commented on YARN-3044: --- [~Naganarasimha], thanks for fixing the race, but the TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow failure seems to be caused by your new patch. Can you double-check? > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, > YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584073#comment-14584073 ] Naganarasimha G R commented on YARN-3044: - Hi [~zjshen], I have already checked it: I am able to reproduce the failure even without my patch applied, and I had run this test case individually with my changes and it was fine! Based on my analysis it is caused by multiple issues: # The timeline auxiliary service is set up only for {{testDSShellWithoutDomainV2CustomizedFlow}} & {{testDSShellWithoutDomainV2CustomizedFlow}} in {{TestDistributedShell.setupInternal}}, but yarn.nodemanager.container-metrics.enable is enabled by default, hence metrics are always trying to be published. # Once {{TimelineClientImpl.putObjects}} finishes all the retry attempts in {{pollTimelineServiceAddress}} and {{timelineServiceAddress}} is still null (not updated), the next while loop should be run only if {{timelineServiceAddress}} is not null; if not, a NullPointerException is thrown in {{constructResURI}}. On the NullPointerException all the threads expire, and the thread pool executor in ContainersMonitorImpl starts rejecting new threads, hence the container metrics are not getting launched. Will try to provide the patch at the earliest. 
> [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, > YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
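The second issue Naganarasimha describes amounts to a missing null guard after the address-polling retries. A sketch of that guard follows; the method names mirror the comment, but the bodies are illustrative stand-ins for TimelineClientImpl, and the URI shape and port are assumptions of this sketch:

```java
// Sketch of the guard described above: after the address-polling retries
// are exhausted, only proceed if the address was actually resolved;
// otherwise fail with a clear error instead of a NullPointerException
// deeper inside URI construction.
public class PutObjectsSketch {
    String timelineServiceAddress;          // set by a background resolver

    String pollTimelineServiceAddress(int retries) {
        // Stand-in: the real client waits/retries until the address arrives.
        return timelineServiceAddress;
    }

    String putObjects(int retries) {
        String addr = pollTimelineServiceAddress(retries);
        if (addr == null) {
            // The guard that prevents the NPE in constructResURI.
            throw new IllegalStateException(
                "Timeline service address not resolved after retries");
        }
        return "http://" + addr + "/ws/v2/timeline"; // constructResURI stand-in
    }
}
```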
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584099#comment-14584099 ] Masatake Iwasaki commented on YARN-3714: Thanks for the comment, [~xgong]. {{HAUtil.verifyAndSetConfiguration}} works only on the RM node. AMs running in slave nodes also need to know the RM webapp addresses. > AM proxy filter can not get proper default proxy address if RM-HA is enabled > > > Key: YARN-3714 > URL: https://issues.apache.org/jira/browse/YARN-3714 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3714.001.patch > > > Default proxy address could not be got without setting > {{yarn.resourcemanager.webapp.address._rm-id_}} and/or > {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
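The fallback Masatake is after can be sketched as deriving the per-RM webapp address from the hostname key when the explicit webapp key is absent — which is what an AM running on a slave node would need. The keys follow the ones quoted above; the default port and the Map-based config are assumptions of this sketch, not the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

public class ProxyAddrSketch {
    // Sketch: prefer the explicit per-RM webapp address; otherwise fall
    // back to yarn.resourcemanager.hostname.<rm-id> plus a default port.
    static String webappAddress(Map<String, String> conf, String rmId) {
        String explicit = conf.get("yarn.resourcemanager.webapp.address." + rmId);
        if (explicit != null) {
            return explicit;
        }
        String host = conf.get("yarn.resourcemanager.hostname." + rmId);
        return host == null ? null : host + ":8088";  // assumed default port
    }
}
```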
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584141#comment-14584141 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~varun_saxena], [~bibinchundatt] thank you for taking this. One of our users also faced the same issue. sync() is effective only when accessing from multiple clients; ZKRMStateStore has only one client, so I think it's not effective in this case. BTW, the expected behaviour can be achieved by catching NoNodeException, but we should check why and when it happens. > RM shutdown with NoNode exception while updating appAttempt on zk > - > > Key: YARN-3798 > URL: https://issues.apache.org/jira/browse/YARN-3798 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Varun Saxena > > RM going down with NoNode exception during create of znode for appattempt > *Please find the exception logs* > {code} > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2015-06-09 10:09:44,886 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 10:09:44,887 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed > out ZK retries. Giving up! > 2015-06-09 10:09:44,887 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error > updating appAttempt: appattempt_1433764310492_7152_01 > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.
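Tsuyoshi's suggestion of catching NoNodeException can be sketched with an in-memory stand-in for the znode tree; in ZKRMStateStore the equivalent would be catching {{KeeperException.NoNodeException}} around the update and issuing a create, rather than letting the retries max out and take the RM down. Everything below is an illustrative stand-in, not ZooKeeper API:

```java
import java.util.HashMap;
import java.util.Map;

public class ZkUpdateSketch {
    // Stand-in for KeeperException.NoNodeException.
    static class NoNodeException extends Exception {}

    final Map<String, byte[]> znodes = new HashMap<>();  // stand-in ZK tree

    void setData(String path, byte[] data) throws NoNodeException {
        if (!znodes.containsKey(path)) throw new NoNodeException();
        znodes.put(path, data);
    }

    void create(String path, byte[] data) {
        znodes.put(path, data);
    }

    // The recovery path: catch NoNode on update and recreate the node
    // instead of propagating a fatal error.
    void updateWithRecovery(String path, byte[] data) {
        try {
            setData(path, data);
        } catch (NoNodeException e) {
            create(path, data);
        }
    }
}
```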
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584148#comment-14584148 ] Zhijie Shen commented on YARN-3051: --- bq. APIs' for querying individual entity/flow/flow run/user and APIs' for querying a set of entities/flow runs/flows/users. APIs' such a set of flows/users will contain aggregated data. The reason for separate endpoints for entities, flows, users,etc. is because of the different tables in HBase/Phoenix schema. I think we don't store the first class citizen entity in a different way and in different tables (Li/Vrushali, correct me if I'm wrong). When fetching an entity, it doesn't matter whether it is a customized entity or a predefined entity such as ApplicationEntity. In fact, we have two levels of interfaces: one is the storage interface and the other is the user-oriented interface. I think it's a good idea to let the user-oriented interface have more specific/advanced APIs to handle the special entity objects, while the storage interface could have fewer, more uniform APIs to reuse the common logic as much as possible. Thoughts? bq. Every query param will be received as a String, even timestamp. Now from backing storage implementation viewpoint, would it make more sense to let these query params be passed as strings or do datatype conversion ? I think we need to take the generic type as the param. If it's transformed to a string, it is likely to be difficult to recover the original type information. For example, when we see a string "true", how do we know whether it was originally the string "true" or the boolean true? Also, is "1234567" a number or a string that represents a vehicle license? 
> [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
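Zhijie's point about losing type information can be made concrete: once values are flattened to strings, a boolean true and the literal string "true" become indistinguishable, so a string-based filter can match values of the wrong type. A minimal sketch (the filter methods are illustrative, not the ATS reader API):

```java
public class FilterSketch {
    // Typed comparison: Boolean.TRUE does not equal the string "true",
    // so the type distinction survives.
    static boolean matches(Object stored, Object queried) {
        return stored.equals(queried);
    }

    // String-flattened comparison: the distinction is lost, producing a
    // false positive for a boolean stored value queried as a string.
    static boolean matchesAsStrings(Object stored, Object queried) {
        return String.valueOf(stored).equals(String.valueOf(queried));
    }
}
```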
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584154#comment-14584154 ] Tsuyoshi Ozawa commented on YARN-3017: -- Thanks, committers, for your comments. This change can unexpectedly affect many places. The inconsistency is ugly, but it actually works. We can keep this format to preserve compatibility. > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Assignee: Mohammad Shahid Khan >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch, > YARN-3017_3.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
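The mismatch under discussion is purely a zero-padding difference between the two toString() formats, which is why a grep built for one padding misses the other. The sketch below reproduces the shapes; the exact pad widths are inferred, not copied from the source, so treat them as illustrative:

```java
public class IdFormatSketch {
    // ApplicationAttemptId-style: attempt number padded to 6 digits.
    static String attemptId(long ts, int app, int attempt) {
        return String.format("appattempt_%d_%04d_%06d", ts, app, attempt);
    }

    // ContainerId-style: the same attempt number padded to only 2 digits,
    // followed by a 6-digit container number.
    static String containerId(long ts, int app, int attempt, long container) {
        return String.format("container_%d_%04d_%02d_%06d",
            ts, app, attempt, container);
    }
}
```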
[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishai Menache updated YARN-3656: Attachment: YARN-3656-v1.patch > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Reporter: Ishai Menache > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.patch > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3800) Simplify inmemory state for ReservationAllocation
Anubhav Dhoot created YARN-3800: --- Summary: Simplify inmemory state for ReservationAllocation Key: YARN-3800 URL: https://issues.apache.org/jira/browse/YARN-3800 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Instead of storing the ReservationRequest, we store the Resource for allocations, as that's the only thing we need. Ultimately we convert everything to resources anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
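The proposed simplification can be sketched with stand-in types (these are illustrative, not the actual YARN classes): all the planner ultimately needs per interval is the aggregate Resource, i.e. the per-container capability multiplied by the container count, so that is all that need be stored.

```java
// Hypothetical sketch of the in-memory state simplification; the class names
// mimic the YARN concepts but are not the real org.apache.hadoop.yarn types.
import java.util.HashMap;
import java.util.Map;

public class ReservationStateSketch {
    static final class Resource {
        final long memoryMb; final int vcores;
        Resource(long memoryMb, int vcores) { this.memoryMb = memoryMb; this.vcores = vcores; }
        public String toString() { return "<memory:" + memoryMb + ", vCores:" + vcores + ">"; }
    }
    // What a ReservationRequest carries: a per-container capability and a count.
    static final class ReservationRequest {
        final Resource capability; final int numContainers;
        ReservationRequest(Resource c, int n) { capability = c; numContainers = n; }
    }
    // Convert once, up front: the only information needed per interval.
    static Resource toResource(ReservationRequest r) {
        return new Resource(r.capability.memoryMb * r.numContainers,
                            r.capability.vcores * r.numContainers);
    }

    public static void main(String[] args) {
        Map<Long, Resource> allocationByStep = new HashMap<>();
        allocationByStep.put(0L, toResource(new ReservationRequest(new Resource(1024, 1), 10)));
        System.out.println(allocationByStep.get(0L)); // <memory:10240, vCores:10>
    }
}
```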
[jira] [Updated] (YARN-3800) Simplify inmemory state for ReservationAllocation
[ https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3800: Attachment: YARN-3800.001.patch > Simplify inmemory state for ReservationAllocation > - > > Key: YARN-3800 > URL: https://issues.apache.org/jira/browse/YARN-3800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3800.001.patch > > > Instead of storing the ReservationRequest we store the Resource for > allocations, as thats the only thing we need. Ultimately we convert > everything to resources anyway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-3656: - Assignee: Jonathan Yaniv > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.patch > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishai Menache updated YARN-3656: Attachment: lowcostrayonexternal_v2.pdf new version of the design doc, including a class diagram. > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.patch, > lowcostrayonexternal_v2.pdf > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584189#comment-14584189 ] Zhijie Shen commented on YARN-3044: --- Interesting. Is it an intermittent test failure? I cannot reproduce it on my machine with a clean YARN-2928 branch. BTW, I tried the new patch on a single node cluster. The race condition is fixed. > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, > YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584197#comment-14584197 ] Li Lu commented on YARN-3051: - bq. APIs' for querying individual entity/flow/flow run/user and APIs' for querying a set of entities/flow runs/flows/users. APIs' such a set of flows/users will contain aggregated data. bq. I think we don't store the first class citizen entity in a different way and in different tables (Li/Vrushali, correct me If I'm wrong). When fetching an entity, it doesn't matter it is a customized entity or a predefined entity such as ApplicationEntity. If we're discussing the storage read interface, why is it harmful to explicitly separate the interfaces for raw data and aggregated data, as [~zjshen] proposed before? We can work on the raw data interface first, and take up the aggregated one when designing aggregations. bq. If it's transformed to a string, it is likely to be difficult to recover the original type information. I agree. A follow-up concern is: who is to maintain, or interpret, the type information? I assume we need the readers themselves to keep track of this? > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1198: -- Assignee: Wangda Tan (was: Craig Welch) > Capacity Scheduler headroom calculation does not work as expected > - > > Key: YARN-1198 > URL: https://issues.apache.org/jira/browse/YARN-1198 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Omkar Vinit Joshi >Assignee: Wangda Tan > Attachments: YARN-1198.1.patch, YARN-1198.10.patch, > YARN-1198.11-with-1857.patch, YARN-1198.11.patch, > YARN-1198.12-with-1857.patch, YARN-1198.2.patch, YARN-1198.3.patch, > YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, > YARN-1198.8.patch, YARN-1198.9.patch > > > Today headroom calculation (for the app) takes place only when > * New node is added/removed from the cluster > * New container is getting assigned to the application. > However there are potentially lot of situations which are not considered for > this calculation > * If a container finishes then headroom for that application will change and > should be notified to the AM accordingly. > * If a single user has submitted multiple applications (app1 and app2) to the > same queue then > ** If app1's container finishes then not only app1's but also app2's AM > should be notified about the change in headroom. > ** Similarly if a container is assigned to any applications app1/app2 then > both AM should be notified about their headroom. > ** To simplify the whole communication process it is ideal to keep headroom > per User per LeafQueue so that everyone gets the same picture (apps belonging > to same user and submitted in same queue). > * If a new user submits an application to the queue then all applications > submitted by all users in that queue should be notified of the headroom > change. > * Also today headroom is an absolute number ( I think it should be normalized > but then this is going to be not backward compatible..) 
> * Also when admin user refreshes queue headroom has to be updated. > These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1680: -- Assignee: Chen He (was: Craig Welch) > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, > YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster > slow start is set to 1. > A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 map tasks got killed), and the MRAppMaster blacklisted > it. All reducer tasks are now running in the cluster. > The MRAppMaster does not preempt the reducers because the headroom used in the > reducer-preemption calculation includes the blacklisted node's memory. This makes > jobs hang forever (the ResourceManager does not assign any new containers on > blacklisted nodes but returns an availableResources value that includes the cluster's free > memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
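The fix this issue asks for can be sketched in a few lines (hypothetical illustration, not the YARN scheduler code): when computing the headroom reported to the AM, subtract the free capacity of blacklisted nodes, since the RM will never place new containers there.

```java
// Toy headroom calculation; node names and the map-based model are
// illustrative, not the actual ResourceManager data structures.
import java.util.Map;
import java.util.Set;

public class HeadroomSketch {
    static long headroomMb(Map<String, Long> freeMbByNode, Set<String> blacklisted) {
        long total = 0;
        for (Map.Entry<String, Long> e : freeMbByNode.entrySet()) {
            if (!blacklisted.contains(e.getKey())) total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> free = Map.of("NM-1", 0L, "NM-2", 0L, "NM-3", 0L, "NM-4", 3072L);
        // Counting NM-4's free memory makes the AM believe it has headroom,
        // so it never preempts reducers and the job hangs.
        System.out.println(headroomMb(free, Set.of()));       // 3072
        System.out.println(headroomMb(free, Set.of("NM-4"))); // 0
    }
}
```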
[jira] [Updated] (YARN-3320) Support a Priority OrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3320: -- Assignee: Wangda Tan (was: Craig Welch) > Support a Priority OrderingPolicy > - > > Key: YARN-3320 > URL: https://issues.apache.org/jira/browse/YARN-3320 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Reporter: Craig Welch >Assignee: Wangda Tan > > When [YARN-2004] is complete, bring relevant logic into the OrderingPolicy > framework -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584266#comment-14584266 ] Craig Welch commented on YARN-1680: --- [~airbots], unfortunately, I'm having no more luck seeing this through than you have had! I have gone ahead and handed this back to you; if you don't believe you'll have time to work on it, you might want to see if [~leftnoteasy] is interested in picking it up. Thanks. > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, > YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster > slow start is set to 1. > A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 map tasks got killed), and the MRAppMaster blacklisted > it. All reducer tasks are now running in the cluster. > The MRAppMaster does not preempt the reducers because the headroom used in the > reducer-preemption calculation includes the blacklisted node's memory. This makes > jobs hang forever (the ResourceManager does not assign any new containers on > blacklisted nodes but returns an availableResources value that includes the cluster's free > memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1039) Add parameter for YARN resource requests to indicate "long lived"
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1039: -- Assignee: Vinod Kumar Vavilapalli (was: Craig Welch) > Add parameter for YARN resource requests to indicate "long lived" > - > > Key: YARN-1039 > URL: https://issues.apache.org/jira/browse/YARN-1039 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.0.0, 2.1.1-beta >Reporter: Steve Loughran >Assignee: Vinod Kumar Vavilapalli > Attachments: YARN-1039.1.patch, YARN-1039.2.patch, YARN-1039.3.patch > > > A container request could support a new parameter "long-lived". This could be > used by a scheduler that would know not to host the service on a transient > (cloud: spot priced) node. > Schedulers could also decide whether or not to allocate multiple long-lived > containers on the same node -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation
[ https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584269#comment-14584269 ] Hadoop QA commented on YARN-3800: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 44s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 50s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 57s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 7 new checkstyle issues (total was 54, now 55). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 46m 30s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 85m 50s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739350/YARN-3800.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / eef7b50 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8242/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8242/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8242/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8242/console | This message was automatically generated. > Simplify inmemory state for ReservationAllocation > - > > Key: YARN-3800 > URL: https://issues.apache.org/jira/browse/YARN-3800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3800.001.patch > > > Instead of storing the ReservationRequest we store the Resource for > allocations, as thats the only thing we need. Ultimately we convert > everything to resources anyway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1039) Add parameter for YARN resource requests to indicate "long lived"
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584271#comment-14584271 ] Craig Welch commented on YARN-1039: --- I'll go back to my earlier assertion that it's not really "duration" we are concerned with here — that is covered in various ways in other places — but rather the notion of an application type, a "batch" or a "service", with the defining characteristic being the potential for "continuous operation" (service) versus a "unit of work which will run to completion" (batch); an enumeration of "service" and "batch" makes sense to me. In any case, [~vinodkv], there still seems to be enough diversity of opinion here to require some ongoing discussion/reconciliation, so I will leave this in your capable hands. > Add parameter for YARN resource requests to indicate "long lived" > - > > Key: YARN-1039 > URL: https://issues.apache.org/jira/browse/YARN-1039 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.0.0, 2.1.1-beta >Reporter: Steve Loughran >Assignee: Vinod Kumar Vavilapalli > Attachments: YARN-1039.1.patch, YARN-1039.2.patch, YARN-1039.3.patch > > > A container request could support a new parameter "long-lived". This could be > used by a scheduler that would know not to host the service on a transient > (cloud: spot priced) node. > Schedulers could also decide whether or not to allocate multiple long-lived > containers on the same node -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3510) Create an extension of ProportionalCapacityPreemptionPolicy which preempts a number of containers from each application in a way which respects fairness
[ https://issues.apache.org/jira/browse/YARN-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3510: -- Assignee: Wangda Tan (was: Craig Welch) > Create an extension of ProportionalCapacityPreemptionPolicy which preempts a > number of containers from each application in a way which respects fairness > > > Key: YARN-3510 > URL: https://issues.apache.org/jira/browse/YARN-3510 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Craig Welch >Assignee: Wangda Tan > Attachments: YARN-3510.2.patch, YARN-3510.3.patch, YARN-3510.5.patch, > YARN-3510.6.patch > > > The ProportionalCapacityPreemptionPolicy preempts as many containers from > applications as it can during its preemption run. For fifo this makes > sense, as it is preempting in reverse order and therefore maintaining the > primacy of the "oldest". For fair ordering this does not have the desired > effect - instead, it should preempt a number of containers from each > application which maintains a fair balance (or close to one) between > them -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584272#comment-14584272 ] Chris Douglas commented on YARN-1983: - (sorry for the delayed reply; missed this) bq. I was proposing we continue the same without adding a new CLC field. Are we both saying the same thing then? Yeah, I think we agree. We don't need to extend the CLC definition for this use case, because it's less invasive to add a composite CE that can inspect the CLC and demux on a set of rules. I scanned the patch on YARN-1964, and maybe I'm being dense but I couldn't find the demux. It does some validation using patterns... > Support heterogeneous container types at runtime on YARN > > > Key: YARN-1983 > URL: https://issues.apache.org/jira/browse/YARN-1983 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junping Du > Attachments: YARN-1983.2.patch, YARN-1983.patch > > > Different container types (default, LXC, docker, VM box, etc.) have different > semantics on isolation of security, namespace/env, performance, etc. > Per discussions in YARN-1964, we have some good thoughts on supporting > different types of containers running on YARN, specified by the application at > runtime, which largely enhances YARN's flexibility to meet heterogeneous apps' > requirements on isolation at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3711) Documentation of ResourceManager HA should explain about webapp address configuration
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3711: --- Description: There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. was: There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper default addresses in RM-HA mode now. > Documentation of ResourceManager HA should explain about webapp address > configuration > - > > Key: YARN-3711 > URL: https://issues.apache.org/jira/browse/YARN-3711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3711.002.patch > > > There should be explanation about webapp address in addition to RPC address. > AM proxy filter needs explicit definition of > {{yarn.resourcemanager.webapp.address._rm-id_}} and/or > {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses > in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain about webapp address configuration
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584327#comment-14584327 ] Masatake Iwasaki commented on YARN-3711: Though {{HAUtil.verifyAndSetConfiguration}} updates the configuration for HA on the RM node, it only handles the rm-id of that node. AMs running on slave nodes need to know the webapp addresses of all RMs. Without explicit definition of the webapp addresses, the URL of an application shown in the RM UI refers to the wrong host name. > Documentation of ResourceManager HA should explain about webapp address > configuration > - > > Key: YARN-3711 > URL: https://issues.apache.org/jira/browse/YARN-3711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3711.002.patch > > > There should be explanation about webapp address in addition to RPC address. > AM proxy filter needs explicit definition of > {{yarn.resourcemanager.webapp.address._rm-id_}} and/or > {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses > in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
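A sketch of the kind of yarn-site.xml fragment the documentation should show, using the per-rm-id property named in the issue (rm-ids and host names here are placeholders):

```xml
<!-- yarn-site.xml fragment; rm1/rm2 and the hosts are placeholder values -->
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>rm1.example.com:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>rm2.example.com:8088</value>
</property>
```

With both entries defined on every node, the AM proxy filter can resolve the correct webapp address regardless of which RM is active.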
[jira] [Created] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
Tsuyoshi Ozawa created YARN-3801: Summary: [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util Key: YARN-3801 URL: https://issues.apache.org/jira/browse/YARN-3801 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa timelineservice depends on hbase-client and hbase-testing-util, and they depend on jdk.tools:1.7. This causes Hadoop compilation to fail with JDK 8. {quote} [WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability the error(s) are [ Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3801: - Issue Type: Sub-task (was: Bug) Parent: YARN-2928 > [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util > - > > Key: YARN-3801 > URL: https://issues.apache.org/jira/browse/YARN-3801 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa > > timelineservice depends on hbase-client and hbase-testing-util, and they > dpend on jdk.tools:1.7. This leads to fail to compile hadoop with JDK8. > {quote} > [WARNING] > Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency > are: > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT > +-jdk.tools:jdk.tools:1.8 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-client:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-testing-util:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence > failed with message: > Failed while enforcing releasability the error(s) are [ > Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency > are: > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT > +-jdk.tools:jdk.tools:1.8 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-client:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > 
+-org.apache.hbase:hbase-testing-util:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3801: - Attachment: YARN-3801.001.patch Attaching a first patch. > [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util > - > > Key: YARN-3801 > URL: https://issues.apache.org/jira/browse/YARN-3801 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa > Attachments: YARN-3801.001.patch > > > timelineservice depends on hbase-client and hbase-testing-util, and they > dpend on jdk.tools:1.7. This leads to fail to compile hadoop with JDK8. > {quote} > [WARNING] > Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency > are: > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT > +-jdk.tools:jdk.tools:1.8 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-client:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-testing-util:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence > failed with message: > Failed while enforcing releasability the error(s) are [ > Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency > are: > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT > +-jdk.tools:jdk.tools:1.8 > and > +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-client:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > and > 
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT > +-org.apache.hbase:hbase-testing-util:1.0.1 > +-org.apache.hbase:hbase-annotations:1.0.1 > +-jdk.tools:jdk.tools:1.7 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
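A sketch of the kind of pom.xml change such a patch typically makes (coordinates taken from the quoted enforcer output; this is an illustration, not the attached YARN-3801.001.patch):

```xml
<!-- hadoop-yarn-server-timelineservice pom.xml: exclude jdk.tools from the
     hbase artifacts so only the jdk.tools:1.8 pulled in via
     hadoop-annotations remains on the dependency graph. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.0.1</version>
  <exclusions>
    <exclusion>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-testing-util</artifactId>
  <version>1.0.1</version>
  <scope>test</scope>
  <exclusions>
    <exclusion>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With the 1.7 edges excluded, the DependencyConvergence enforcer rule sees a single jdk.tools version and the build proceeds under JDK 8.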
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584334#comment-14584334 ] Hadoop QA commented on YARN-3801:
----------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12739379/YARN-3801.001.patch |
| Optional Tests | javadoc javac unit |
| git revision | trunk / eef7b50 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8244/console |

This message was automatically generated.
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584411#comment-14584411 ] Chen He commented on YARN-1680:
--------------------------------

Thank you, [~cwelch]. Appreciate you assigning it back. :)

> availableResources sent to applicationMaster in heartbeat should exclude
> blacklistedNodes free memory.
> ------------------------------------------------------------------------
>
>          Key: YARN-1680
>          URL: https://issues.apache.org/jira/browse/YARN-1680
>      Project: Hadoop YARN
>   Issue Type: Sub-task
>   Components: capacityscheduler
> Affects Versions: 2.2.0, 2.3.0
>  Environment: SuSE 11 SP2 + Hadoop-2.3
>     Reporter: Rohith
>     Assignee: Chen He
>  Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
> There are 4 NodeManagers with 8GB each, so total cluster capacity is 32GB. Cluster slow start is set to 1.
> A job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks
> were killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because the headroom used in the reducer-preemption
> calculation still counts the blacklisted node's memory. This makes jobs hang forever: the
> ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources
> it returns still reflects the whole cluster's free memory.
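[Editor's note] The headroom bug described above can be illustrated with a small sketch. The numbers and method names below are purely illustrative, not YARN's actual scheduler code: counting a blacklisted node's free memory inflates the headroom reported to the AM, which suppresses reducer preemption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class HeadroomSketch {
    // Sum free memory across nodes, skipping blacklisted ones.
    static long headroomMb(Map<String, Long> nodeFreeMb, Set<String> blacklisted) {
        long total = 0;
        for (Map.Entry<String, Long> e : nodeFreeMb.entrySet()) {
            if (!blacklisted.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative: NM-1..NM-3 are almost full with reducers,
        // NM-4 is blacklisted so its free memory is unusable.
        Map<String, Long> free = new LinkedHashMap<>();
        free.put("NM-1", 1024L);
        free.put("NM-2", 1024L);
        free.put("NM-3", 1024L);
        free.put("NM-4", 8192L); // blacklisted
        Set<String> blacklisted = Collections.singleton("NM-4");

        // A naive sum over all nodes would report 11264 MB of headroom,
        // making the AM believe a map can still be scheduled; excluding
        // the blacklisted node yields the usable 3072 MB.
        System.out.println(headroomMb(free, blacklisted));
    }
}
```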
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584420#comment-14584420 ] Tsuyoshi Ozawa commented on YARN-3801:
--------------------------------------

[~sjlee0] could you take a look?
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584426#comment-14584426 ] Sean Busbey commented on YARN-3801:
-----------------------------------

What's your timeline? HBase is adding support for JDK8 in the upcoming 1.2 release line, so we'll get this cleaned up in our code base.
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584433#comment-14584433 ] Tsuyoshi Ozawa commented on YARN-3801:
--------------------------------------

Currently, the timeline service depends on hbase-client and hbase-testing-util 1.0.1. We can upgrade after the 1.2.0 release.
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584472#comment-14584472 ] Bibin A Chundatt commented on YARN-3798:
----------------------------------------

[~ozawa] thank you for looking into this issue. The ZK service went down multiple times during the crash and during the RM's transitions between standby and active state.
{quote}
expected behaviour can be done by catching NoNodeException
{quote}
Yes, we should try to find the root cause of this. I will soon upload the relevant RM and ZK logs from around this exception.

[~Naganarasimha]
{quote}
i think we should retry again before making the job fail?
{quote}
We already have retry and timeout for ZK: *recovery.ZKRMStateStore$ZKAction.runWithRetries*

> RM shutdown with NoNode exception while updating appAttempt on zk
> -----------------------------------------------------------------
>
>          Key: YARN-3798
>          URL: https://issues.apache.org/jira/browse/YARN-3798
>      Project: Hadoop YARN
>   Issue Type: Bug
>   Components: resourcemanager
>  Environment: Suse 11 Sp3
>     Reporter: Bibin A Chundatt
>     Assignee: Varun Saxena
>
> RM goes down with a NoNode exception during creation of the znode for an appattempt.
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>     at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
> 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>     at org.apache.zookeeper.KeeperException.create(KeeperException
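[Editor's note] The runWithRetries pattern visible in the stack trace can be sketched generically. All names below are hypothetical stand-ins, not the actual ZKRMStateStore code: retries are bounded, and once they are exhausted the last exception propagates, which is why a NoNodeException that can never succeed on retry (the parent znode is simply gone) still has to be handled by the caller rather than left to take the RM down.

```java
import java.util.concurrent.Callable;

public class ZkRetrySketch {
    // Hypothetical stand-in for a retriable ZK failure such as a
    // connection loss; NoNodeException, by contrast, does not heal
    // on retry.
    static class TransientZkException extends Exception {}

    // Retry a bounded number of times, then rethrow ("Maxed out ZK
    // retries. Giving up!") so callers can decide what to do.
    static <T> T runWithRetries(Callable<T> action, int maxRetries) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (TransientZkException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: propagate to caller
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Succeeds on the third attempt; a permanently failing action
        // would instead rethrow after maxRetries attempts.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) throw new TransientZkException();
            return "created";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```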