[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.1 First patch based on LabelBasedScheduling design document Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Labels: (was: patch) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077435#comment-14077435 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1 against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4467//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077440#comment-14077440 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1 against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4468//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077443#comment-14077443 ] Hadoop QA commented on YARN-611: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658363/YARN-611.4.rebase.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4466//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4466//console This message is automatically generated. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch, YARN-611.4.rebase.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time used to decide when an AM is well behaved and it's safe to reset its failure count back to zero. Every time an AM fails, the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is >= max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. 
Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail, if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
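For illustration only, here is a minimal sketch of the window check proposed above; it is not the code in YARN-611.4.rebase.patch, and the class and method names (FailureWindowTracker, recordFailureAndCheck) are invented for the example, assuming "fail the app only when max-retries or more failures land inside the window":
{code}
import java.util.ArrayDeque;
import java.util.Deque;

public class FailureWindowTracker {
  private final int maxRetries;            // yarn.resourcemanager.am.max-retries
  private final long retryCountWindowMs;   // proposed yarn.resourcemanager.am.retry-count-window-ms
  private final Deque<Long> failureTimes = new ArrayDeque<>();

  public FailureWindowTracker(int maxRetries, long retryCountWindowMs) {
    this.maxRetries = maxRetries;
    this.retryCountWindowMs = retryCountWindowMs;
  }

  /** Record an AM failure; return true if the application should be failed. */
  public synchronized boolean recordFailureAndCheck(long nowMs) {
    failureTimes.addLast(nowMs);
    // Forget failures that happened outside the window, i.e. "reset" the count.
    while (!failureTimes.isEmpty()
        && nowMs - failureTimes.peekFirst() > retryCountWindowMs) {
      failureTimes.removeFirst();
    }
    // Fail only if maxRetries (or more) failures occurred within the window.
    return failureTimes.size() >= maxRetries;
  }
}
{code}
In this shape, one tracker per application would be consulted from the attempt-failed transition, and a window of Long.MAX_VALUE reproduces today's never-reset behavior.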
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077453#comment-14077453 ] Gera Shegalov commented on YARN-796: Hi [~yufeldman], thanks for posting the patch. Please rebase it since it no longer applies. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077458#comment-14077458 ] Yuliya Feldman commented on YARN-796: - Yes, noticed - will repost in a moment Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.2 Patch to comply with svn Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.1) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077481#comment-14077481 ] Kenji Kikushima commented on YARN-2215: --- Thanks for the comments, [~leftnoteasy]. I will also make a patch for CLI support, but it doesn't depend on REST's dao. To make this clearer, may I divide this JIRA into REST support and CLI support? Add preemption info to REST/CLI --- Key: YARN-2215 URL: https://issues.apache.org/jira/browse/YARN-2215 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Kenji Kikushima Attachments: YARN-2215.patch As discussed in YARN-2181, we'd better add preemption info to the RM RESTful API/CLI so that administrators/users can better understand preemption that happened on an app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.2) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.3 Rebased from trunk Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Attachment: YARN-2368.patch ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 -- This message was sent by Atlassian JIRA (v6.2#6252)
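As a side note on the 1MB limit discussed in this issue: jute.maxbuffer is a ZooKeeper system property that has to be raised consistently on both the ZooKeeper servers and the ResourceManager's client JVM (e.g. via -Djute.maxbuffer=...) if larger znodes are to be allowed. The sketch below is illustrative only - a hypothetical size guard before a state-store write - and is not necessarily what YARN-2368.patch does:
{code}
// Hypothetical guard, for illustration only: refuse to write znode data that
// would exceed ZooKeeper's default 1 MB jute.maxbuffer limit, so the state
// store fails with a clear error instead of a ConnectionLoss.
public final class ZnodeSizeGuard {
  // Default ZooKeeper limit is 1 MB; leave a little headroom for framing.
  private static final int DEFAULT_MAX_ZNODE_BYTES = 1024 * 1024 - 64 * 1024;

  private ZnodeSizeGuard() {
  }

  public static void checkSize(String path, byte[] data) {
    if (data != null && data.length > DEFAULT_MAX_ZNODE_BYTES) {
      throw new IllegalArgumentException("Data for znode " + path + " is "
          + data.length + " bytes, which exceeds the " + DEFAULT_MAX_ZNODE_BYTES
          + " byte limit; refusing the state-store write");
    }
  }
}
{code}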
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077580#comment-14077580 ] Junping Du commented on YARN-2209: -- Thanks [~zjshen] for the additional details! I had similar comments above, but [~jianhe] mentioned that RESYNC is only used for the RM restart work, which hasn't been released as a completed feature. However, I checked our previous releases: since 2.2 (maybe earlier), AM_RESYNC and AM_SHUTDOWN have already been part of a public API that customers' applications could use. In this case, our changes here could break those applications - previously an AM would see Resource Manager doesn't recognize AttemptId: ... when the RM restarts (even without work-preserving restart), but now it would see something like Could not contact RM after ... milliseconds. which sounds misleading. Maybe we should consider a more compatible approach, i.e. add a new API to ApplicationMasterProtocol that throws exceptions instead of returning AMCommand, while the old API remains supported for backward compatibility. Thoughts? Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077593#comment-14077593 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658377/YARN-796.patch.3 against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/4470//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4470//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4470//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4470//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077616#comment-14077616 ] Hadoop QA commented on YARN-2368: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658387/YARN-2368.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4471//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4471//console This message is automatically generated. ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile, ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile, ZooKeeps log shows as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at
[jira] [Created] (YARN-2369) Environment variable handling assumes values should be appended
Jason Lowe created YARN-2369: Summary: Environment variable handling assumes values should be appended Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe When processing environment variables for a container context the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc. but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077729#comment-14077729 ] Jason Lowe commented on YARN-2369: -- The code in question is in org.apache.hadoop.yarn.util.Apps#addToEnvironment:
{code}
public static void addToEnvironment(
    Map<String, String> environment,
    String variable, String value, String classPathSeparator) {
  String val = environment.get(variable);
  if (val == null) {
    val = value;
  } else {
    val = val + classPathSeparator + value;
  }
  environment.put(StringInterner.weakIntern(variable),
      StringInterner.weakIntern(val));
}
{code}
This has very surprising results for any variable that isn't path-like. For example, we ran across a MapReduce job that had something like this in its environment settings: yarn.app.mapreduce.am.env='JAVA_HOME=/inst/jdk,JAVA_HOME=/inst/jdk' Rather than ending up with JAVA_HOME=/inst/jdk as one would expect, JAVA_HOME instead was set to /inst/jdk:/inst/jdk, which completely broke the job. It seems to me that we should either use a whitelist of variables that support appending or never append settings. For the latter case, if users desire values to be appended, they can ask for it explicitly in their variable settings, using one of these forms depending upon whether they want client-side or container-side environment variable expansion:
{noformat}
PATH='$PATH:/my/extra/path'
PATH='{{PATH}}:/my/extra/path'
{noformat}
Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe When processing environment variables for a container context, the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc., but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.2#6252)
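To make the first option concrete, here is a minimal sketch of what a whitelist-based merge could look like; it is illustrative only, and the class name and whitelist contents are assumptions for the example, not Hadoop code. With this behavior the JAVA_HOME example above would end up as /inst/jdk.
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class EnvMerge {
  // Assumed whitelist of variables with path-like (appendable) semantics.
  private static final Set<String> PATH_LIKE = new HashSet<>(Arrays.asList(
      "PATH", "CLASSPATH", "LD_LIBRARY_PATH", "HADOOP_CLASSPATH"));

  private EnvMerge() {
  }

  public static void addToEnvironment(Map<String, String> env,
      String variable, String value, String separator) {
    String existing = env.get(variable);
    if (existing == null || !PATH_LIKE.contains(variable)) {
      // Overwrite: the last setting wins for ordinary variables like JAVA_HOME.
      env.put(variable, value);
    } else {
      // Append only for variables with path-like semantics.
      env.put(variable, existing + separator + value);
    }
  }
}
{code}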
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077736#comment-14077736 ] Allen Wittenauer commented on YARN-2369: Post-HADOOP-9902, I wonder if it would be worthwhile to have this configured at the shell level, i.e., have something that both the Java code and the shell could read that would list the semantics of each known/important shell var. This way both could be smarter about overwrite vs. append vs. dedupe. Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe When processing environment variables for a container context the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc. but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2370) Fix comment more accurate in AppSchedulingInfo.java
Wenwu Peng created YARN-2370: Summary: Fix comment more accurate in AppSchedulingInfo.java Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial In the allocateOffSwitch method of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077797#comment-14077797 ] Zhijie Shen commented on YARN-2209: --- [~djp], thanks for sharing your idea. bq. However, I checked our previous releases: since 2.2 (maybe earlier), AM_RESYNC and AM_SHUTDOWN have already been part of a public API that customers' applications could use. I think AMCommand has been in the codebase since 2.1. I think [~jianhe] meant that the new logic for the RESYNC case was committed only recently. bq. i.e. add a new API to ApplicationMasterProtocol that throws exceptions instead of returning AMCommand, while the old API remains supported for backward compatibility. IMHO, that sounds like an overcorrection for code refactoring work. I think the essential problem here is whether throwing a new sub-exception that may not have been handled before is an acceptable incompatible change, and therefore whether it is worth trading that for code refactoring. Thoughts? Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2360: - Attachment: YARN-2360-v2.txt Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077982#comment-14077982 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658443/YARN-2360-v2.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4472//console This message is automatically generated. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078046#comment-14078046 ] Jian He commented on YARN-2209: ---
{code}
try {
  ams.allocate(...);
} catch (Exception e) {
  ams.finishApplicationMaster(...);
}
if (response is shutdown/resync) {
  // cleanup and reboot
  ...
}
{code}
The example you mentioned here will continue to work because ams.finishApplicationMaster won't be able to go through. Thus, the AM container eventually gets killed and the RM will still retry this application. So the main point is: regardless of how an application was previously handling the AMCommand, it should continue to work with this change. Existing YARN applications will not break because of this. In fact, I think AMCommand#shutdown itself is a nondeterministic command, because the AM may get killed before it can do anything to process this command. Applications should not rely on this command. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
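For illustration of what the exception-based contract looks like from the AM side, a minimal sketch follows. It assumes the resync case surfaces as ApplicationMasterNotRegisteredException (introduced by YARN-1365); the ResyncAwareHeartbeat class and its reRegister() helper are placeholders invented here, not code from the attached patches.
{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.exceptions.ApplicationMasterNotRegisteredException;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class ResyncAwareHeartbeat {
  private final AMRMClient<AMRMClient.ContainerRequest> amClient;

  public ResyncAwareHeartbeat(AMRMClient<AMRMClient.ContainerRequest> amClient) {
    this.amClient = amClient;
  }

  /** One heartbeat of the AM's allocate loop. */
  public void heartbeat(float progress) throws IOException, YarnException {
    try {
      AllocateResponse response = amClient.allocate(progress);
      // ... hand allocated containers / completed statuses to the application ...
    } catch (ApplicationMasterNotRegisteredException e) {
      // RM was restarted: instead of reacting to AMCommand.RESYNC, the AM now
      // sees this exception and should re-register and resend its requests.
      reRegister();
    }
  }

  private void reRegister() {
    // Placeholder: call registerApplicationMaster(...) again and replay any
    // outstanding ContainerRequests.
  }
}
{code}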
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078051#comment-14078051 ] Karthik Kambatla commented on YARN-2328: Had an offline discussion with Sandy. The approach of having a single background-tasks thread would adversely affect continuous scheduling, as it would be gated on a sleep and an often-longer update. I will go ahead and commit yarn-2328-2.patch, which Sandy already +1ed. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
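For readers following along, a minimal sketch (assumed names, not the actual FairScheduler code) of the pattern being committed: a dedicated thread per background task that extends Thread and exits cleanly when interrupted, so stop() can interrupt and join it deterministically:
{code}
// Each background task gets its own thread, so continuous scheduling is not
// gated behind the update loop's sleep.
class UpdateThread extends Thread {
  private final long updateIntervalMs;

  UpdateThread(long updateIntervalMs) {
    this.updateIntervalMs = updateIntervalMs;
    setName("FairSchedulerUpdateThread");  // assumed name, for illustration
    setDaemon(true);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(updateIntervalMs);
        update();  // placeholder for the periodic scheduler work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // restore flag and fall out of loop
      }
    }
  }

  private void update() { /* recompute fair shares, preemption needs, etc. */ }
}
{code}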
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078048#comment-14078048 ] Craig Welch commented on YARN-1994: --- It used to be the case that the server's InetSocketAddress would be the hostname, but now it can be 0.0.0.0 in bind-host cases, so it can no longer be obtained the way it used to be before the change. There might be a simpler way to do it than the one here, but what is here does appear to work, and I think it's as good an approach as any - this service is not quite the same as the others in terms of how this part of the code works. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2328. Resolution: Fixed Committed yarn-2328-2.patch to trunk and branch-2. Thanks for the review and offline discussions, Sandy. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078066#comment-14078066 ] Craig Welch commented on YARN-1994: --- On further investigation, the getCanonicalHostname() code a little further down is what is responsible for making this section work - and it should work with any valid bind-host configuration (as it needs to be something on the host which is a valid listening specification, or it will have already failed during the server setup). I think this change is also unnecessary; I'm going to take it out and test to verify. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
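As a rough illustration of the behavior being relied on above (a sketch under assumed names, not the actual Hadoop code): when the bound address is the wildcard, the advertised name falls back to the host's canonical name, which is valid for any bind-host configuration:
{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Sketch: derive a client-usable host name from a (possibly wildcard) bind.
final class HostNames {
  static String advertisedHost(InetSocketAddress bound) throws UnknownHostException {
    InetAddress addr = bound.getAddress();
    if (addr != null && addr.isAnyLocalAddress()) {
      // Bound to 0.0.0.0: fall back to the machine's canonical host name,
      // since the listen address itself is useless to clients
      return InetAddress.getLocalHost().getCanonicalHostName();
    }
    return bound.getHostName();
  }
}
{code}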
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078072#comment-14078072 ] Hudson commented on YARN-2328: -- FAILURE: Integrated in Hadoop-trunk-Commit #5982 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5982/]) YARN-2328. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1614432) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1994: -- Attachment: YARN-1994.12.patch Yup, that change is not needed, nice catch Arpit. Attached is .12 without it. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078195#comment-14078195 ] Arpit Agarwal commented on YARN-1994: - +1 for the latest patch, thanks for all the patch iterations Craig! :-) Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078240#comment-14078240 ] Milan Potocnik commented on YARN-1994: -- Hi [~cwelch], [~arpitagarwal], I think some clarification is needed here. The initial reason we wanted to introduce the _BIND_HOST options was to provide deterministic behavior when clients try to connect to a service endpoint which is listening on all interfaces (0.0.0.0). In short, _BIND_HOST is what services use to bind, and _ADDRESS is what clients should use to connect. This way, everything is deterministic. With the default implementation, calls to conf.updateConnectAddress for the 0.0.0.0 address would eventually call InetSocketAddress.getHostName(). In multi-NIC environments, this can introduce non-deterministic behavior. Imagine you have DNS entries for each of the network interfaces, and although you bind your service endpoint to all of them, you want users to use a specific one (for instance, InfiniBand for better performance). InetSocketAddress.getHostName() will return just the machine's hostname, which will usually resolve to some arbitrary network interface of the service when the client resolves it. Although the service binds to 0.0.0.0, some interfaces might be blocked by a firewall. This is why, besides RPCUtil.getSocketAddress, we also need RPCUtil.updateConnectAddr to explicitly specify the connect address which clients should use, i.e. a DNS entry pointing to a specific interface. There are also two cases in the code where the current implementation does not work in multi-NIC environments, which we fixed: MRClientService and TaskAttemptListenerImpl, where we had to propagate the NM hostname through the context, which is set in ContainerManagerImpl via NodeId from YarnConfiguration.NM_ADDRESS (the logic Arpit mentioned in the comment). Please have a look at patch version 5 for easier understanding. Hope this clarifies the initial idea. Thanks, Milan Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078272#comment-14078272 ] Craig Welch commented on YARN-1994: --- Milan, at present multi-homing only works if all interfaces on a box have the same name; which address a client will connect to has to be managed by controlling what that name resolves to for a particular class of clients. This is necessary because there are cases where links and redirects are generated based on names, and for this to operate for all clients on all networks, the names for the Hadoop hosts must be the same everywhere. For it to work in any other way would require logic to use a particular name depending on the source network of the client when generating links, and that is not in the current scope (there would be other complexity around managing multiple names for the same host/service as well, which would be problematic). Since the only way for multi-homing to work properly at this point is for the host to have the same name on all networks it can be accessed from, the additional logic is unnecessary - when properly configured, it will always return the same name. The same is true for the container manager impl, the mrclient service, task attempt, etc. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078307#comment-14078307 ] Hadoop QA commented on YARN-1994: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658462/YARN-1994.12.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4473//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4473//console This message is automatically generated. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078337#comment-14078337 ] Craig Welch commented on YARN-1994: --- [~mipoto] Trying to understand the scenario here - is it the case that you have a host with multiple interfaces, where some resolve on the host to a different name from the actual host name, but for some clients on other networks it resolves to the same name as the box (so, to a different name than the box sees for its own interface)? Do I understand properly that the basic issue is that there is an interface on the box with a name different from what you want to use based on the address, but you do somehow want to bind to that name and will be able to use it from a client somewhere? For that to happen, the client would need to resolve the name differently and/or have a different yarn config with a different address in it; is that what is happening? Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2367) Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-2367. -- Resolution: Not a Problem Hi Swapnil, The Fair Scheduler supports this through a different interface: scheduling policies can be configured at any queue level in the hierarchy. In general, the FIFO scheduler lacks most of the advanced functionality of the Fair and Capacity schedulers. My opinion is that achieving parity is a non-goal. If you think this shouldn't be the case, feel free to reopen this JIRA under a name like "Support multi-resource scheduling in the FIFO scheduler" and we can discuss whether that's worth embarking on. Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler --- Key: YARN-2367 URL: https://issues.apache.org/jira/browse/YARN-2367 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.2.0, 2.3.0, 2.4.1 Reporter: Swapnil Daingade Priority: Minor The ResourceCalculator used by CapacityScheduler is read from the configuration entry yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml. This allows for custom implementations that implement the ResourceCalculator interface to be plugged in. It would be nice to have the same functionality in FairScheduler and FifoScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
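For reference, the per-queue policies Sandy mentions are set in the Fair Scheduler allocation file; a small illustrative fragment (queue names invented here) might look like:
{code}
<allocations>
  <queue name="analytics">
    <!-- DRF gives multi-resource (memory + CPU) fairness for this queue -->
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <queue name="batch">
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>
{code}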
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078412#comment-14078412 ] Milan Potocnik commented on YARN-1994: -- [~cwelch] I'll try to explain one of the use cases. Let's say we have the following interfaces in our network: - 1 Ethernet, public network - 2 IB, private network. Please note that on Windows, IB does not support teaming. On the DNS server, the entry for a machine's hostname can resolve to any of the three interfaces (for each 'hostname' entry, three IP addresses). We also add a special DNS entry for each machine that resolves only to the two IB interfaces, let's say in the form 'hostname-IB'. Use case 1: We want internal communication in the cluster to always use IB. We also want to be fault tolerant if one of the IB interfaces fails (remember, no teaming on Windows). In order to bind to both IB interfaces, we must set the bind address to 0.0.0.0. When this is set, clients will currently get the hostname when connecting, which in some cases (the DNS server usually returns IPs round-robin) will resolve to the Ethernet IP address, which could be blocked by a firewall or might degrade performance of internal communication. By setting _BIND_HOST to 0.0.0.0 and _ADDRESS to 'hostname-IB' we avoid the non-determinism of InetSocketAddress.getHostName(). For outside clients we can also control connectivity by making sure they connect via the public network, but this is a simpler problem, since they would use a different DNS server. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
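Concretely, for the RM endpoint this use case maps onto a configuration like the following (an illustrative sketch: 8032 is the default RM client port, and 'hostname-IB' stands for the IB-only DNS entry described above):
{code}
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>  <!-- server listens on all interfaces, incl. both IB -->
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hostname-IB:8032</value>  <!-- clients connect via the IB-only DNS entry -->
</property>
{code}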
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.4.patch ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078625#comment-14078625 ] Tsuyoshi OZAWA commented on YARN-2328: -- {quote} Had an offline discussion with Sandy. This approach of having a single background-tasks-thread would adversely affect continuous scheduling as that would be gated on a sleep and often-longer update. {quote} Checked preview patch. It makes sense to me. Thanks for taking this issue, Karthik. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078648#comment-14078648 ] Jian He commented on YARN-2354: --- Hi Li, the test failures are different from the previous ones; can you make sure these failures are not related to this patch? Thanks! DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue:
{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainersToRequest);
{code}
numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
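For clarity, the fix the description implies touches only the last line of that snippet (a sketch; the committed patch may differ):
{code}
// Count containers the previous AM attempt already requested, so a
// restarted AM does not re-request them and over-allocate:
numRequestedContainers.set(numTotalContainers);
{code}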
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.3) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch4 Fixing failed Test, FindBugs and JavaDocs Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2354: Attachment: YARN-2354-072914.patch Uploading the same patch one more time to see if the unknown host exception reappears. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue:
{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainersToRequest);
{code}
numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078685#comment-14078685 ] Craig Welch commented on YARN-1994: --- Hmm, I think I understand the scenario. As I understand it, this will only be a problem if you want to refer to the host by a value other than the main hostname, e.g. something other than what would be returned from InetAddress.getLocalHost().getHostName() - as that is what it will come back with in the bind-host 0.0.0.0 case. (It's not that the results are indeterminate - it returns the primary host name; it's that you want to use something other than the primary host name.) There are many places in the code where this convention is used to determine the name of the host - it's an implicit convention at least, and I'm concerned that there will be cases where this mismatch causes issues, as evidenced by all of the various things which needed to be tracked down, etc. Also, this is only realistic for single-address services (or rather for cases where addresses are enumerated, including single addresses or an HA resource manager, etc.); others will not have an address in a centralized configuration to override them. Instead of this, could you use the primary hostname and control what address it uses through name resolution? Inside the cluster the hostname should resolve to only the InfiniBand addresses; that way only those (the two IB addresses) will be used. From external networks you can set the resolution to the Ethernet address as you mentioned above. This way, the host is always referred to by the same name, clients are always able to get to it over the desired interface, and no changes to the connect address logic are required. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078708#comment-14078708 ] Hudson commented on YARN-2328: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1820 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1820/]) YARN-2328. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1614432) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078714#comment-14078714 ] Hadoop QA commented on YARN-2212: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658522/YARN-2212.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4474//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4474//console This message is automatically generated. ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078727#comment-14078727 ] Junping Du commented on YARN-2209: -- bq. I think the essential problem here is whether throwing a new sub-exception which may not have been handled before is an acceptable incompatible change, and therefore whether it is worth trading that for code refactoring. Thoughts? I think it depends on whether any behavior gets changed (code functionality, logged exceptions, etc.) on the old client side. If the old client isn't aware of this new sub-exception, it still catches it as the parent exception and runs the same logic, which seems fine. Unfortunately, however, I'm concerned the case here doesn't belong to that category. Let's assume a user's AM copies code from RMContainerAllocator; it treats YarnException and RESYNC/SHUTDOWN in the response differently (with different exceptions to indicate different reasons). Now, the same condition that previously caused RESYNC/SHUTDOWN gets thrown as a YarnException (though a new sub-exception) by the new RM, and will be handled the same as other causes. So the old application can only see one exception, which is inconsistent behavior. bq. So the main point is: regardless of how an application was handling the AMCommand earlier, it should continue to work with this change. Existing YARN applications will not break because of this. IMO, breaking an application doesn't mean breaking the app's functionality only; an application behaving inconsistently on the same exception belongs to this case as well. In general, we may think the latter doesn't sound as serious as the former, but it affects the user's experience (especially if they have met the exception before). It may be worth it for a serious bug fix or a significant feature, but we should have more justification for code-refactoring work. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that an application should re-register on RM restart. We should do the same for the AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078740#comment-14078740 ] Hadoop QA commented on YARN-2354: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658544/YARN-2354-072914.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4476//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4476//console This message is automatically generated. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue:
{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainersToRequest);
{code}
numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResourceManagers threw out STATE_STORE_OP_FAILED events and finally failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default limit of the ZooKeeper server and client, set via 'jute.maxbuffer'. The ResourceManager (IP addr: 10.153.80.8) log shows the following:
{code}
2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}
Meanwhile, the ZooKeeper log shows the following:
{code}
2014-07-25 22:10:09,728 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006
2014-07-25 22:10:09,730 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890
2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
2014-07-25 22:10:09,743 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.153.80.8:58890 which had sessionid 0x247684586e70006
... ...
2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
{code}
was: Both ResourceManagers threw out STATE_STORE_OP_FAILED events and finally failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default limit of the ZooKeeper server and client, set via 'jute.maxbuffer'. The ResourceManager log shows the following: 2014-07-25 22:33:11,078 INFO
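For anyone hitting this, the 1MB ceiling is ZooKeeper's jute.maxbuffer system property, which would have to be raised consistently on both the ZooKeeper servers and the RM's ZooKeeper client JVM; an illustrative (not recommended-as-is) override:
{code}
# ZooKeeper server side, e.g. via zookeeper-env.sh:
export JVMFLAGS="-Djute.maxbuffer=4194304"

# ResourceManager (ZK client) side, e.g. in yarn-env.sh:
export YARN_RESOURCEMANAGER_OPTS="-Djute.maxbuffer=4194304"
{code}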
[jira] [Updated] (YARN-2370) Fix comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2370: - Attachment: YARN-2370.0.patch This is only a minor update. Thanks in advance for the review. Fix comment more accurate in AppSchedulingInfo.java --- Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch The allocateOffSwitch method of AppSchedulingInfo.java only updates the OffRack request, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078804#comment-14078804 ] Tsuyoshi OZAWA commented on YARN-2368: -- Thanks for your contribution, [~breno.leitao]! Could you explain the conditions under which you faced this problem? If we face this problem very often, it's a critical problem of ZKRMStateStore. However, the data stored in ZKRMStateStore is basically small. Therefore, I think it's strange that this kind of problem appears. Additionally, if the max data size is fixed, we should set the default value so that we do not face this problem. ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResourceManagers threw out STATE_STORE_OP_FAILED events and finally failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default limit of the ZooKeeper server and client, set via 'jute.maxbuffer'. The ResourceManager (IP addr: 10.153.80.8) log shows the following: {code} 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} Meanwhile, the ZooKeeper log shows the following: {code} 2014-07-25 22:10:09,728 [myid:1] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006 2014-07-25 22:10:09,730 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890 2014-07-25 22:10:09,742 [myid:1] - WARN
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078807#comment-14078807 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658538/YARN-796.patch4 against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.TestMRWithDistributedCache org.apache.hadoop.mapred.TestJobClientGetJob org.apache.hadoop.mapred.TestLocalModeWithNewApis org.apache.hadoop.mapreduce.TestMapReduce org.apache.hadoop.mapreduce.lib.input.TestLineRecordReaderJobs org.apache.hadoop.mapred.jobcontrol.TestLocalJobControl org.apache.hadoop.mapred.TestJobCounters org.apache.hadoop.mapred.TestLocalMRNotification org.apache.hadoop.mapred.lib.TestDelegatingInputFormat org.apache.hadoop.mapred.TestReduceFetch org.apache.hadoop.mapreduce.TestMapReduceLazyOutput org.apache.hadoop.mapreduce.lib.join.TestJoinProperties org.apache.hadoop.mapred.lib.TestMultithreadedMapRunner org.apache.hadoop.mapred.TestClusterMRNotification org.apache.hadoop.mapreduce.v2.TestMRAppWithCombiner org.apache.hadoop.mapreduce.lib.chain.TestSingleElementChain org.apache.hadoop.mapreduce.TestMapperReducerCleanup org.apache.hadoop.mapreduce.security.TestBinaryTokenFile org.apache.hadoop.mapreduce.v2.TestMRJobsWithProfiler org.apache.hadoop.fs.TestFileSystem org.apache.hadoop.mapreduce.TestLargeSort org.apache.hadoop.mapred.join.TestDatamerge org.apache.hadoop.mapreduce.lib.input.TestMultipleInputs org.apache.hadoop.mapred.TestLazyOutput org.apache.hadoop.mapred.TestTaskCommit org.apache.hadoop.mapreduce.TestMRJobClient org.apache.hadoop.mapreduce.security.TestMRCredentials org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers org.apache.hadoop.mapred.lib.TestChainMapReduce org.apache.hadoop.mapreduce.lib.fieldsel.TestMRFieldSelection org.apache.hadoop.mapreduce.lib.partition.TestMRKeyFieldBasedComparator org.apache.hadoop.mapreduce.lib.db.TestDataDrivenDBInputFormat org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath org.apache.hadoop.mapreduce.v2.TestMRJobs org.apache.hadoop.mapred.TestMapRed org.apache.hadoop.mapred.lib.TestKeyFieldBasedComparator 
org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat org.apache.hadoop.mapreduce.v2.TestNonExistentJob org.apache.hadoop.mapreduce.lib.input.TestDelegatingInputFormat org.apache.hadoop.mapred.TestMiniMRChildTask org.apache.hadoop.fs.slive.TestSlive org.apache.hadoop.mapred.TestComparators org.apache.hadoop.mapreduce.v2.TestUberAM org.apache.hadoop.mapred.TestMiniMRClasspath org.apache.hadoop.mapred.TestMapOutputType org.apache.hadoop.mapreduce.lib.output.TestJobOutputCommitter
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078805#comment-14078805 ] Tsuyoshi OZAWA commented on YARN-2368: -- [~breno.leitao], oops, sorry for mentioning you by mistake. I meant to mention [~guoleitao]. ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResourceManagers threw STATE_STORE_OP_FAILED events and eventually failed. The ZooKeeper log shows that ZKRMStateStore tried to update a znode larger than 1MB, the default 'jute.maxbuffer' limit for both the ZooKeeper server and client. The ResourceManager (ip addr: 10.153.80.8) log shows the following: {code} 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} Meanwhile, the ZooKeeper log shows the following: {code} 2014-07-25 22:10:09,728 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 2014-07-25 22:10:09,730
[myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006 2014-07-25 22:10:09,730 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 2014-07-25 22:10:09,743 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client
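Regarding the YARN-2368 logs above: the "Len error" is ZooKeeper rejecting a packet larger than its jute.maxbuffer limit (1 MB by default). Purely as a hedged illustration of that limit, and not the fix proposed in the attached YARN-2368.patch, the limit can be raised by setting the jute.maxbuffer system property on both the ZooKeeper server JVM and the client JVM (here, the ResourceManager). The 4 MB value below is an arbitrary example.
{code}
// Hedged sketch: raising jute.maxbuffer is only a workaround, not the patch attached to YARN-2368.
// ZooKeeper typically reads this property once at class-loading time, so in practice it is passed
// as a JVM flag, e.g. -Djute.maxbuffer=4194304, to both the ZooKeeper server and the ResourceManager.
public class JuteMaxBufferWorkaround {
  public static void main(String[] args) {
    // Must run before any ZooKeeper client classes are loaded in this JVM.
    System.setProperty("jute.maxbuffer", String.valueOf(4 * 1024 * 1024));
    System.out.println("jute.maxbuffer set to " + System.getProperty("jute.maxbuffer") + " bytes");
  }
}
{code}
Even with a larger buffer, the server side enforces its own limit, so the property has to match on both ends; the underlying growth of the stored attempt data is what this JIRA tracks.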
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Attachment: hadoop_job_suspend_resume.patch Hadoop Job Suspend and Resume svn patch file Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Attachments: Hadoop Job Suspend Resume Design.docx, hadoop_job_suspend_resume.patch Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid and convenient way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Attachments: Hadoop Job Suspend Resume Design.docx, hadoop_job_suspend_resume.patch Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid and convenient way. -- This message was sent by Atlassian JIRA (v6.2#6252)
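The design document and svn patch attached to YARN-2172 are not reproduced in this digest. Purely as a hedged illustration of the suspend semantics described above (already-running containers keep running, no new containers are requested while suspended, and requests resume afterwards), a hypothetical AM-side gate might look like the following; all names here are invented for the sketch and are not taken from hadoop_job_suspend_resume.patch.
{code}
// Hypothetical sketch only; not code from hadoop_job_suspend_resume.patch.
// While suspended, the job requests no new containers but leaves running ones alone;
// on resume, it again asks for whatever is still missing.
import java.util.concurrent.atomic.AtomicBoolean;

public class SuspendAwareAllocationPolicy {
  private final AtomicBoolean suspended = new AtomicBoolean(false);

  public void suspend() { suspended.set(true); }
  public void resume()  { suspended.set(false); }

  /** Called on each AM heartbeat: how many new containers to request right now. */
  public int newContainersToRequest(int totalNeeded, int runningOrAlreadyRequested) {
    if (suspended.get()) {
      return 0; // suspended: keep existing containers, ask for nothing new
    }
    return Math.max(0, totalNeeded - runningOrAlreadyRequested);
  }
}
{code}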
[jira] [Commented] (YARN-2370) Fix comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078840#comment-14078840 ] Hadoop QA commented on YARN-2370: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658574/YARN-2370.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService org.apache.hadoop.yarn.server.resourcemanager.TestRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4477//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4477//console This message is automatically generated. Fix comment more accurate in AppSchedulingInfo.java --- Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch in method allocateOffSwitch of AppSchedulingInfo.java, only invoke update OffRack request, the comment should be Update cloned OffRack requests for recovery not Update cloned RackLocal and OffRack requests for recover {code} // Update cloned RackLocal and OffRack requests for recovery resourceRequests.add(cloneResourceRequest(offSwitchRequest)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2370) Made comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2370: - Summary: Made comment more accurate in AppSchedulingInfo.java (was: Fix comment more accurate in AppSchedulingInfo.java) Made comment more accurate in AppSchedulingInfo.java Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch In the allocateOffSwitch method of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should read "Update cloned OffRack requests for recovery" rather than "Update cloned RackLocal and OffRack requests for recovery". {code} // Update cloned RackLocal and OffRack requests for recovery resourceRequests.add(cloneResourceRequest(offSwitchRequest)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
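In other words, YARN-2370 is a one-line comment correction. A minimal before/after sketch follows, with the surrounding method body omitted; the wording is taken from the description above, and the exact text in YARN-2370.0.patch may differ slightly.
{code}
// Before (misleading: only the off-switch request is cloned at this point):
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));

// After (as proposed for AppSchedulingInfo#allocateOffSwitch):
// Update cloned OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}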
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078849#comment-14078849 ] Jian He commented on YARN-2354: --- looks good, checking this in. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue. {code} int numTotalContainersToRequest = numTotalContainers - previousAMRunningContainers.size(); for (int i = 0; i < numTotalContainersToRequest; ++i) { ContainerRequest containerAsk = setupContainerAskForRM(); amRMClient.addContainerRequest(containerAsk); } numRequestedContainers.set(numTotalContainersToRequest); {code} numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
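For context, a minimal sketch of the corrected bookkeeping follows, based only on the snippet and the last sentence of the YARN-2354 description above; field and method names are as they appear in that snippet, and the change actually committed may differ in detail.
{code}
// Sketch of the fix described above, not the committed diff itself.
// Ask only for the containers the previous attempt did not leave running, but record the
// full target in numRequestedContainers so a restarted AM does not over-allocate.
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainers); // previously set to numTotalContainersToRequest
{code}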
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078851#comment-14078851 ] Li Lu commented on YARN-2354: - Thanks [~jianhe]. The second time around, I think the only problem left is the persistent one, but it is unrelated to this patch. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue. {code} int numTotalContainersToRequest = numTotalContainers - previousAMRunningContainers.size(); for (int i = 0; i < numTotalContainersToRequest; ++i) { ContainerRequest containerAsk = setupContainerAskForRM(); amRMClient.addContainerRequest(containerAsk); } numRequestedContainers.set(numTotalContainersToRequest); {code} numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078859#comment-14078859 ] Hudson commented on YARN-2354: -- FAILURE: Integrated in Hadoop-trunk-Commit #5983 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5983/]) YARN-2354. DistributedShell may allocate more containers than client specified after AM restarts. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1614538) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSFailedAppMaster.java DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue. {code} int numTotalContainersToRequest = numTotalContainers - previousAMRunningContainers.size(); for (int i = 0; i < numTotalContainersToRequest; ++i) { ContainerRequest containerAsk = setupContainerAskForRM(); amRMClient.addContainerRequest(containerAsk); } numRequestedContainers.set(numTotalContainersToRequest); {code} numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2370) Made comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078870#comment-14078870 ] Hadoop QA commented on YARN-2370: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658574/YARN-2370.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4478//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4478//console This message is automatically generated. Made comment more accurate in AppSchedulingInfo.java Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch in method allocateOffSwitch of AppSchedulingInfo.java, only invoke update OffRack request, the comment should be Update cloned OffRack requests for recovery not Update cloned RackLocal and OffRack requests for recover {code} // Update cloned RackLocal and OffRack requests for recovery resourceRequests.add(cloneResourceRequest(offSwitchRequest)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)