[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546820#comment-14546820 ] Hudson commented on YARN-3526: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2145 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2145/]) YARN-3526. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster. Contributed by Weiwei Yang (xgong: rev b0ad644083a0dfae3a39159ac88b6fc09d846371) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/CHANGES.txt ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Weiwei Yang Assignee: Weiwei Yang Labels: BB2015-05-TBR Fix For: 2.7.1 Attachments: YARN-3526.001.patch, YARN-3526.002.patch On a QJM HA cluster, view RM web UI to track job status, it shows This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546799#comment-14546799 ] Hudson commented on YARN-3526: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #197 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/197/]) YARN-3526. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster. Contributed by Weiwei Yang (xgong: rev b0ad644083a0dfae3a39159ac88b6fc09d846371) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Weiwei Yang Assignee: Weiwei Yang Labels: BB2015-05-TBR Fix For: 2.7.1 Attachments: YARN-3526.001.patch, YARN-3526.002.patch On a QJM HA cluster, view RM web UI to track job status, it shows This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546800#comment-14546800 ] Hudson commented on YARN-2421: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #197 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/197/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
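For illustration only, a tiny sketch of the kind of terminal-state guard suggested in the last sentence of the description above (RMAppAttemptState is stood in for by a local enum; this is not the committed YARN-2421 change):
{code}
import java.util.EnumSet;

public class AllocateGuard {

  // stand-in for the real RMAppAttemptState enum
  enum AttemptState { LAUNCHED, RUNNING, FINISHING, FINISHED, FAILED, KILLED }

  private static final EnumSet<AttemptState> TERMINAL =
      EnumSet.of(AttemptState.FINISHING, AttemptState.FINISHED,
          AttemptState.FAILED, AttemptState.KILLED);

  /** The scheduler should only hand out new containers for live attempts. */
  static boolean mayAllocate(AttemptState state) {
    return !TERMINAL.contains(state);
  }
}
{code}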
[jira] [Updated] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3591: Assignee: Lavkesh Lahngir Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch This happens when a resource is localised on a disk and, after localisation, that disk goes bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) is called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true, but at read time the file cannot be opened. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find the inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with a length of at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
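A minimal sketch of the presence check proposed above, assuming plain java.io.File semantics (the class and method names are illustrative, not the actual NM localization code):
{code}
import java.io.File;

public final class LocalResourcePresence {
  private LocalResourcePresence() {}

  // Instead of trusting File#exists(), which can succeed from cached inode
  // data on a failed disk, list the parent directory so the OS has to
  // actually open it, then look for the resource by name.
  public static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return localizedPath.exists();
    }
    String[] entries = parent.list();
    if (entries == null || entries.length == 0) {
      // list() returns null (or an empty array) when the directory cannot
      // be read, which is the bad-disk case this check is meant to catch.
      return false;
    }
    for (String entry : entries) {
      if (entry.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}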
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546781#comment-14546781 ] Hudson commented on YARN-3526: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #187 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/187/]) YARN-3526. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster. Contributed by Weiwei Yang (xgong: rev b0ad644083a0dfae3a39159ac88b6fc09d846371) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Weiwei Yang Assignee: Weiwei Yang Labels: BB2015-05-TBR Fix For: 2.7.1 Attachments: YARN-3526.001.patch, YARN-3526.002.patch On a QJM HA cluster, view RM web UI to track job status, it shows This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546782#comment-14546782 ] Hudson commented on YARN-2421: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #187 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/187/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546780#comment-14546780 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #187 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/187/]) YARN-3505 addendum: fix an issue in previous patch. (junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546821#comment-14546821 ] Hudson commented on YARN-2421: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2145 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2145/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546753#comment-14546753 ] Devaraj K commented on YARN-3651: - It seems SSL was intentionally disabled for the MR AM; please refer to the inline comment in MRClientService.java. You can enable it for all YARN daemons and also for the job history server using the configurations. Tracking url in ApplicationCLI wrong for running application Key: YARN-3651 URL: https://issues.apache.org/jira/browse/YARN-3651 Project: Hadoop YARN Issue Type: Bug Components: applications, resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Priority: Minor The application URL shown by the application CLI is wrong. Steps to reproduce == 1. Start an HA setup in insecure mode 2. Configure HTTPS_ONLY 3. Submit an application to the cluster 4. Execute ./yarn application -list 5. Observe the tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* https://IP:64323/proxy/application_1431672734347_0003/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
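For reference, the HTTPS_ONLY policy mentioned above can also be set programmatically; a small hedged sketch, assuming the standard yarn.http.policy and mapreduce.jobhistory.http.policy keys apply to the Hadoop version in use:
{code}
import org.apache.hadoop.conf.Configuration;

public class HttpsOnlyConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Serve the RM/NM web apps over HTTPS only.
    conf.set("yarn.http.policy", "HTTPS_ONLY");
    // Do the same for the MapReduce job history server.
    conf.set("mapreduce.jobhistory.http.policy", "HTTPS_ONLY");
    System.out.println("yarn.http.policy = " + conf.get("yarn.http.policy"));
  }
}
{code}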
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546770#comment-14546770 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2127 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2127/]) YARN-3505 addendum: fix an issue in previous patch. (junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546772#comment-14546772 ] Hudson commented on YARN-2421: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2127 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2127/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546771#comment-14546771 ] Hudson commented on YARN-3526: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2127 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2127/]) YARN-3526. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster. Contributed by Weiwei Yang (xgong: rev b0ad644083a0dfae3a39159ac88b6fc09d846371) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java * hadoop-yarn-project/CHANGES.txt ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Weiwei Yang Assignee: Weiwei Yang Labels: BB2015-05-TBR Fix For: 2.7.1 Attachments: YARN-3526.001.patch, YARN-3526.002.patch On a QJM HA cluster, view RM web UI to track job status, it shows This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2748) Upload logs in the sub-folders under the local log dir when aggregating logs
[ https://issues.apache.org/jira/browse/YARN-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546792#comment-14546792 ] Varun Saxena commented on YARN-2748: [~vinodkv], IIUC the use case Sumit had, which led to the filing of YARN-2734, was that rolling log files were being backed up in a sub-folder. However, Sumit didn't get back to confirm this. Upload logs in the sub-folders under the local log dir when aggregating logs Key: YARN-2748 URL: https://issues.apache.org/jira/browse/YARN-2748 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-2748.001.patch, YARN-2748.002.patch, YARN-2748.03.patch, YARN-2748.04.patch YARN-2734 has a temporary fix to skip sub-folders to avoid an exception. Ideally, if the app is creating a sub-folder and putting its rolling logs there, we need to upload these logs as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
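A rough sketch of the kind of traversal the description calls for, assuming plain java.io.File access to the container log directory (illustrative only, not the AppLogAggregatorImpl code):
{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public final class LogFileCollector {
  private LogFileCollector() {}

  // Walk the container log dir recursively so rolled logs placed in
  // sub-folders are uploaded as well, instead of being skipped.
  public static List<File> collect(File containerLogDir) {
    List<File> result = new ArrayList<File>();
    File[] entries = containerLogDir.listFiles();
    if (entries == null) {
      return result; // directory vanished or could not be read
    }
    for (File entry : entries) {
      if (entry.isDirectory()) {
        result.addAll(collect(entry));
      } else {
        result.add(entry);
      }
    }
    return result;
  }
}
{code}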
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546723#comment-14546723 ] Hudson commented on YARN-3526: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #929 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/929/]) YARN-3526. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster. Contributed by Weiwei Yang (xgong: rev b0ad644083a0dfae3a39159ac88b6fc09d846371) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Weiwei Yang Assignee: Weiwei Yang Labels: BB2015-05-TBR Fix For: 2.7.1 Attachments: YARN-3526.001.patch, YARN-3526.002.patch On a QJM HA cluster, view RM web UI to track job status, it shows This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546722#comment-14546722 ] Hudson commented on YARN-3505: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #929 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/929/]) YARN-3505 addendum: fix an issue in previous patch. (junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546724#comment-14546724 ] Hudson commented on YARN-2421: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #929 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/929/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547040#comment-14547040 ] Raju Bairishetti commented on YARN-3644: [~sandflee] Yes, the NM should catch the exception and keep itself alive. Right now, the NM shuts itself down only in the case of connection failures. The NM ignores all other kinds of exceptions and errors while sending heartbeats.
{code}
} catch (ConnectException e) {
  // catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
} catch (Throwable e) {
  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
}
{code}
Node manager shuts down if unable to connect with RM Key: YARN-3644 URL: https://issues.apache.org/jira/browse/YARN-3644 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Srikanth Sundarrajan When the NM is unable to connect to the RM, the NM shuts itself down.
{code}
} catch (ConnectException e) {
  // catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}
In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
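As a rough illustration of the behaviour being asked for, a minimal sketch of a heartbeat loop that retries with back-off on connection failures instead of dispatching a SHUTDOWN event (names and structure are illustrative, not the actual NodeStatusUpdater code):
{code}
import java.net.ConnectException;

public class HeartbeatLoop implements Runnable {
  private volatile boolean stopped = false;

  @Override
  public void run() {
    long backoffMs = 1000;
    while (!stopped) {
      try {
        sendHeartbeat();      // stands in for the real nodeHeartbeat RPC
        backoffMs = 1000;     // reset back-off after a successful heartbeat
      } catch (ConnectException e) {
        // RM unreachable (e.g. down for maintenance): keep the NM alive and
        // retry later instead of shutting it down.
        backoffMs = Math.min(backoffMs * 2, 60000);
      } catch (Throwable t) {
        // other errors: log and keep the status-updater thread running
      }
      try {
        Thread.sleep(backoffMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  public void stop() {
    stopped = true;
  }

  private void sendHeartbeat() throws ConnectException {
    // placeholder for the real RPC call to the RM
  }
}
{code}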
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546587#comment-14546587 ] Xuan Gong commented on YARN-221: [~mingma] Thanks for working on this. I have some general comments I want to discuss with you. We could have a common interface called ContainerLogAggregationPolicy which includes at least this function: * doLogAggregationForContainer (you might need a better name). This function would be called by AppLogAggregator to check whether the log for this container needs to be aggregated. So, instead of creating an enum type ContainerLogAggregationPolicy:
{code}
AGGREGATE, DO_NOT_AGGREGATE, AGGREGATE_FAILED, AGGREGATE_FAILED_OR_KILLED
{code}
we could create some basic policies which implement the common interface ContainerLogAggregationPolicy, such as AllContainerLogAggregationPolicy, NonContainerLogAggregationPolicy, AMContainerOnlyLogAggregationPolicy, FailContainerOnlyLogAggregationPolicy, SampleRateContainerLogAggregationPolicy, etc. I think this way might be more extensible, and in the future clients can implement their own ContainerLogAggregationPolicy which can be more complex. With this, we do not need to add any new configurations on the service side.
{code}
+  public static final String LOG_AGGREGATION_SAMPLE_PERCENT = NM_PREFIX +
+      "log-aggregation.worker-sample-percent";
+  public static final float DEFAULT_LOG_AGGREGATION_SAMPLE_PERCENT = 1.0f;
+
+  public static final String LOG_AGGREGATION_AM_LOGS = NM_PREFIX +
+      "log-aggregation.am-enable";
+  public static final boolean DEFAULT_LOG_AGGREGATION_AM_LOGS = true;
{code}
can be removed. Also, instead of adding ContainerLogAggregationPolicy into the CLC, we could add ContainerLogAggregationPolicy into LogAggregationContext, which can already be accessed by the NM. Thoughts? NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow the NM to not aggregate logs in some cases and avoid connecting to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
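A hedged sketch of what such a pluggable policy could look like; the interface and class names below follow the suggestion in the comment but are illustrative, not an existing YARN API:
{code}
public interface ContainerLogAggregationPolicy {
  /** Decide whether this container's logs should be aggregated. */
  boolean shouldAggregate(String containerId, int exitCode, boolean isAmContainer);
}

/** Aggregate every container's logs (the current default behaviour). */
class AllContainerLogAggregationPolicy implements ContainerLogAggregationPolicy {
  public boolean shouldAggregate(String containerId, int exitCode, boolean isAmContainer) {
    return true;
  }
}

/** Only aggregate logs of containers that exited with a non-zero code. */
class FailedContainerOnlyLogAggregationPolicy implements ContainerLogAggregationPolicy {
  public boolean shouldAggregate(String containerId, int exitCode, boolean isAmContainer) {
    return exitCode != 0;
  }
}

/** Only aggregate the AM container's logs. */
class AMContainerOnlyLogAggregationPolicy implements ContainerLogAggregationPolicy {
  public boolean shouldAggregate(String containerId, int exitCode, boolean isAmContainer) {
    return isAmContainer;
  }
}
{code}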
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546590#comment-14546590 ] zhihai xu commented on YARN-3591: - [~lavkesh], Currently DirectoryCollection supports {{fullDirs}} and {{errorDirs}}. Neither is a good dir. IMO {{fullDirs}} is the disk which can become good again when the localized files are deleted by the cache clean-up mentioned above, and {{errorDirs}} is the corrupted disk which can't become good until somebody fixes it manually. Calling removeResource for a localized resource in {{errorDirs}} sounds reasonable to me. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch This happens when a resource is localised on a disk and, after localisation, that disk goes bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) is called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true, but at read time the file cannot be opened. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find the inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with a length of at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
[ https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3489: --- Attachment: YARN-3489-branch-2.7.03.patch RMServerUtils.validateResourceRequests should only obtain queue info once - Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3489-branch-2.7.02.patch, YARN-3489-branch-2.7.03.patch, YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, YARN-3489.03.patch Since the label support was added we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g.: large cluster with lots of varied locality in the requests) then it will get the queue info for each request. Since we build the queue info this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546711#comment-14546711 ] Hudson commented on YARN-2421: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #198 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/198/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546709#comment-14546709 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #198 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/198/]) YARN-3505 addendum: fix an issue in previous patch. (junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546710#comment-14546710 ] Hudson commented on YARN-3526: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #198 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/198/]) YARN-3526. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster. Contributed by Weiwei Yang (xgong: rev b0ad644083a0dfae3a39159ac88b6fc09d846371) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/CHANGES.txt ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Weiwei Yang Assignee: Weiwei Yang Labels: BB2015-05-TBR Fix For: 2.7.1 Attachments: YARN-3526.001.patch, YARN-3526.002.patch On a QJM HA cluster, view RM web UI to track job status, it shows This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chackaravarthy updated YARN-3561: - Attachment: application_1431771946377_0001.zip [~gsaha] / [~jianhe] Attached logs (application_1431771946377_0001.zip) with debug level enabled. It contains RM and NM logs from hosts running Slider AM and non-AM application containers. container_1431771946377_0001_01_01 - host3 - SliderAM container_1431771946377_0001_01_02 - host7 - NIMBUS container_1431771946377_0001_01_03 - host5 - STORM_UI_SERVER container_1431771946377_0001_01_04 - host3 - DRPC_SERVER container_1431771946377_0001_01_05 - host6 - SUPERVISOR *Timing of issuing the commands:* Slider start command : 2015-05-16 15:57:11,954 Slider stop command : 2015-05-16 15:59:06,480 Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Attachments: app0001.zip, application_1431771946377_0001.zip Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource
[jira] [Commented] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
[ https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546691#comment-14546691 ] Hadoop QA commented on YARN-3489: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733319/YARN-3489-branch-2.7.03.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / b0ad644 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7960/console | This message was automatically generated. RMServerUtils.validateResourceRequests should only obtain queue info once - Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3489-branch-2.7.02.patch, YARN-3489-branch-2.7.03.patch, YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, YARN-3489.03.patch Since the label support was added we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g.: large cluster with lots of varied locality in the requests) then it will get the queue info for each request. Since we build the queue info this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
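A minimal sketch of the refactoring described above, with simplified, hypothetical signatures (the real code lives in RMServerUtils/SchedulerUtils and operates on ResourceRequest objects):
{code}
import java.util.List;

class ResourceRequestValidator {

  interface Scheduler {
    QueueInfo getQueueInfo(String queueName);
  }

  static class QueueInfo {
    // stand-in for the real queue info (labels, capacities, ...)
  }

  // Fetch the queue info once, outside the per-request loop, and pass it
  // down instead of rebuilding it for every request.
  void validateResourceRequests(List<String> requests, Scheduler scheduler,
      String queueName) {
    QueueInfo queueInfo = scheduler.getQueueInfo(queueName);
    for (String request : requests) {
      validateResourceRequest(request, queueInfo);
    }
  }

  void validateResourceRequest(String request, QueueInfo queueInfo) {
    // per-request checks (labels, limits, ...) against the shared queue info
  }
}
{code}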
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547010#comment-14547010 ] Naganarasimha G R commented on YARN-3565: - Hi [~wangda], The findbugs and whitespace issues are not related to the patch. /cc [~aw], I think whitespace is currently being calculated on the diff output rather than just the modified lines (the diff has some lines before and after the modifications). git apply --whitespace=fix also passes for this patch. NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String - Key: YARN-3565 URL: https://issues.apache.org/jira/browse/YARN-3565 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Priority: Blocker Attachments: YARN-3565-20150502-1.patch, YARN-3565.20150515-1.patch, YARN-3565.20150516-1.patch Now NM HB/Register uses Set<String>; it will be hard to add new fields if we want to support specifying NodeLabel properties such as exclusivity/constraints, etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546971#comment-14546971 ] sandflee commented on YARN-3644: If the RM is down, the NM's connection will be reset by the RM machine; could we catch this exception and keep the NM alive? Node manager shuts down if unable to connect with RM Key: YARN-3644 URL: https://issues.apache.org/jira/browse/YARN-3644 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Srikanth Sundarrajan When the NM is unable to connect to the RM, the NM shuts itself down.
{code}
} catch (ConnectException e) {
  // catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}
In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)