[jira] [Commented] (YARN-2949) Add documentation for CGroups
[ https://issues.apache.org/jira/browse/YARN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253237#comment-14253237 ]

Hudson commented on YARN-2949:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #46 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/46/])
YARN-2949. Add documentation for CGroups. (Contributed by Varun Vasudev) (junping_du: rev 389f881d423c1f7c2bb90ff521e59eb8c7d26214)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerCgroups.apt.vm
* hadoop-yarn-project/CHANGES.txt
* hadoop-project/src/site/site.xml

Add documentation for CGroups
-----------------------------
Key: YARN-2949
URL: https://issues.apache.org/jira/browse/YARN-2949
Project: Hadoop YARN
Issue Type: Task
Components: documentation, nodemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.7.0
Attachments: NodeManagerCgroups.html, apache-yarn-2949.0.patch, apache-yarn-2949.1.patch

A bunch of changes have gone into the NodeManager to allow greater use of CGroups. It would be good to have a single page that documents how to set up CGroups and the controls available.
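For readers who reach this thread before the page is published, a minimal sketch of the kind of NodeManager settings it covers, assuming the usual LinuxContainerExecutor properties — the property names below are recalled from yarn-default.xml, so treat them as assumptions and defer to the committed NodeManagerCgroups page:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hedged sketch: enabling CGroups-based resource isolation in the NM.
Configuration conf = new YarnConfiguration();
// CGroups requires the LinuxContainerExecutor instead of the default executor.
conf.set(YarnConfiguration.NM_CONTAINER_EXECUTOR,
    "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor");
// Handler that creates and populates the cgroups for containers.
conf.set("yarn.nodemanager.linux-container-executor.resources-handler.class",
    "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler");
// Cgroup hierarchy the NM places container groups under.
conf.set("yarn.nodemanager.linux-container-executor.cgroups.hierarchy",
    "/hadoop-yarn");
// Assume the cgroup controllers are already mounted by the OS.
conf.setBoolean("yarn.nodemanager.linux-container-executor.cgroups.mount",
    false);
{code}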
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253229#comment-14253229 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #46 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/46/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253246#comment-14253246 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #780 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/780/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
[jira] [Commented] (YARN-2949) Add documentation for CGroups
[ https://issues.apache.org/jira/browse/YARN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253254#comment-14253254 ]

Hudson commented on YARN-2949:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #780 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/780/])
YARN-2949. Add documentation for CGroups. (Contributed by Varun Vasudev) (junping_du: rev 389f881d423c1f7c2bb90ff521e59eb8c7d26214)
* hadoop-yarn-project/CHANGES.txt
* hadoop-project/src/site/site.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerCgroups.apt.vm

Add documentation for CGroups
-----------------------------
Key: YARN-2949
URL: https://issues.apache.org/jira/browse/YARN-2949
Project: Hadoop YARN
Issue Type: Task
Components: documentation, nodemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.7.0
Attachments: NodeManagerCgroups.html, apache-yarn-2949.0.patch, apache-yarn-2949.1.patch

A bunch of changes have gone into the NodeManager to allow greater use of CGroups. It would be good to have a single page that documents how to set up CGroups and the controls available.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253440#comment-14253440 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1978 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1978/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
[jira] [Commented] (YARN-2949) Add documentation for CGroups
[ https://issues.apache.org/jira/browse/YARN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253449#comment-14253449 ]

Hudson commented on YARN-2949:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1978 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1978/])
YARN-2949. Add documentation for CGroups. (Contributed by Varun Vasudev) (junping_du: rev 389f881d423c1f7c2bb90ff521e59eb8c7d26214)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerCgroups.apt.vm
* hadoop-project/src/site/site.xml
* hadoop-yarn-project/CHANGES.txt

Add documentation for CGroups
-----------------------------
Key: YARN-2949
URL: https://issues.apache.org/jira/browse/YARN-2949
Project: Hadoop YARN
Issue Type: Task
Components: documentation, nodemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.7.0
Attachments: NodeManagerCgroups.html, apache-yarn-2949.0.patch, apache-yarn-2949.1.patch

A bunch of changes have gone into the NodeManager to allow greater use of CGroups. It would be good to have a single page that documents how to set up CGroups and the controls available.
[jira] [Commented] (YARN-2949) Add documentation for CGroups
[ https://issues.apache.org/jira/browse/YARN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253463#comment-14253463 ]

Hudson commented on YARN-2949:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #43 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/43/])
YARN-2949. Add documentation for CGroups. (Contributed by Varun Vasudev) (junping_du: rev 389f881d423c1f7c2bb90ff521e59eb8c7d26214)
* hadoop-project/src/site/site.xml
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerCgroups.apt.vm

Add documentation for CGroups
-----------------------------
Key: YARN-2949
URL: https://issues.apache.org/jira/browse/YARN-2949
Project: Hadoop YARN
Issue Type: Task
Components: documentation, nodemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.7.0
Attachments: NodeManagerCgroups.html, apache-yarn-2949.0.patch, apache-yarn-2949.1.patch

A bunch of changes have gone into the NodeManager to allow greater use of CGroups. It would be good to have a single page that documents how to set up CGroups and the controls available.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253455#comment-14253455 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #43 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/43/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253502#comment-14253502 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #47 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/47/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
[jira] [Commented] (YARN-2949) Add documentation for CGroups
[ https://issues.apache.org/jira/browse/YARN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253510#comment-14253510 ]

Hudson commented on YARN-2949:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #47 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/47/])
YARN-2949. Add documentation for CGroups. (Contributed by Varun Vasudev) (junping_du: rev 389f881d423c1f7c2bb90ff521e59eb8c7d26214)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerCgroups.apt.vm
* hadoop-project/src/site/site.xml

Add documentation for CGroups
-----------------------------
Key: YARN-2949
URL: https://issues.apache.org/jira/browse/YARN-2949
Project: Hadoop YARN
Issue Type: Task
Components: documentation, nodemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.7.0
Attachments: NodeManagerCgroups.html, apache-yarn-2949.0.patch, apache-yarn-2949.1.patch

A bunch of changes have gone into the NodeManager to allow greater use of CGroups. It would be good to have a single page that documents how to set up CGroups and the controls available.
[jira] [Commented] (YARN-2949) Add documentation for CGroups
[ https://issues.apache.org/jira/browse/YARN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253531#comment-14253531 ]

Hudson commented on YARN-2949:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1997 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1997/])
YARN-2949. Add documentation for CGroups. (Contributed by Varun Vasudev) (junping_du: rev 389f881d423c1f7c2bb90ff521e59eb8c7d26214)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerCgroups.apt.vm
* hadoop-project/src/site/site.xml
* hadoop-yarn-project/CHANGES.txt

Add documentation for CGroups
-----------------------------
Key: YARN-2949
URL: https://issues.apache.org/jira/browse/YARN-2949
Project: Hadoop YARN
Issue Type: Task
Components: documentation, nodemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.7.0
Attachments: NodeManagerCgroups.html, apache-yarn-2949.0.patch, apache-yarn-2949.1.patch

A bunch of changes have gone into the NodeManager to allow greater use of CGroups. It would be good to have a single page that documents how to set up CGroups and the controls available.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253523#comment-14253523 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1997 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1997/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
[jira] [Updated] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith updated YARN-2946:
-------------------------
Attachment: 0001-YARN-2946.patch

DeadLocks in RMStateStore-ZKRMStateStore
----------------------------------------
Key: YARN-2946
URL: https://issues.apache.org/jira/browse/YARN-2946
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java

Found one deadlock in ZKRMStateStore.
# In the initial stage, zkClient is null because of a ZK disconnected event.
# While ZKRMStateStore#runWithCheck() waits (zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} from state-machine transition events. This causes a deadlock in ZKRMStateStore.
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253688#comment-14253688 ]

Rohith commented on YARN-2946:
------------------------------

I updated the patch with the following fixes:
# All token storage is handled synchronously via the state machine.
# Removed unnecessary synchronization from the method; this ensures the 1st point.

For testing, I deployed the patch in a cluster integrated with JCarder and executed the same scenario as in my earlier comment to check for deadlock cycles. JCarder did not identify any deadlock cycles. Kindly review the patch.

DeadLocks in RMStateStore-ZKRMStateStore
----------------------------------------
Key: YARN-2946
URL: https://issues.apache.org/jira/browse/YARN-2946
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java

Found one deadlock in ZKRMStateStore.
# In the initial stage, zkClient is null because of a ZK disconnected event.
# While ZKRMStateStore#runWithCheck() waits (zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} from state-machine transition events. This causes a deadlock in ZKRMStateStore.
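For readers unfamiliar with the failure mode, a generic illustration of the kind of lock-ordering cycle JCarder flags — deliberately simplified stand-in code, not the actual RMStateStore/ZKRMStateStore implementation:

{code}
// Two threads acquire the same two monitors in opposite order; with
// unlucky timing each holds one lock and blocks forever on the other.
public class LockCycleDemo {
  private static final Object STORE_LOCK = new Object();     // stands in for the store lock
  private static final Object ZK_STORE_LOCK = new Object();  // stands in for ZKRMStateStore.this

  public static void main(String[] args) {
    new Thread(() -> {
      synchronized (STORE_LOCK) {         // state-machine transition enters the store...
        sleepQuietly(100);
        synchronized (ZK_STORE_LOCK) { }  // ...then calls down into the ZK layer
      }
    }).start();

    new Thread(() -> {
      synchronized (ZK_STORE_LOCK) {      // ZK operation holds the ZK-layer lock...
        sleepQuietly(100);
        synchronized (STORE_LOCK) { }     // ...then calls back up into the store
      }
    }).start();
  }

  private static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}

Serializing the store mutations through a single dispatcher thread, as the patch describes, removes one direction of such a cycle.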
[jira] [Updated] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated YARN-2877:
--------------------------------
Assignee: Konstantinos Karanasos

Extend YARN to support distributed scheduling
---------------------------------------------
Key: YARN-2877
URL: https://issues.apache.org/jira/browse/YARN-2877
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, resourcemanager
Reporter: Sriram Rao
Assignee: Konstantinos Karanasos

This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following:
1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines.
2. Reduce allocation latency for tasks where scheduling time dominates (i.e., task execution time is much shorter than the time required to obtain a container from the RM).
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253753#comment-14253753 ]

Chen He commented on YARN-1680:
-------------------------------

Any update on this issue? I have some free cycles recently.

availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
-------------------------------------------------------------------------------------------------------
Key: YARN-1680
URL: https://issues.apache.org/jira/browse/YARN-1680
Project: Hadoop YARN
Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
Environment: SuSE 11 SP2 + Hadoop-2.3
Reporter: Rohith
Assignee: Craig Welch
Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch

There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster slow start is set to 1. A job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers, because the headroom used in the reducer-preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but returns availableResources computed from total cluster free memory).
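To make the headroom mistake concrete, the arithmetic from the scenario above as a hedged sketch — the assumption that all free memory sits on NM-4 is illustrative, and the names are not the actual scheduler API:

{code}
// Cluster from the description: 4 NMs x 8GB = 32GB, reducers hold 29GB.
long clusterTotalGB = 32, allocatedGB = 29;
long reportedHeadroomGB = clusterTotalGB - allocatedGB;  // 3GB, as reported today

// Suppose all 3 free GB sit on the blacklisted NM-4 (illustrative split).
long freeOnBlacklistedGB = 3;
long usableHeadroomGB = Math.max(0, reportedHeadroomGB - freeOnBlacklistedGB);  // 0GB

// With 3GB of "headroom" the MRAppMaster thinks its maps will fit and never
// preempts a reducer; with the corrected 0GB it would preempt and the job
// would make progress instead of hanging.
{code}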
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253765#comment-14253765 ]

Craig Welch commented on YARN-1680:
-----------------------------------

Go for it :-) I thought I was free to work on it, but as soon as we switched the assignment I got too busy with other things.

availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
-------------------------------------------------------------------------------------------------------
Key: YARN-1680
URL: https://issues.apache.org/jira/browse/YARN-1680
Project: Hadoop YARN
Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
Environment: SuSE 11 SP2 + Hadoop-2.3
Reporter: Rohith
Assignee: Craig Welch
Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch

There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster slow start is set to 1. A job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers, because the headroom used in the reducer-preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but returns availableResources computed from total cluster free memory).
[jira] [Updated] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Welch updated YARN-1680:
------------------------------
Assignee: Chen He (was: Craig Welch)

availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
-------------------------------------------------------------------------------------------------------
Key: YARN-1680
URL: https://issues.apache.org/jira/browse/YARN-1680
Project: Hadoop YARN
Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
Environment: SuSE 11 SP2 + Hadoop-2.3
Reporter: Rohith
Assignee: Chen He
Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch

There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster slow start is set to 1. A job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers, because the headroom used in the reducer-preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but returns availableResources computed from total cluster free memory).
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253773#comment-14253773 ]

Chen He commented on YARN-1680:
-------------------------------

Thanks, [~cwelch].

availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
-------------------------------------------------------------------------------------------------------
Key: YARN-1680
URL: https://issues.apache.org/jira/browse/YARN-1680
Project: Hadoop YARN
Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
Environment: SuSE 11 SP2 + Hadoop-2.3
Reporter: Rohith
Assignee: Chen He
Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch

There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster slow start is set to 1. A job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers, because the headroom used in the reducer-preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but returns availableResources computed from total cluster free memory).
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253793#comment-14253793 ]

Hadoop QA commented on YARN-2946:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12688362/0001-YARN-2946.patch
against trunk revision 6635ccd.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
  org.apache.hadoop.yarn.server.resourcemanager.TestRM

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6156//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6156//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6156//console

This message is automatically generated.

DeadLocks in RMStateStore-ZKRMStateStore
----------------------------------------
Key: YARN-2946
URL: https://issues.apache.org/jira/browse/YARN-2946
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java

Found one deadlock in ZKRMStateStore.
# In the initial stage, zkClient is null because of a ZK disconnected event.
# While ZKRMStateStore#runWithCheck() waits (zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} from state-machine transition events. This causes a deadlock in ZKRMStateStore.
[jira] [Updated] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-2975:
-----------------------------------
Attachment: yarn-2975-2.patch

Updated patch to preserve the behavior of FSLeafQueue#removeApp and add FSLeafQueue#removeNonRunnableApp separately.

FSLeafQueue app lists are accessed without required locks
---------------------------------------------------------
Key: YARN-2975
URL: https://issues.apache.org/jira/browse/YARN-2975
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
Attachments: yarn-2975-1.patch, yarn-2975-2.patch

YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places.
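For context, a minimal sketch of the guarded-access pattern YARN-2910 introduced and that this patch extends to the remaining call sites — the field and method names are simplified, not FSLeafQueue's exact code:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Every read of the app lists takes the read lock; mutations take the
// write lock. Handing out the raw list via a getter bypasses this.
class LeafQueueSketch<A> {
  private final List<A> runnableApps = new ArrayList<>();
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

  int getNumRunnableApps() {
    rwLock.readLock().lock();
    try {
      return runnableApps.size();
    } finally {
      rwLock.readLock().unlock();
    }
  }

  void addApp(A app) {
    rwLock.writeLock().lock();
    try {
      runnableApps.add(app);
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
{code}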
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254086#comment-14254086 ]

Hadoop QA commented on YARN-2975:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12688412/yarn-2975-2.patch
against trunk revision d9e4d67.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
  org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6157//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6157//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6157//console

This message is automatically generated.

FSLeafQueue app lists are accessed without required locks
---------------------------------------------------------
Key: YARN-2975
URL: https://issues.apache.org/jira/browse/YARN-2975
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
Attachments: yarn-2975-1.patch, yarn-2975-2.patch

YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254095#comment-14254095 ]

Jian He commented on YARN-2964:
-------------------------------

bq. do you think this is something we can/should fix in YARN?
I think so. The RM is the designated renewer, so it should renew the token every so often. But because there's a bug in DelegationTokenRenewer, the RM just forgets the token and won't renew it automatically. So we should fix DelegationTokenRenewer to keep track of the token and renew it properly.

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------
Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job submitted with a token, and that first job controlled the token's cancellation. This prevented completing sub-jobs from canceling tokens still used by the main job. As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness interval) after log aggregation completes. The result is that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail if any sub-job is launched more than 10 min after another sub-job completes. If all sub-jobs complete within that 10 min window, the issue goes unnoticed.
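A hedged sketch of the first-job/reference-counting behavior the description says the RM lost — simplified stand-in code, not DelegationTokenRenewer's actual data structures:

{code}
import java.util.HashMap;
import java.util.Map;

// Track how many live apps reference each token; keep renewing while the
// count is positive and cancel only when the last app finishes.
class TokenTrackerSketch<T> {
  private final Map<T, Integer> appsUsingToken = new HashMap<>();

  synchronized void appSubmitted(T token) {
    Integer n = appsUsingToken.get(token);
    appsUsingToken.put(token, n == null ? 1 : n + 1);  // one renewal schedule, not one per app
  }

  synchronized void appFinished(T token, boolean cancelRequested) {
    Integer n = appsUsingToken.get(token);
    if (n == null) {
      return;
    }
    if (n > 1) {
      appsUsingToken.put(token, n - 1);  // a sub-job finished; the main job still needs it
    } else {
      appsUsingToken.remove(token);
      if (cancelRequested) {
        // cancel(token);                // only the last user may cancel
      }
    }
  }
}
{code}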
[jira] [Commented] (YARN-2738) Add FairReservationSystem for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254110#comment-14254110 ]

Karthik Kambatla commented on YARN-2738:
----------------------------------------

Thanks Carlo, makes sense. Sorry for the delay in getting to this.

The latest patch looks pretty good, except for one nit: a spurious change in the following snippet. I can take care of it at commit time.
{code}
String text = ((Text) field.getFirstChild()).getData();
{code}

However, I have some comments that might require some follow-up work:
# Should we have a default implementation of {{getAverageCapacity}} etc. in ReservationSchedulerConfiguration, and not require separate implementations in CS and FS?
# Would it make sense to have a common ReservationQueueConfiguration for both CS and FS?

Add FairReservationSystem for FairScheduler
-------------------------------------------
Key: YARN-2738
URL: https://issues.apache.org/jira/browse/YARN-2738
Project: Hadoop YARN
Issue Type: Sub-task
Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Attachments: YARN-2738.001.patch, YARN-2738.002.patch, YARN-2738.003.patch, YARN-2738.004.patch

Need to create a FairReservationSystem that will implement ReservationSystem for FairScheduler.
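On the first follow-up item, a hedged sketch of what a shared default in the abstract class could look like — the default value and exact signature are assumptions for illustration, not the committed design:

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: ReservationSchedulerConfiguration supplies the default so CS
// and FS only override when their config formats actually differ.
public abstract class ReservationSchedulerConfiguration extends Configuration {
  // Illustrative default; the real value would come from the design discussion.
  public static final float DEFAULT_AVERAGE_CAPACITY = 1.0f;

  public float getAverageCapacity(String queuePath) {
    return DEFAULT_AVERAGE_CAPACITY;
  }
}
{code}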
[jira] [Updated] (YARN-2574) Add support for FairScheduler to the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-2574:
-----------------------------------
Issue Type: New Feature (was: Improvement)

Add support for FairScheduler to the ReservationSystem
------------------------------------------------------
Key: YARN-2574
URL: https://issues.apache.org/jira/browse/YARN-2574
Project: Hadoop YARN
Issue Type: New Feature
Components: fairscheduler
Reporter: Subru Krishnan
Assignee: Anubhav Dhoot

YARN-1051 introduces the ReservationSystem, and the current implementation is based on the CapacityScheduler. This JIRA proposes adding support for the FairScheduler.
[jira] [Updated] (YARN-2852) WebUI Metrics: Add disk I/O resource information to the web ui and metrics
[ https://issues.apache.org/jira/browse/YARN-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Chiang updated YARN-2852:
-----------------------------
Labels: metrics supportability (was: )

WebUI Metrics: Add disk I/O resource information to the web ui and metrics
--------------------------------------------------------------------------
Key: YARN-2852
URL: https://issues.apache.org/jira/browse/YARN-2852
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan
Labels: metrics, supportability
Attachments: YARN-2852-1.patch
[jira] [Commented] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254178#comment-14254178 ]

Karthik Kambatla commented on YARN-2675:
----------------------------------------

Given we split up all the cases of ContainerDoneTransition, do we still need it?

the containersKilled metrics is not updated when the container is killed during localization.
---------------------------------------------------------------------------------------------
Key: YARN-2675
URL: https://issues.apache.org/jira/browse/YARN-2675
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Labels: metrics, supportability
Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch, YARN-2675.006.patch

The containersKilled metric is not updated when the container is killed during localization. We should add the KILLING state to the finished states in ContainerImpl.java so that killedContainer is updated.
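A hedged sketch of the gist of the fix under discussion — the class and method names are simplified, not ContainerImpl's exact transition code:

{code}
import org.apache.hadoop.yarn.state.SingleArcTransition;

// The killed-during-localization path must bump the killed-containers
// metric like every other terminal path.
static class KilledDuringLocalizationTransition
    implements SingleArcTransition<ContainerImpl, ContainerEvent> {
  @Override
  public void transition(ContainerImpl container, ContainerEvent event) {
    container.metrics.killedContainer();                          // the update this JIRA adds
    container.metrics.releaseContainer(container.getResource());  // free the metric's resources
    container.finished();                                         // proceed to DONE as before
  }
}
{code}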
[jira] [Updated] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated YARN-2423:
--------------------------------
Attachment: YARN-2423.005.patch

The 005 patch fixes the test failure; a previous test was leaking UGI settings. [~zjshen], can you take a look at the latest patch?

TimelineClient should wrap all GET APIs to facilitate Java users
----------------------------------------------------------------
Key: YARN-2423
URL: https://issues.apache.org/jira/browse/YARN-2423
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Robert Kanter
Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch

TimelineClient provides the Java method to put timeline entities. It would also be good to wrap all the GET APIs (both entity and domain) and deserialize the JSON responses into Java POJO objects.
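For reviewers skimming without the patch, a hedged sketch of the shape such a GET wrapper takes — the method name and REST path are assumptions (and Jackson 2 is used here for brevity); the patch defines the real API:

{code}
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

// GET /ws/v1/timeline/{entityType}/{entityId} and deserialize the JSON
// response into the TimelineEntity POJO.
public TimelineEntity getEntity(String base, String entityType, String entityId)
    throws IOException {
  URL url = new URL(base + "/ws/v1/timeline/" + entityType + "/" + entityId);
  HttpURLConnection conn = (HttpURLConnection) url.openConnection();
  try (InputStream in = conn.getInputStream()) {
    return new ObjectMapper().readValue(in, TimelineEntity.class);
  } finally {
    conn.disconnect();
  }
}
{code}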
[jira] [Commented] (YARN-2655) AllocatedGB/AvailableGB in nodemanager JMX showing only integer values
[ https://issues.apache.org/jira/browse/YARN-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254201#comment-14254201 ]

Karthik Kambatla commented on YARN-2655:
----------------------------------------

[~ywskycn] - the patch doesn't apply anymore. Mind updating it?

AllocatedGB/AvailableGB in nodemanager JMX showing only integer values
----------------------------------------------------------------------
Key: YARN-2655
URL: https://issues.apache.org/jira/browse/YARN-2655
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty
Assignee: Wei Yan
Priority: Minor
Attachments: YARN-2655-1.patch, screenshot-1.png, screenshot-2.png

AllocatedGB/AvailableGB in the NodeManager JMX show only integer values. Screenshots attached.
[jira] [Commented] (YARN-2655) AllocatedGB/AvailableGB in nodemanager JMX showing only integer values
[ https://issues.apache.org/jira/browse/YARN-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254202#comment-14254202 ]

Wei Yan commented on YARN-2655:
-------------------------------

[~kasha], sure, will do it soon.

AllocatedGB/AvailableGB in nodemanager JMX showing only integer values
----------------------------------------------------------------------
Key: YARN-2655
URL: https://issues.apache.org/jira/browse/YARN-2655
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty
Assignee: Wei Yan
Priority: Minor
Attachments: YARN-2655-1.patch, screenshot-1.png, screenshot-2.png

AllocatedGB/AvailableGB in the NodeManager JMX show only integer values. Screenshots attached.
[jira] [Created] (YARN-2982) Use ReservationQueueConfiguration in CapacityScheduler
Anubhav Dhoot created YARN-2982:
--------------------------------

Summary: Use ReservationQueueConfiguration in CapacityScheduler
Key: YARN-2982
URL: https://issues.apache.org/jira/browse/YARN-2982
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Anubhav Dhoot

ReservationQueueConfiguration is common to reservations irrespective of the scheduler. It would be good to have the CapacityScheduler support it as well.
[jira] [Updated] (YARN-2982) Use ReservationQueueConfiguration in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anubhav Dhoot updated YARN-2982:
--------------------------------
Parent Issue: YARN-2574 (was: YARN-2572)

Use ReservationQueueConfiguration in CapacityScheduler
------------------------------------------------------
Key: YARN-2982
URL: https://issues.apache.org/jira/browse/YARN-2982
Project: Hadoop YARN
Issue Type: Sub-task
Components: capacityscheduler, fairscheduler, resourcemanager
Reporter: Anubhav Dhoot

ReservationQueueConfiguration is common to reservations irrespective of the scheduler. It would be good to have the CapacityScheduler support it as well.
[jira] [Commented] (YARN-2738) Add FairReservationSystem for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254212#comment-14254212 ]

Anubhav Dhoot commented on YARN-2738:
-------------------------------------

Re 1: This is a configuration point that will need to be implemented based on each of the CS and FS configuration mechanisms.
Re 2: Added YARN-2982.

Thanks for the review, [~kasha].

Add FairReservationSystem for FairScheduler
-------------------------------------------
Key: YARN-2738
URL: https://issues.apache.org/jira/browse/YARN-2738
Project: Hadoop YARN
Issue Type: Sub-task
Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Attachments: YARN-2738.001.patch, YARN-2738.002.patch, YARN-2738.003.patch, YARN-2738.004.patch

Need to create a FairReservationSystem that will implement ReservationSystem for FairScheduler.
[jira] [Commented] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254216#comment-14254216 ]

zhihai xu commented on YARN-2675:
---------------------------------

Although we don't use it in the state machine directly, it is the base class of all the other added classes, so we still need it.

the containersKilled metrics is not updated when the container is killed during localization.
---------------------------------------------------------------------------------------------
Key: YARN-2675
URL: https://issues.apache.org/jira/browse/YARN-2675
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Labels: metrics, supportability
Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch, YARN-2675.006.patch

The containersKilled metric is not updated when the container is killed during localization. We should add the KILLING state to the finished states in ContainerImpl.java so that killedContainer is updated.
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254223#comment-14254223 ]

Anubhav Dhoot commented on YARN-2975:
-------------------------------------

Minor comment: the following comment might be misleading. One may assume it means the app will be removed regardless, and that the boolean return only indicates whether it happened to be non-runnable.
{noformat}
/**
 * @return true if the app was non-runnable, false otherwise
 */
public boolean removeNonRunnableApp(FSAppAttempt app) {
{noformat}
LGTM otherwise.

FSLeafQueue app lists are accessed without required locks
---------------------------------------------------------
Key: YARN-2975
URL: https://issues.apache.org/jira/browse/YARN-2975
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
Attachments: yarn-2975-1.patch, yarn-2975-2.patch

YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places.
[jira] [Commented] (YARN-2655) AllocatedGB/AvailableGB in nodemanager JMX showing only integer values
[ https://issues.apache.org/jira/browse/YARN-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254225#comment-14254225 ]

Wei Yan commented on YARN-2655:
-------------------------------

Problem already solved in YARN-1156. Closing it.

AllocatedGB/AvailableGB in nodemanager JMX showing only integer values
----------------------------------------------------------------------
Key: YARN-2655
URL: https://issues.apache.org/jira/browse/YARN-2655
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.1
Reporter: Nishan Shetty
Assignee: Wei Yan
Priority: Minor
Attachments: YARN-2655-1.patch, screenshot-1.png, screenshot-2.png

AllocatedGB/AvailableGB in the NodeManager JMX show only integer values. Screenshots attached.
[jira] [Commented] (YARN-2738) Add FairReservationSystem for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254249#comment-14254249 ]

Karthik Kambatla commented on YARN-2738:
----------------------------------------

+1. Checking this in.

Add FairReservationSystem for FairScheduler
-------------------------------------------
Key: YARN-2738
URL: https://issues.apache.org/jira/browse/YARN-2738
Project: Hadoop YARN
Issue Type: Sub-task
Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Attachments: YARN-2738.001.patch, YARN-2738.002.patch, YARN-2738.003.patch, YARN-2738.004.patch

Need to create a FairReservationSystem that will implement ReservationSystem for FairScheduler.
[jira] [Commented] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254264#comment-14254264 ]

Hitesh Shah commented on YARN-868:
----------------------------------

[~vinodkv] Mind taking a look?

YarnClient should set the service address in tokens returned by getRMDelegationToken()
---------------------------------------------------------------------------------------
Key: YARN-868
URL: https://issues.apache.org/jira/browse/YARN-868
Project: Hadoop YARN
Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Varun Saxena
Attachments: YARN-868.patch

Either the client should set this information into the token, or the client layer should expose an API that returns the service address.
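For reference, a hedged sketch of the workaround callers use today — i.e., roughly what this JIRA would fold into the client — assuming a started YarnClient ({{yarnClient}}), its Configuration ({{conf}}), and a renewer string are in scope:

{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier;
import org.apache.hadoop.yarn.util.ConverterUtils;

// The record returned by getRMDelegationToken() carries no usable service
// field, so the caller has to attach the RM address when converting it.
org.apache.hadoop.yarn.api.records.Token rmToken =
    yarnClient.getRMDelegationToken(new Text(renewer));
InetSocketAddress rmAddress = conf.getSocketAddr(
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_PORT);
Token<RMDelegationTokenIdentifier> usableToken =
    ConverterUtils.convertFromYarn(rmToken, rmAddress);  // sets service to "host:port"
{code}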
[jira] [Commented] (YARN-2574) Add support for FairScheduler to the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254261#comment-14254261 ]

Hudson commented on YARN-2574:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #6762 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6762/])
YARN-2738. [YARN-2574] Add FairReservationSystem for FairScheduler. (Anubhav Dhoot via kasha) (kasha: rev a22ffc318801698e86cd0e316b4824015f2486ac)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationFileLoaderService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/ReservationSystemTestUtil.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractReservationSystem.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestFairReservationSystem.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestAllocationFileLoaderService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/FairReservationSystem.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/ReservationQueueConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java

Add support for FairScheduler to the ReservationSystem
------------------------------------------------------
Key: YARN-2574
URL: https://issues.apache.org/jira/browse/YARN-2574
Project: Hadoop YARN
Issue Type: New Feature
Components: fairscheduler
Reporter: Subru Krishnan
Assignee: Anubhav Dhoot

YARN-1051 introduces the ReservationSystem, and the current implementation is based on the CapacityScheduler. This JIRA proposes adding support for the FairScheduler.
[jira] [Commented] (YARN-2738) Add FairReservationSystem for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254263#comment-14254263 ]

Hudson commented on YARN-2738:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #6762 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6762/])
YARN-2738. [YARN-2574] Add FairReservationSystem for FairScheduler. (Anubhav Dhoot via kasha) (kasha: rev a22ffc318801698e86cd0e316b4824015f2486ac)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationFileLoaderService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/ReservationSystemTestUtil.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractReservationSystem.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestFairReservationSystem.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestAllocationFileLoaderService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/FairReservationSystem.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/ReservationQueueConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java

Add FairReservationSystem for FairScheduler
-------------------------------------------
Key: YARN-2738
URL: https://issues.apache.org/jira/browse/YARN-2738
Project: Hadoop YARN
Issue Type: Sub-task
Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Attachments: YARN-2738.001.patch, YARN-2738.002.patch, YARN-2738.003.patch, YARN-2738.004.patch

Need to create a FairReservationSystem that will implement ReservationSystem for FairScheduler.
[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254275#comment-14254275 ] Hadoop QA commented on YARN-2423: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688447/YARN-2423.005.patch against trunk revision 6f1e366. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 36 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice: org.apache.hadoop.yarn.client.api.impl.TestTimelineClient Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6158//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6158//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6158//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-applicationhistoryservice.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6158//console This message is automatically generated. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
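As a rough illustration of the wrapping this JIRA asks for, the sketch below fetches one entity over the timeline server's REST endpoint and deserializes the JSON response into the TimelineEntity POJO. The wrapper class, its constructor, and the host in the comment are hypothetical; only the /ws/v1/timeline path and the TimelineEntity type come from YARN itself:
{code}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.codehaus.jackson.map.ObjectMapper;

public class TimelineGetWrapper {
  // e.g. "http://timeline.example.com:8188/ws/v1/timeline" (hypothetical host)
  private final String baseUrl;
  private final ObjectMapper mapper = new ObjectMapper();

  public TimelineGetWrapper(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  // GET /ws/v1/timeline/{entityType}/{entityId} and map the JSON body onto
  // the TimelineEntity POJO instead of handing raw JSON to Java callers.
  public TimelineEntity getEntity(String entityType, String entityId)
      throws IOException {
    URL url = new URL(baseUrl + "/" + entityType + "/" + entityId);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      return mapper.readValue(conn.getInputStream(), TimelineEntity.class);
    } finally {
      conn.disconnect();
    }
  }
}
{code}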
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254283#comment-14254283 ] Robert Kanter commented on YARN-2975: - +1 after clarifying the comment that Anubhav pointed out FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-2975-1.patch, yarn-2975-2.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2975: --- Attachment: yarn-2975-3.patch Thanks Anubhav. Updated the comment to be clearer. The test failures and findbugs warnings look unrelated. FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254296#comment-14254296 ] Karthik Kambatla commented on YARN-2975: Thanks for the review, Robert. I'll go ahead and commit this if Jenkins doesn't complain of any new issues. FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
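The pattern behind this fix is easiest to see in miniature. A minimal sketch, assuming the queue guards its app lists with the ReentrantReadWriteLock that YARN-2910 introduced; class, field, and method names here are illustrative, not FSLeafQueue's actual members:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LeafQueueSketch<A> {
  private final List<A> runnableApps = new ArrayList<A>();
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

  // Read path: take the read lock and hand back a snapshot, so no caller
  // ever iterates the live list without holding the lock.
  List<A> getCopyOfRunnableApps() {
    rwLock.readLock().lock();
    try {
      return new ArrayList<A>(runnableApps);
    } finally {
      rwLock.readLock().unlock();
    }
  }

  // Write path: mutations go through the write lock.
  void addRunnableApp(A app) {
    rwLock.writeLock().lock();
    try {
      runnableApps.add(app);
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
{code}
Exposing only a locked copy (rather than a getter to the live list) is what closes the hole described in the issue: callers outside FSLeafQueue can no longer read the lists unsynchronized.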
[jira] [Commented] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254302#comment-14254302 ] Karthik Kambatla commented on YARN-2675: bq. it is the base class of all other added classes
Never mind, I am not the brightest today. Forgot that the child classes call super.transition. the containersKilled metrics is not updated when the container is killed during localization. - Key: YARN-2675 URL: https://issues.apache.org/jira/browse/YARN-2675 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch, YARN-2675.006.patch The containersKilled metrics is not updated when the container is killed during localization. We should add KILLING state in finished of ContainerImpl.java to update killedContainer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2675) containersKilled metrics is not updated when the container is killed during localization
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2675: --- Summary: containersKilled metrics is not updated when the container is killed during localization (was: the containersKilled metrics is not updated when the container is killed during localization.) containersKilled metrics is not updated when the container is killed during localization Key: YARN-2675 URL: https://issues.apache.org/jira/browse/YARN-2675 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch, YARN-2675.006.patch The containersKilled metrics is not updated when the container is killed during localization. We should add KILLING state in finished of ContainerImpl.java to update killedContainer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254307#comment-14254307 ] Karthik Kambatla commented on YARN-2675: The latest patch looks good; the findbugs warnings look unrelated. +1. Checking this in. the containersKilled metrics is not updated when the container is killed during localization. - Key: YARN-2675 URL: https://issues.apache.org/jira/browse/YARN-2675 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch, YARN-2675.006.patch The containersKilled metrics is not updated when the container is killed during localization. We should add KILLING state in finished of ContainerImpl.java to update killedContainer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2675) containersKilled metrics is not updated when the container is killed during localization
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254332#comment-14254332 ] Hudson commented on YARN-2675: -- FAILURE: Integrated in Hadoop-trunk-Commit #6764 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6764/]) YARN-2675. containersKilled metrics is not updated when the container is killed during localization. (Zhihai Xu via kasha) (kasha: rev 954fb8581ec6d7d389ac5d6f94061760a29bc309) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/metrics/NodeManagerMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java containersKilled metrics is not updated when the container is killed during localization Key: YARN-2675 URL: https://issues.apache.org/jira/browse/YARN-2675 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch, YARN-2675.006.patch The containersKilled metrics is not updated when the container is killed during localization. We should add KILLING state in finished of ContainerImpl.java to update killedContainer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
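The shape of the fix described in this issue is roughly the following; the switch is a simplified sketch of ContainerImpl#finished(), not the exact committed diff (the real method handles more states):
{code}
// Simplified sketch: on container completion, route the terminal state to
// the right NodeManagerMetrics counter. The KILLING case is the one the
// JIRA adds, covering containers killed before they ever reached RUNNING
// (e.g. during localization).
private void finished() {
  switch (getContainerState()) {
  case EXITED_WITH_SUCCESS:
    metrics.endRunningContainer();
    metrics.completedContainer();
    break;
  case EXITED_WITH_FAILURE:
    metrics.endRunningContainer();
    metrics.failedContainer();
    break;
  case KILLING: // previously missing: killed during localization
    metrics.killedContainer();
    break;
  default:
    break;
  }
}
{code}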
[jira] [Commented] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254352#comment-14254352 ] Hadoop QA commented on YARN-868: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12661447/YARN-868.patch against trunk revision 390a7c1. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 35 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6161//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6161//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6161//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6161//console This message is automatically generated. YarnClient should set the service address in tokens returned by getRMDelegationToken() -- Key: YARN-868 URL: https://issues.apache.org/jira/browse/YARN-868 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Varun Saxena Attachments: YARN-868.patch Either the client should set this information into the token or the client layer should expose an api that returns the service address. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
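For context, a minimal sketch of the first option (the client stamping the service address onto the token). ConverterUtils.convertFromYarn and Configuration.getSocketAddr are existing Hadoop APIs; the fetcher class and the way it resolves the RM address are assumptions for illustration:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class RMTokenFetcher {
  // Fetch the RM delegation token and stamp the RM's address into its
  // service field, so downstream consumers can locate the RM to renew or
  // cancel the token.
  public static Token<? extends TokenIdentifier> fetch(
      YarnClient yarnClient, Configuration conf, String renewer)
      throws YarnException, IOException {
    org.apache.hadoop.yarn.api.records.Token rmToken =
        yarnClient.getRMDelegationToken(new Text(renewer));
    InetSocketAddress rmAddress = getRMAddress(conf);
    // convertFromYarn with an address sets the token's service field
    return ConverterUtils.convertFromYarn(rmToken, rmAddress);
  }

  private static InetSocketAddress getRMAddress(Configuration conf) {
    // Illustrative: resolve the RM's client RPC address from configuration.
    return conf.getSocketAddr("yarn.resourcemanager.address",
        "0.0.0.0:8032", 8032);
  }
}
{code}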
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254376#comment-14254376 ] Jian He commented on YARN-2946: --- [~rohithsharma], I had a quick look at the patch. One comment: in each store/update method, instead of doing this:
{code}
if (isFencedState()) {
  LOG.info("State store is in Fenced state. Can't remove RM Delegation "
      + "Token Master key.");
  return;
}
this.stateMachine.doTransition(RMStateStoreEventType.UPDATE_AMRM_TOKEN,
    new RMStateStoreAMRMTokenEvent(amrmTokenSecretManagerState, isUpdate,
        RMStateStoreEventType.UPDATE_AMRM_TOKEN));
{code}
we can do this:
{code}
handleStoreEvent(RMStateStoreEvent event)
{code}
DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a zk Disconnected event. # When ZKRMStateStore#runWithCheck() calls wait(zkSessionTimeout) for zkClient to re-establish the zookeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} via state-machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
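Concretely, the consolidation being suggested looks something like the sketch below; it follows the names in the snippet above, but the dispatch body is an assumption about the eventual patch:
{code}
// Single entry point: the fenced-state check lives here once, instead of
// being repeated at the top of every store/update method.
private void handleStoreEvent(RMStateStoreEvent event) {
  if (isFencedState()) {
    LOG.info("State store is in Fenced state. Ignoring event: "
        + event.getType());
    return;
  }
  this.stateMachine.doTransition(event.getType(), event);
}

// Each store/update method then reduces to a one-line call, e.g.:
// handleStoreEvent(new RMStateStoreAMRMTokenEvent(
//     amrmTokenSecretManagerState, isUpdate,
//     RMStateStoreEventType.UPDATE_AMRM_TOKEN));
{code}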
[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager
[ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254382#comment-14254382 ] Ming Ma commented on YARN-914: -- [~djp], thanks for working on this. It looks like we are going to use YARN-291 and thus the "drain the state" approach, instead of the more complicated "migrate the state" approach. So YARN will reduce the capacity of the nodes as part of the decommission process until all their map outputs are fetched or until all the applications the nodes touch have completed? In addition, it will be interesting to understand how you handle long-running jobs. FYI, https://issues.apache.org/jira/browse/YARN-1996 will drain containers of unhealthy nodes. Support graceful decommission of nodemanager Key: YARN-914 URL: https://issues.apache.org/jira/browse/YARN-914 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Luke Lu Assignee: Junping Du When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable to minimize the impact on running applications. Currently if an NM is decommissioned, all running containers on the NM need to be rescheduled on other NMs. Furthermore, for finished map tasks, if their map outputs have not been fetched by the reducers of the job, these map tasks will need to be rerun as well. We propose to introduce a mechanism to optionally gracefully decommission a node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2952) Incorrect version check in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254395#comment-14254395 ] Hudson commented on YARN-2952: -- FAILURE: Integrated in Hadoop-trunk-Commit #6765 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6765/]) YARN-2952. Fixed incorrect version check in StateStore. Contributed by Rohith Sharmaks (jianhe: rev 808cba3821d5bc4267f69d14220757f01cd55715) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/CHANGES.txt Incorrect version check in RMStateStore --- Key: YARN-2952 URL: https://issues.apache.org/jira/browse/YARN-2952 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2952.patch In RMStateStore#checkVersion: if we modify CURRENT_VERSION_INFO to 2.0, it'll still store the version as 1.0, which is incorrect. The same thing might happen to the NM store and the timeline store.
{code}
// if there is no version info, treat it as 1.0;
if (loadedVersion == null) {
  loadedVersion = Version.newInstance(1, 0);
}
if (loadedVersion.isCompatibleTo(getCurrentVersion())) {
  LOG.info("Storing RM state version info " + getCurrentVersion());
  storeVersion();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
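A hedged reconstruction of what a corrected checkVersion might look like, given the description above: when no version is stored, or after confirming compatibility, persist the current version rather than the loaded one. RMStateVersionIncompatibleException and the helper names follow RMStateStore's existing API, but the exact body is illustrative, not the committed diff:
{code}
Version loadedVersion = loadVersion();
LOG.info("Loaded RM state version info " + loadedVersion);
if (loadedVersion != null && loadedVersion.equals(getCurrentVersion())) {
  return; // store already holds the current version, nothing to do
}
// If there is no version info, or the loaded version is compatible,
// (re)store CURRENT_VERSION_INFO -- not the loaded 1.0 version.
if (loadedVersion == null
    || loadedVersion.isCompatibleTo(getCurrentVersion())) {
  LOG.info("Storing RM state version info " + getCurrentVersion());
  storeVersion();
} else {
  throw new RMStateVersionIncompatibleException(
      "Expecting RM state version " + getCurrentVersion()
          + ", but loading version " + loadedVersion);
}
{code}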