[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275183#comment-14275183 ]

Yi Liu commented on YARN-2964:
------------------------------

It seems this JIRA causes the token not to be renewed properly when it is shared by jobs (oozie). I filed YARN-3055; please take a look.

RM prematurely cancels tokens for jobs that submit jobs (oozie)
---------------------------------------------------------------

Key: YARN-2964
URL: https://issues.apache.org/jira/browse/YARN-2964
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Jian He
Priority: Blocker
Fix For: 2.7.0
Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch

The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job.

As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals.

The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM liveliness interval) after log aggregation completes. As a result, an oozie job (e.g. pig) that launches many sub-jobs over time will fail if any sub-job is launched 10 min after any sub-job completes. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
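For readers following the thread, the "first job controls cancellation" behaviour described in the issue can be sketched roughly as below. This is only an illustration under assumptions: `TokenOwnerRegistry`, its methods, and the use of plain strings for tokens and application IDs are hypothetical and are not taken from DelegationTokenRenewer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual DelegationTokenRenewer code.
// Tokens and application IDs are modelled as plain strings.
final class TokenOwnerRegistry {
    // token id -> id of the first application submitted with that token
    private final Map<String, String> ownerByToken = new HashMap<>();

    /** Records appId as the owner only if the token has no owner yet. */
    void register(String tokenId, String appId) {
        ownerByToken.putIfAbsent(tokenId, appId);
    }

    /**
     * Called when an application finishes. Returns true only when the
     * finishing application owns the token, i.e. when the token may be
     * cancelled; a finishing sub-job leaves the entry untouched.
     */
    boolean finish(String tokenId, String appId) {
        // remove(key, value) removes the entry only if the owner matches.
        return ownerByToken.remove(tokenId, appId);
    }

    boolean isTracked(String tokenId) {
        return ownerByToken.containsKey(tokenId);
    }
}
```

With such a registry, a sub-job that registers an already-tracked token never becomes its owner, so its completion cannot trigger cancellation while the launcher job is still running.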
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253229#comment-14253229 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #46 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/46/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253246#comment-14253246 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #780 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/780/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253440#comment-14253440 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1978 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1978/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253455#comment-14253455 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #43 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/43/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253502#comment-14253502 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #47 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/47/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253523#comment-14253523 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1997 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1997/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254095#comment-14254095 ]

Jian He commented on YARN-2964:
-------------------------------

bq. do you think this is something we can/should fix in YARN?

I think so. RM is the designated renewer so it should renew the token every so often. But because there's a bug in DelegationTokenRenewer, RM just forgets the token and won't renew the token automatically. So we should fix this in DelegationTokenRenewer to keep track of the token and renew the token properly.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251466#comment-14251466 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #45 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/45/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251481#comment-14251481 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #779 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/779/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251708#comment-14251708 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #42 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/42/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251722#comment-14251722 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1977 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1977/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251766#comment-14251766 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #46 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/46/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251804#comment-14251804 ]

Hudson commented on YARN-2964:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1996 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1996/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251818#comment-14251818 ] Jason Lowe commented on YARN-2964: -- Thanks for the patch, Jian! Findbug warnings appear to be unrelated. I'm wondering about the change in the removeApplicationFromRenewal method or remove. If a sub-job completes, won't we remove the token from the allTokens map before the launcher job has completed? Then a subsequent sub-job that requests token cancelation can put the token back in the map and cause the token to be canceled when it leaves. I think we need to repeat the logic from the original code before YARN-2704 here, i.e.: only remove the token if the application ID matches. That way the launcher job's token will remain _the_ token in that collection until the launcher job completes. This comment doesn't match the code, since the code looks like if any token wants to cancel at the end then we will cancel at the end. {code} // If any of the jobs sharing the same token set shouldCancelAtEnd // to true, we should not cancel the token. if (evt.shouldCancelAtEnd) { dttr.shouldCancelAtEnd = evt.shouldCancelAtEnd; } {code} I think the logic and comment should be if any job doesn't want to cancel then we won't cancel. The code seems to be trying to do the opposite, so I'm not sure how the unit test is passing. Maybe I'm missing something. The info log message added in handleAppSubmitEvent also is misleading, as it says we are setting shouldCancelAtEnd to whatever the event said, when in reality we only set it sometimes. Probably needs to be inside the conditional. Wonder if we should be using a Set instead of a Map to track these tokens. Adding an already existing DelegationTokenToRenew in a set will not change the one already there, but with the map a sub-job can clobber the DelegationTokenToRenew that's already there with its own when it does the allTokens.put(dtr.token, dtr). 
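The Set-versus-Map point in the comment above can be illustrated with a minimal, self-contained sketch. It is not the RM code: `TrackedToken` is a hypothetical stand-in for DelegationTokenToRenew, and it assumes equality is based on the token alone, which is what the discussion implies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SetVersusMapDemo {
    // Hypothetical stand-in for DelegationTokenToRenew.
    // Assumption: equals/hashCode look only at the token, never at the app ID.
    static final class TrackedToken {
        final String token;
        final String appId;

        TrackedToken(String token, String appId) {
            this.token = token;
            this.appId = appId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof TrackedToken && ((TrackedToken) o).token.equals(token);
        }

        @Override
        public int hashCode() {
            return token.hashCode();
        }
    }

    // With a Set, the second add is a no-op, so the launcher's entry survives.
    static String ownerInSet() {
        Set<TrackedToken> allTokens = new HashSet<>();
        allTokens.add(new TrackedToken("t1", "launcher"));
        allTokens.add(new TrackedToken("t1", "sub-job"));
        return allTokens.iterator().next().appId;
    }

    // With a Map keyed by token, the second put replaces the launcher's entry.
    static String ownerInMap() {
        Map<String, TrackedToken> allTokens = new HashMap<>();
        allTokens.put("t1", new TrackedToken("t1", "launcher"));
        allTokens.put("t1", new TrackedToken("t1", "sub-job"));
        return allTokens.get("t1").appId;
    }

    public static void main(String[] args) {
        System.out.println("set owner: " + ownerInSet());
        System.out.println("map owner: " + ownerInMap());
    }
}
```

In this sketch the set keeps the launcher as the owner while the map ends up owned by the sub-job, which is the clobbering behavior the comment warns about.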
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252045#comment-14252045 ] Jian He commented on YARN-2964:

Thanks for your comments, Jason!

bq. I'm wondering about the change in the removeApplicationFromRenewal method or remove.
If launcher job first gets added to the appTokens map, DelegationTokenRenewer will not add DelegationTokenToRenew instance for the sub-job. So the token lookup in removeApplicationFromRenewal will return empty for the sub-job when the sub-job completes, and the token won’t be removed from allTokens. My only concern with a global set is that each time an application completes, we end up looping over all the applications, or worse (each app may have at least one token).

bq. This comment doesn't match the code
Good catch, what a mistake. I was under the impression that the semantics were “shouldKeepAtEnd”. I added one line in the test case to guard against this.

bq. Wonder if we should be using a Set instead of a Map to track these tokens
I thought about that too. The reason I switched to a map is to get the DelegationTokenToRenew instance based on the token the app provided and change the shouldCancelAtEnd field on submission.
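The semantics the reviewers converge on ("if any job doesn't want to cancel then we won't cancel") can be sketched as follows. This is an illustration under stated assumptions, not the committed patch: `TrackedToken` and `mergeSubmission` are hypothetical names, and only the shouldCancelAtEnd field name comes from the thread.

```java
class CancelFlagDemo {
    // Hypothetical shared entry for a token used by several jobs.
    static final class TrackedToken {
        boolean shouldCancelAtEnd;

        TrackedToken(boolean shouldCancelAtEnd) {
            this.shouldCancelAtEnd = shouldCancelAtEnd;
        }
    }

    // Merge a later submission's preference into the shared entry.
    // Once any job asks to keep the token, the flag stays false.
    static void mergeSubmission(TrackedToken tracked, boolean submittedShouldCancelAtEnd) {
        if (!submittedShouldCancelAtEnd) {
            tracked.shouldCancelAtEnd = false;
        }
    }

    public static void main(String[] args) {
        TrackedToken shared = new TrackedToken(true);
        mergeSubmission(shared, false); // one job wants the token kept
        mergeSubmission(shared, true);  // a later job cannot flip it back
        System.out.println("shouldCancelAtEnd = " + shared.shouldCancelAtEnd);
    }
}
```

The flag is deliberately one-way: a later submission that is happy to cancel does not override an earlier request to keep the token alive.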
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252216#comment-14252216 ] Hadoop QA commented on YARN-2964:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688092/YARN-2964.2.patch against trunk revision 07619aa.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRM
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6149//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6149//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6149//console

This message is automatically generated.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252218#comment-14252218 ] Jason Lowe commented on YARN-2964:

bq. If launcher job first gets added to the appTokens map, DelegationTokenRenewer will not add DelegationTokenToRenew instance for the sub-job.
Ah, sorry, I missed this critical change from the original patch. However if we don't add the delegation token for each sub-job then I think we have a problem with the following use-case:
# Oozie launcher submits a MapReduce sub-job
# MapReduce job starts
# Oozie launcher job leaves
# MapReduce job now running with a token that the RM has forgotten and won't be automatically renewed

We might have had the same issue in this case prior to YARN-2704, since the token would be pulled from the set when the launcher completed.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252243#comment-14252243 ] Jian He commented on YARN-2964:

bq. We might have had the same issue in this case prior to YARN-2704.
Yes, this is an existing issue. As Robert pointed out in a previous comment, an Oozie MapReduce sub-job currently cannot run beyond 24 hrs. IMO, we can fix this separately?
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252259#comment-14252259 ] Jason Lowe commented on YARN-2964:

Sure, we can fix that as a followup issue since it's no worse than what we had before. +1 lgtm, only nit is the new getAllTokens method should be package-private instead of public but not a big deal either way. I assume the test failures are unrelated?
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252286#comment-14252286 ] Jian He commented on YARN-2964:

I believe the failures are not related. I just changed the visibility and uploaded a new patch to re-kick jenkins.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252473#comment-14252473 ] Hadoop QA commented on YARN-2964:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688133/YARN-2964.3.patch against trunk revision b9d4976.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
org.apache.hadoop.yarn.server.resourcemanager.TestRM

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6150//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6150//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6150//console

This message is automatically generated.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252510#comment-14252510 ] Jason Lowe commented on YARN-2964:

+1 lgtm. I don't believe the test failures are related since they pass for me locally. Committing this.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252563#comment-14252563 ] Hudson commented on YARN-2964:

FAILURE: Integrated in Hadoop-trunk-Commit #6755 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6755/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252611#comment-14252611 ] Jian He commented on YARN-2964:

Thanks for reviewing and committing, Jason!
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252830#comment-14252830 ] Robert Kanter commented on YARN-2964:

Thanks for fixing this. [~jianhe], [~jlowe], on the 24 hrs thing, do you think this is something we can/should fix in YARN? My understanding of this issue is that it's by design (there's even a config for the interval). Given that, I'm thinking the proper fix for this is just to have the launcher job periodically renew the token (a fix in OOZIE)?
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250453#comment-14250453 ] Hudson commented on YARN-2964:

SUCCESS: Integrated in Hadoop-trunk-Commit #6736 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6736/])
YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250691#comment-14250691 ] Jian He commented on YARN-2964:

bq. Once the token was stashed in the set, subsequent attempts from sub-jobs to store the token would silently be ignored because the token was already in the set.
After digging into the code, I found that even if we are not canceling the token when the flag is set, we still remove the token from the global set. This means that if a sub-job doesn't set the flag, the token will be added to the global set again, and once the sub-job finishes the token is canceled. I'm wondering how this worked before. [~jlowe], [~daryn], could you shed some light on this?
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250802#comment-14250802 ] Jason Lowe commented on YARN-2964:

IIUC it worked in the past because typically the Oozie launcher job hangs around waiting for all the sub-jobs to complete (e.g.: launcher is running a pig client). Since the launcher job was the first to request the token, it's the one that remains in the set. Any attempt to add the token by a sub-job will not actually add it because of the way the hashcode and equals methods on DelegationTokenToRenew work. Therefore when a sub-job completes and it tries to remove the tokens, this token will not match because the app ID is for the launcher and not the sub-job.
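The pre-YARN-2704 behavior described in the comment above can be modeled in a few lines. This is a simplified sketch, not the historical RM code: the class and method names are hypothetical, and it assumes token-only equality plus removal by matching application ID, as the explanation states.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class GlobalTokenSetDemo {
    // Hypothetical stand-in for DelegationTokenToRenew; equality is token-only (assumption).
    static final class TrackedToken {
        final String token;
        final String appId;

        TrackedToken(String token, String appId) {
            this.token = token;
            this.appId = appId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof TrackedToken && ((TrackedToken) o).token.equals(token);
        }

        @Override
        public int hashCode() {
            return token.hashCode();
        }
    }

    private final Set<TrackedToken> tokens = new HashSet<>();

    // Returns true only if the token was not tracked yet; later adds are ignored.
    boolean submit(String appId, String token) {
        return tokens.add(new TrackedToken(token, appId));
    }

    // Removes only the entries owned by the completing app; returns how many were removed.
    int complete(String appId) {
        int removed = 0;
        for (Iterator<TrackedToken> it = tokens.iterator(); it.hasNext();) {
            if (it.next().appId.equals(appId)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean isTracked(String token) {
        return tokens.contains(new TrackedToken(token, ""));
    }

    public static void main(String[] args) {
        GlobalTokenSetDemo rm = new GlobalTokenSetDemo();
        rm.submit("launcher", "t1");
        rm.submit("sub-job", "t1");   // ignored, launcher already owns t1
        rm.complete("sub-job");       // removes nothing
        System.out.println("still tracked: " + rm.isTracked("t1"));
    }
}
```

Because the sub-job never owns the entry, its completion leaves the token in place, and the token is dropped only when the launcher itself completes.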
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250812#comment-14250812 ] Jian He commented on YARN-2964:

I see, I missed the part that the launcher job will wait for sub-jobs to complete. Thanks for your explanation!
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250819#comment-14250819 ] Karthik Kambatla commented on YARN-2964:
IIRC, the launcher job waits for all actions but the MR action. As an optimization, Oozie started exiting the launcher for pure MR actions. [~rkanter]?
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250926#comment-14250926 ] Robert Kanter commented on YARN-2964:
[~kasha] is correct. The launcher job waits around for all action types that typically submit other MR jobs (Pig, Sqoop, Hive, etc.) except for the MapReduce action, which finishes immediately after submitting the real MR job.
I just checked, and in the MR launcher, Oozie sets {{mapreduce.job.complete.cancel.delegation.tokens}} to {{true}} and in the other launchers, Oozie sets it to {{false}}. Oozie doesn't touch this property in any real launched MR jobs, so they'll use the default, which I'm guessing is {{true}}. Though thinking about this now, it seems like these are backwards, so I'm not sure how that's working right.
On a related note, we did see an issue recently where a launched job that took over 24 hours would cause the launcher to fail with a delegation token issue because the token expired, even with the property explicitly set correctly. The problem was that {{yarn.resourcemanager.delegation.token.renew-interval}} was set to 24 hours (the default), and if you don't renew (or use?) a delegation token at least every 24 hours, then it automatically expires. [~daryn], perhaps in the original issue this was set to 10 minutes? I haven't had a chance to look into this, but the fix for this particular issue would be to have the launcher job renew the token at some interval.
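The idea of renewing the token "at some interval" could look roughly like the sketch below. It only models the timing arithmetic and assumes the 24-hour renew interval quoted in the comment; `RenewalPlan`, its method names, and the choice to renew halfway through the interval are all hypothetical, and nothing here calls a real Hadoop or Oozie API.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical timing helper: when to renew, and when a token has lapsed. */
final class RenewalPlan {
    /** 24 hours, the renew interval quoted in the comment above. */
    static final long RENEW_INTERVAL_MS = TimeUnit.HOURS.toMillis(24);

    /** Renew after half the interval has elapsed; an arbitrary illustrative margin. */
    static final double RENEW_FRACTION = 0.5;

    /** Time at which the next renewal should be attempted. */
    static long nextRenewalAt(long lastRenewedMs, long renewIntervalMs) {
        return lastRenewedMs + (long) (renewIntervalMs * RENEW_FRACTION);
    }

    /** True when a full interval has passed without a renewal. */
    static boolean isExpired(long lastRenewedMs, long renewIntervalMs, long nowMs) {
        return nowMs - lastRenewedMs >= renewIntervalMs;
    }
}
```

With these numbers, a token last renewed at hour 0 is expired by hour 25, while one renewed at hour 12 is still valid at hour 25, which is the scenario a long-running launched job would need.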
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250958#comment-14250958 ] Jian He commented on YARN-2964:
bq. we did see an issue recently where a launched job that took over 24 hours would cause the launcher to fail with a delegation token issue because the token expired;
This is because the token is removed from the RM DelegationTokenRenewer even though the flag is set to false. Hence, the RM won't renew the token. This will cause the Oozie job to fail after 24 hrs, which should be an existing issue.
I'm working on a patch so that this is no worse than before. The patch is based on the assumption that the launcher job waits for all actions to complete. In addition, I think it may make sense for Oozie to propagate this flag to other actions as well. Alternatively, we could introduce an application group ID to identify a group of applications, as in the Oozie case, tie the token lifetime to the group, and drop this flag completely.
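The application-group idea floated in this comment could be sketched as reference counting per group. This is purely illustrative: `GroupTokenLifetime`, the group ID strings, and the method names are hypothetical, and no such API is claimed to exist in YARN.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker that ties token lifetime to a group of applications. */
final class GroupTokenLifetime {
    // groupId -> IDs of applications in the group that are still running
    private final Map<String, Set<String>> runningAppsByGroup = new HashMap<>();

    void appStarted(String groupId, String appId) {
        runningAppsByGroup.computeIfAbsent(groupId, k -> new HashSet<>()).add(appId);
    }

    /**
     * Records that an app finished. Returns true only when it was the last
     * running app in its group, meaning the group's tokens may now be cancelled.
     */
    boolean appFinished(String groupId, String appId) {
        Set<String> apps = runningAppsByGroup.get(groupId);
        if (apps == null || !apps.remove(appId)) {
            return false;  // unknown group or app: nothing to cancel
        }
        if (apps.isEmpty()) {
            runningAppsByGroup.remove(groupId);
            return true;
        }
        return false;
    }
}
```

Under this scheme no per-job cancel flag is needed: a finishing sub-job cannot trigger cancellation while the launcher or a sibling is still running in the same group.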
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250968#comment-14250968 ] Robert Kanter commented on YARN-2964:
+1 to the idea of groups. Canceling/not canceling the token the way we do now seems kinda hacky.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1425#comment-1425 ] Hadoop QA commented on YARN-2964:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12687918/YARN-2964.1.patch against trunk revision 1050d42.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6140//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6140//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6140//console
This message is automatically generated.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248470#comment-14248470 ] Jason Lowe commented on YARN-2964:
bq. AFAIR, this code never had the concept of a first job. An app submits tokens, there was a flat list of tokens, every time an app finishes, RM will check if the CancelTokensWhenComplete flag is set, and ignore the cancelation of this app if the flag is set.
As I understand it, the original code implicitly had the concept of a first job because the tokens were stored in a Set instead of a Map. Once the token was stashed in the set, subsequent attempts from sub-jobs to store the token would silently be ignored because the token was already in the set. Since DelegationTokenToRenew only hashes and checks the underlying token, the difference between shouldCancelAtEnd is ignored and therefore lost when the first job's token is already in the set.
In the new code, the DelegationTokenToRenew objects are kept in a map instead of a set, so we are no longer implicitly ignoring the same tokens from sub-jobs as we did in the past. This is what allows a sub-job to override the launcher job's request to avoid canceling the token.
bq. Are you seeing it on a cluster or is it a theory?
This is occurring on our 2.6 clusters. Our 2.5-based clusters do not exhibit the problem.
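To make the failure mode concrete, here is a minimal sketch of per-application tracking in which each app carries its own cancel flag for the same token. `PerAppTokenTracker` and its methods are hypothetical and simplified; the sketch only mirrors the behavior described in the comment, not the real `DelegationTokenRenewer` code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-app tracker: every app keeps its own view of a shared token. */
final class PerAppTokenTracker {
    // appId -> (token -> shouldCancelAtEnd as requested by that app)
    private final Map<String, Map<String, Boolean>> tokensByApp = new HashMap<>();
    private final Set<String> cancelled = new HashSet<>();

    void register(String appId, String token, boolean shouldCancelAtEnd) {
        tokensByApp.computeIfAbsent(appId, k -> new HashMap<>())
                   .put(token, shouldCancelAtEnd);
    }

    /** Cancels every token the finishing app asked to cancel, regardless of other apps. */
    void appFinished(String appId) {
        Map<String, Boolean> tokens = tokensByApp.remove(appId);
        if (tokens == null) {
            return;
        }
        tokens.forEach((token, cancel) -> {
            if (cancel) {
                cancelled.add(token);
            }
        });
    }

    boolean isCancelled(String token) {
        return cancelled.contains(token);
    }

    boolean isTracking(String appId) {
        return tokensByApp.containsKey(appId);
    }
}
```

If a launcher registers a token with the flag set to false and a sub-job registers the same token with the flag set to true, the sub-job's completion cancels the token while the launcher is still tracked as running.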
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248633#comment-14248633 ] Jian He commented on YARN-2964:
bq. the difference between shouldCancelAtEnd is ignored and therefore lost when the first job's token is already in the set.
One question: who is setting the shouldCancelAtEnd flag? Is it only the main job, or are all sub-jobs setting it?
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248669#comment-14248669 ] Jason Lowe commented on YARN-2964:
bq. One question: who is setting the shouldCancelAtEnd flag? Is it only the main job, or are all sub-jobs setting it?
AFAIK only the Oozie launcher job is requesting that tokens not be canceled at the end of the job. If all of the sub-jobs were also requesting that, then we wouldn't see the issue since nobody would cancel the token. I'm not sure all of the sub-jobs in all cases are asking for the token to be canceled at the end of the job, but in the current code it only takes one to spoil it for the others.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248939#comment-14248939 ] Jian He commented on YARN-2964:
The mapping was introduced for efficiency, so that removing tokens for a single application doesn't need to search all tokens in a global set. Maybe the quickest way to fix this is to change Oozie sub-jobs to set this flag. Anyway, I can work on a patch to fix this in DelegationTokenRenewer. Thanks for reporting this issue!
Maybe long-term we should have a group ID for a group of applications so that the token lifetime is tied to a group of applications instead of a single application.
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247400#comment-14247400 ] Daryn Sharp commented on YARN-2964: --- [~vinodkv], can you take a look at this? RM prematurely cancels tokens for jobs that submit jobs (oozie) --- Key: YARN-2964 URL: https://issues.apache.org/jira/browse/YARN-2964 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Daryn Sharp Priority: Critical The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job. As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM livelyness interval) after log aggregation completes. The result is an oozie job, ex. pig, that will launch many sub-jobs over time will fail if any sub-jobs are launched 10 min after any sub-job completes. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247444#comment-14247444 ] Vinod Kumar Vavilapalli commented on YARN-2964:
I checked the code and doubt there is a bug.
bq. The first job controlled the cancellation of the token.
Correct.
bq. This prevented completion of sub-jobs from canceling tokens used by the main job.
Only partially true. The more common case to avoid was the completion of the launcher job itself canceling tokens to be used by the sub-jobs.
bq. As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs.
AFAIR, this code never had the concept of a first job. An app submits tokens, there was a flat list of tokens, every time an app finishes, RM will check if the CancelTokensWhenComplete flag is set, and ignore the cancelation of this app if the flag is set. The token gets expired after 7 days. This continues to be the case even after YARN-2704.
bq. It also appears to schedule multiple redundant renewals.
Specific references?
bq. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed.
I doubt this issue happens at all. Are you seeing it on a cluster or is it a theory?
IAC, [~jianhe], can we write a test case that proves or disproves this?