[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970945#comment-14970945 ] Jason Lowe commented on YARN-4041: -- +1 for the latest patch, will commit this later today if there are no objections. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970579#comment-14970579 ] Hadoop QA commented on YARN-4041: -
\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 21m 9s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 9m 56s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 12m 11s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 3s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 42s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 30s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 66m 37s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 115m 17s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12768222/0005-YARN-4041.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 124a412 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9540/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9540/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9540/console |
This message was automatically generated.
> Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971998#comment-14971998 ] Hudson commented on YARN-4041: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #589 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/589/]) YARN-4041. Slow delegation token renewal can severely prolong RM (jlowe: rev d3a34a4f388155f6a7ef040e244ce7be788cd28b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Fix For: 2.7.2 > > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972200#comment-14972200 ] Hudson commented on YARN-4041: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #531 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/531/]) YARN-4041. Slow delegation token renewal can severely prolong RM (jlowe: rev d3a34a4f388155f6a7ef040e244ce7be788cd28b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Fix For: 2.7.2 > > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972103#comment-14972103 ] Hudson commented on YARN-4041: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1312 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1312/]) YARN-4041. Slow delegation token renewal can severely prolong RM (jlowe: rev d3a34a4f388155f6a7ef040e244ce7be788cd28b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Fix For: 2.7.2 > > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972117#comment-14972117 ] Hudson commented on YARN-4041: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2521 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2521/]) YARN-4041. Slow delegation token renewal can severely prolong RM (jlowe: rev d3a34a4f388155f6a7ef040e244ce7be788cd28b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Fix For: 2.7.2 > > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971852#comment-14971852 ] Hudson commented on YARN-4041: -- FAILURE: Integrated in Hadoop-trunk-Commit #8697 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8697/]) YARN-4041. Slow delegation token renewal can severely prolong RM (jlowe: rev d3a34a4f388155f6a7ef040e244ce7be788cd28b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Fix For: 2.7.2 > > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971966#comment-14971966 ] Hudson commented on YARN-4041: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #576 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/576/]) YARN-4041. Slow delegation token renewal can severely prolong RM (jlowe: rev d3a34a4f388155f6a7ef040e244ce7be788cd28b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Fix For: 2.7.2 > > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch, 0005-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969820#comment-14969820 ] Jason Lowe commented on YARN-4041: -- The problem with checking the renewer event queue directly is that the queue can be empty but processing has not yet completed. Threads can still be executing the last events, having just pulled them from the queue to leave it empty. Therefore the test is still racy. A simpler approach would be to just keep checking if the tokens are equal. If they aren't then sleep for a bit then try again, up to some limit of time to keep checking. By the way, we should not sleep an entire second between checks. All those seconds of waiting add up across all of our tests doing it, making it take significantly longer to run them overall. We should be sleeping for only 10ms or so. That's still a large amount of time for modern processors to get work done while we're waiting, and we still won't be spinning non-stop on the CPU. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
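The polling approach suggested in the comment above (re-check the condition, sleep roughly 10ms between checks, give up after a time limit) could look something like the sketch below. This is only an illustration and is not taken from any YARN-4041 patch: the class name `PollUtil`, the method `waitFor`, and its parameters are hypothetical, and the sketch assumes Java 8 or later for `BooleanSupplier`.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; not part of the YARN-4041 patch. */
final class PollUtil {
  private PollUtil() {}

  /**
   * Re-checks the condition every intervalMs milliseconds until it holds
   * or timeoutMs milliseconds have elapsed.
   *
   * @return true if the condition held before the deadline, false on
   *         timeout or if the waiting thread was interrupted
   */
  static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // time limit reached without the condition holding
      }
      try {
        Thread.sleep(intervalMs); // short sleep so we do not spin on the CPU
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        return false;
      }
    }
    return true;
  }
}
```

A test could then call something like `PollUtil.waitFor(() -> expected.equals(actual), 10, 5000)` and assert on the result. Because the check is on the observable outcome rather than on the renewer's event queue, it does not depend on whether worker threads are still processing events that have already been dequeued.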
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969105#comment-14969105 ] Sunil G commented on YARN-4041: --- The test case failures are not related. The tests pass locally. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967776#comment-14967776 ] Hadoop QA commented on YARN-4041: -
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 19m 20s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 9m 1s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 11m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 59s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 41s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 39s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 58m 36s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 104m 5s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesForCSWithPartitions |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12767832/0004-YARN-4041.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / e27c2ae |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9511/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9511/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9511/console |
This message was automatically generated.
> Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch, 0004-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966552#comment-14966552 ] Sunil G commented on YARN-4041: --- Thank you [~jlowe] for the comments. Yes, I was also planning to move this logic inside the waitForTokensToBeRenewed method, but the test failure occurred for only one test case, so I placed it outside. I think it's better to place that logic inside waitForTokensToBeRenewed itself, as suggested. I also did not much like the sleep-based solution; however, DelegationTokenRenewer does not explicitly expose a better checkpoint. I also thought of checking the event queue size there. Now I feel we can verify whether any token renewal event has been raised, which can be a good checkpoint. I will attach a patch for this along with fixes for the other comments. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964735#comment-14964735 ] Sunil G commented on YARN-4041: --- The test case failures look related; I will debug and check. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965864#comment-14965864 ] Jason Lowe commented on YARN-4041: -- Thanks for updating the patch, Sunil! When fixing the test, why wasn't the fix in waitForTokensToBeRenewed? Also I'm not thrilled with the idea of sleeping for 1 second per application and hoping it's enough time. And we're getting out early when there is at least one token in the token set, but there's a race where we may have taken a snapshot before all the tokens are there. Can't we key off the app start events coming out of the token renewal process to know when we're done? Would be nice if there were a more reliable way so we can avoid arbitrary sleeps (which tend to slow down unit tests overall) and racy tests. Also noticed on subsequent look that AbsrtactDelegationTokenRenewerAppEvent s/b AbstractDelegationTokenRenewerAppEvent. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
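The idea raised in the comment above, keying off completion events rather than sleeping a fixed amount per application, can be sketched in general terms with a `CountDownLatch`. This is a hedged illustration only: `StartEventWaiter`, `onAppStarted`, and `awaitAll` are hypothetical names, and how a test would actually observe app start events from the RM is not shown here and would depend on the real dispatcher and test harness.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical test helper: waits on observed events instead of sleeping. */
final class StartEventWaiter {
  private final CountDownLatch latch;

  /** @param expectedApps number of start events the test expects to see */
  StartEventWaiter(int expectedApps) {
    this.latch = new CountDownLatch(expectedApps);
  }

  /** To be called once per application when its start event is observed. */
  void onAppStarted() {
    latch.countDown();
  }

  /**
   * Blocks until every expected event has arrived or timeoutMs elapses.
   *
   * @return true if all events arrived in time, false on timeout or interrupt
   */
  boolean awaitAll(long timeoutMs) {
    try {
      return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }
}
```

Compared with a fixed sleep, the wait returns as soon as the last event arrives, and the timeout only bounds the failure case, so a passing test pays no arbitrary delay.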
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965071#comment-14965071 ] Hadoop QA commented on YARN-4041: -
\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 17m 26s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 58s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 32s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 51s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 57m 50s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 98m 36s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12767585/0003-YARN-4041.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 9cb5d35 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9490/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9490/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9490/console |
This message was automatically generated.
> Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, > 0003-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964093#comment-14964093 ] Hadoop QA commented on YARN-4041: -
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 17m 25s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 2s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 11m 7s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 1 new checkstyle issues (total was 150, now 149). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 37s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 57m 51s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 99m 38s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
| | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler |
| Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
| | org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12751320/0001-YARN-4041.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 6144e01 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9485/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9485/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9485/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9485/console |
This message was automatically generated.
> Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart.
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963934#comment-14963934 ] Jason Lowe commented on YARN-4041: -- Sorry for the delay. Looks good to me as well, kicking Jenkins to comment on the patch. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964534#comment-14964534 ]
Hadoop QA commented on YARN-4041: -
\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 28s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 32s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 52s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 30s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 58m 8s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 100m 40s | |
\\ \\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12767527/0002-YARN-4041.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7e2837f |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9488/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9488/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9488/console |
This message was automatically generated.
> Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953688#comment-14953688 ] Jian He commented on YARN-4041: --- looks good to me overall, hold on committing in case any comments from others. > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952161#comment-14952161 ] Sunil G commented on YARN-4041: --- Hi Bob, I have shared a patch which performs delegation token renewal asynchronously during recovery. I will rebase it against trunk now. In the meantime, [~jlowe], [~rohithsharma] and [~kasha], could you please take a look at this patch? > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951510#comment-14951510 ] Bob commented on YARN-4041: --- Hi, [~sunilg] , Any update or idea on this issue? > Slow delegation token renewal can severely prolong RM recovery > -- > > Key: YARN-4041 > URL: https://issues.apache.org/jira/browse/YARN-4041 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-4041.patch > > > When the RM does a work-preserving restart it synchronously tries to renew > delegation tokens for every active application. If a token server happens to > be down or is running slow and a lot of the active apps were using tokens > from that server then it can have a huge impact on the time it takes the RM > to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695316#comment-14695316 ] Jason Lowe commented on YARN-4041: -- bq. IIRR, synchronous recovery was to fail-fast if recovery doesn't work. With the proposed change, what happens when the recovery fails? Arguably the same thing that happens when the RM goes to renew tokens on a live application and fails without a restart. IIRC this is not fatal to either the RM or the application when this occurs today. In general I think we should make restarting as orthogonal as possible to token renewals, and ideally RM restart should not cause an out-of-band token renewal storm. Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694730#comment-14694730 ] Karthik Kambatla commented on YARN-4041: IIRR, synchronous recovery was to fail-fast if recovery doesn't work. With the proposed change, what happens when the recovery fails? Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681226#comment-14681226 ] Rohith Sharma K S commented on YARN-4041: - One correction to my previous comment: it is *NOT 8 minutes*, it is *8-10 seconds* per application. So {{8 seconds * 60 apps = 480 seconds, i.e. 8 minutes}} Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
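Editor's note: the corrected figure in the comment above (8-10 seconds per application, about 60 applications, renewed one after another) can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model; the class and method names are hypothetical and nothing here is ResourceManager code.

```java
// Hypothetical estimate of the delay that serial token renewal adds to recovery.
// Assumes every application is renewed one after another and that each renewal
// takes the same amount of time, as in the test-cluster report above.
final class RecoveryDelayEstimate {

    /** Total seconds spent renewing when applications are processed serially. */
    static long serialSeconds(int apps, long secondsPerRenewal) {
        return apps * secondsPerRenewal;
    }

    public static void main(String[] args) {
        long low = serialSeconds(60, 8);   // lower bound reported: 8 s per app
        long high = serialSeconds(60, 10); // upper bound reported: 10 s per app
        System.out.println(low + " s (" + low / 60 + " min) to "
                + high + " s (" + high / 60 + " min)");
    }
}
```

With these inputs the serial delay comes to between 480 and 600 seconds, which matches the 8 minutes quoted in the correction and grows linearly with the number of active applications.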
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680515#comment-14680515 ] Jason Lowe commented on YARN-4041: -- The active apps already have the tokens and are running on the cluster, so I'm not sure why it's so pressing that we synchronously process token renewal upon recovery. This should be made asynchronous, or even better, we shouldn't do any renewals just because we restarted. Ideally the RM should be tracking when tokens need to be renewed and renew them at that point. If we restart and some tokens are due for a renewal then we should go ahead and renew those, but I don't think the RM should blindly renew all tokens for apps that are already active and running on the cluster when it restarts. Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
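Editor's note: the idea in the comment above, renewing after a restart only those tokens that are actually due rather than every token of every active application, might be sketched as follows. This is an illustration under stated assumptions, not the RM's implementation: the class name, the map of token ids to next-renewal timestamps, and the slack window are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: selects the tokens that are due for renewal at restart.
// Assumes the caller has already recovered, for each token id, the time in
// epoch milliseconds at which that token next needs to be renewed.
final class DueTokenFilter {

    /** Returns ids of tokens whose next renewal time is at or before now + slack. */
    static List<String> tokensDue(Map<String, Long> nextRenewalMillis,
                                  long nowMillis, long slackMillis) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> entry : nextRenewalMillis.entrySet()) {
            if (entry.getValue() <= nowMillis + slackMillis) {
                due.add(entry.getKey());
            }
        }
        return due;
    }
}
```

Tokens that are not yet due would simply keep their existing renewal schedule, so a restart by itself would not trigger a burst of renewals against the token server.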
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680620#comment-14680620 ] Sunil G commented on YARN-4041: --- Could we handle this asynchronously and use {{DelegationTokenRenewerRunnable}} to renew tokens if needed? A new event type can be added as below:
{code}
enum DelegationTokenRenewerEventType {
  VERIFY_AND_START_APPLICATION,
+ RECOVER_APPLICATION,
  FINISH_APPLICATION
}
{code}
We can then handle this recover event in {{DelegationTokenRenewer}} to decide whether to renew the token. Would that be fine? Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
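Editor's note: the proposal above, dispatching a recover event so that renewal happens off the recovery path, might look roughly like the sketch below. It is a self-contained illustration and not the Hadoop {{DelegationTokenRenewer}}: the class, the `Renewer` interface, the pool size and the counters are all hypothetical, and only the three event type names come from the comment. A failed renewal is logged and counted rather than propagated, on the assumption discussed elsewhere in this thread that a renewal failure need not be fatal.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of asynchronous token renewal during recovery.
final class AsyncRenewSketch {

    // Event type names follow the enum proposed in the comment above.
    enum EventType { VERIFY_AND_START_APPLICATION, RECOVER_APPLICATION, FINISH_APPLICATION }

    /** Stand-in for whatever actually contacts the token server (assumption). */
    interface Renewer {
        void renew(String appId) throws Exception;
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Renewer renewer;
    final AtomicInteger renewed = new AtomicInteger();
    final AtomicInteger failed = new AtomicInteger();

    AsyncRenewSketch(Renewer renewer) {
        this.renewer = renewer;
    }

    /** Queues the renewal and returns at once, so the recovering caller is not blocked. */
    void handle(EventType type, String appId) {
        if (type != EventType.RECOVER_APPLICATION) {
            return; // handling of the other event types is omitted from this sketch
        }
        pool.execute(() -> {
            try {
                renewer.renew(appId);
                renewed.incrementAndGet();
            } catch (Exception e) {
                // A slow or unreachable token server affects only this task.
                failed.incrementAndGet();
                System.err.println("Renewal failed for " + appId + ": " + e.getMessage());
            }
        });
    }

    /** Stops accepting work and waits for queued renewals; returns false on timeout. */
    boolean shutdownAndWait(long timeoutMillis) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The point of the design is that `handle` costs the recovery thread only an enqueue, so the time to process a restart no longer depends on how quickly the token server responds.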
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680907#comment-14680907 ] Jian He commented on YARN-4041: --- YARN-2010 was done to actually ignore the failure on token renewal for recovery. Now I agree that we do not even need to do the renewal. Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680904#comment-14680904 ] Jian He commented on YARN-4041: --- bq. I don't think the RM should blindly renew all tokens for apps that are already active and running on the cluster when it restarts. I agree with this. We do not need to renew tokens for apps on recovery. Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681217#comment-14681217 ] Rohith Sharma K S commented on YARN-4041: - We recently faced a similar issue in a test cluster where around 60 apps were running. On RM switch, each application took around 8 minutes to renew its delegation token, which is 8 min * 60 apps = 480 minutes for recovery. YARN-3639 is the issue raised for the same. Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680730#comment-14680730 ] Jason Lowe commented on YARN-4041: -- Maybe. The synchronous recovery was added as part of YARN-2010, and I don't recall from that JIRA why it was crucial for the token renewal process to be performed synchronously during recovery. [~jianhe] or [~kasha] do you see any issues with making the delegation token renewal asynchronous for active applications during RM recovery? Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4041) Slow delegation token renewal can severely prolong RM recovery
[ https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681048#comment-14681048 ] Sunil G commented on YARN-4041: --- I would like to take this. [~jlowe], could I take this over? Slow delegation token renewal can severely prolong RM recovery -- Key: YARN-4041 URL: https://issues.apache.org/jira/browse/YARN-4041 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the RM does a work-preserving restart it synchronously tries to renew delegation tokens for every active application. If a token server happens to be down or is running slow and a lot of the active apps were using tokens from that server then it can have a huge impact on the time it takes the RM to process the restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)