[jira] [Commented] (YARN-11353) Change debug logs in FSDownload.java to info logs for better escalations debugging
[ https://issues.apache.org/jira/browse/YARN-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624551#comment-17624551 ] Akira Ajisaka commented on YARN-11353: -- {quote}Can you please add Sanjay Kumar Sahu as contributor. {quote} Done. > Change debug logs in FSDownload.java to info logs for better escalations > debugging > -- > > Key: YARN-11353 > URL: https://issues.apache.org/jira/browse/YARN-11353 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 >Reporter: Sanjay Kumar Sahu >Assignee: Sanjay Kumar Sahu >Priority: Major > Labels: pull-request-available > > The AM was stuck in the Preparing Local resources step, timed out, and never > started the driver. This happened on one of a customer's clusters and was only > resolved when that cluster was deleted and the customer moved to another cluster. > The existing logs were not enough to investigate the issue. Adding more info > logs will help show when the download of the files started and ended, and whether > execution actually reached that step; adding the containerId also shows which > container is doing the downloading. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
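A minimal sketch of the kind of change proposed here, not the actual pull request: it assumes an SLF4J logger like the one FSDownload already uses, and the class and parameter names below (LocalizerLoggingSketch, containerId, resourcePath) are illustrative rather than the real FSDownload members.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged illustration only: promote download start/end logging from DEBUG to
// INFO and include the requesting container id, so escalations can see when
// localization began, when it finished, and which container asked for it.
public class LocalizerLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(LocalizerLoggingSketch.class);

  public void download(String containerId, String resourcePath) {
    // Before: LOG.debug("Starting to download " + resourcePath);
    LOG.info("Starting to download {} for container {}", resourcePath, containerId);
    long start = System.currentTimeMillis();
    // ... copy / unpack of the local resource would happen here ...
    LOG.info("Finished downloading {} for container {} in {} ms",
        resourcePath, containerId, System.currentTimeMillis() - start);
  }
}
{code}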
[jira] [Assigned] (YARN-11353) Change debug logs in FSDownload.java to info logs for better escalations debugging
[ https://issues.apache.org/jira/browse/YARN-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-11353: Assignee: Sanjay Kumar Sahu > Change debug logs in FSDownload.java to info logs for better escalations > debugging > -- > > Key: YARN-11353 > URL: https://issues.apache.org/jira/browse/YARN-11353 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 >Reporter: Sanjay Kumar Sahu >Assignee: Sanjay Kumar Sahu >Priority: Major > Labels: pull-request-available > > The AM was stuck in the Preparing Local resources step, timed out, and never > started the driver. This happened on one of a customer's clusters and was only > resolved when that cluster was deleted and the customer moved to another cluster. > The existing logs were not enough to investigate the issue. Adding more info > logs will help show when the download of the files started and ended, and whether > execution actually reached that step; adding the containerId also shows which > container is doing the downloading. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11187) Remove WhiteBox in yarn module.
[ https://issues.apache.org/jira/browse/YARN-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11187. -- Fix Version/s: 3.4.0 Resolution: Fixed Committed to trunk. Thank you [~slfan1989] for your contribution. > Remove WhiteBox in yarn module. > --- > > Key: YARN-11187 > URL: https://issues.apache.org/jira/browse/YARN-11187 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0, 3.3.5 >Reporter: fanshilun >Assignee: fanshilun >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6868) Add test scope to certain entries in hadoop-yarn-server-resourcemanager pom.xml
[ https://issues.apache.org/jira/browse/YARN-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-6868: Fix Version/s: 2.10.0 3.0.0-beta1 Added missing fixed versions. > Add test scope to certain entries in hadoop-yarn-server-resourcemanager > pom.xml > --- > > Key: YARN-6868 > URL: https://issues.apache.org/jira/browse/YARN-6868 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0-beta1 >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Major > Fix For: 3.0.0-beta1, 2.10.0, 2.9.1 > > Attachments: YARN-6868.001.patch > > > The tag > {noformat} > <scope>test</scope> > {noformat} > is missing from a few entries in the pom.xml for > hadoop-yarn-server-resourcemanager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
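For context, Maven restricts a dependency to the test classpath with a {{<scope>test</scope>}} element inside its {{<dependency>}} entry. The snippet below is an illustration only, using hadoop-minikdc as a stand-in artifact; it is not the exact entries changed by YARN-6868.001.patch.
{code:xml}
<!-- Illustrative example only: a dependency limited to the test classpath. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-minikdc</artifactId>
  <scope>test</scope>
</dependency>
{code}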
[jira] [Assigned] (YARN-10662) [JDK 11] TestTimelineReaderWebServicesHBaseStorage fails
[ https://issues.apache.org/jira/browse/YARN-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-10662: Assignee: (was: Akira Ajisaka) > [JDK 11] TestTimelineReaderWebServicesHBaseStorage fails > > > Key: YARN-10662 > URL: https://issues.apache.org/jira/browse/YARN-10662 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Akira Ajisaka >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > [https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java11-linux-x86_64/131/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-hbase-tests.txt] > {noformat} > [INFO] Running > org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.515 > s <<< FAILURE! - in > org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage > [ERROR] > org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage > Time elapsed: 1.514 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage.setupBeforeClass(TestTimelineReaderWebServicesHBaseStorage.java:84) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Caused by: java.lang.RuntimeException: java.io.IOException: Can not attach to > current VM > at > mockit.internal.startup.AgentLoader.attachToRunningVM(AgentLoader.java:150) > at mockit.internal.startup.AgentLoader.loadAgent(AgentLoader.java:60) > at > mockit.internal.startup.Startup.verifyInitialization(Startup.java:169) > at mockit.MockUp.(MockUp.java:94) > ... 
19 more > Caused by: java.io.IOException: Can not attach to current VM > at > jdk.attach/sun.tools.attach.HotSpotVirtualMachine.(HotSpotVirtualMachine.java:75) > at > jdk.attach/sun.tools.attach.VirtualMachineImpl.(VirtualMachineImpl.java:56) > at > jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58) > at > jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207) > at > mockit.internal.startup.AgentLoader.attachToRunningVM(AgentLoader.java:144) > ... 22 more > [INFO] > [INFO] Results: > [INFO] > [ERROR] Errors: > [ERROR] TestTimelineReaderWebServicesHBaseStorage.setupBeforeClass:84 > ExceptionInInitializer > [INFO] > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0 {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11301) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-11269
[ https://issues.apache.org/jira/browse/YARN-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11301: - Component/s: test > Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory > after YARN-11269 > --- > > Key: YARN-11301 > URL: https://issues.apache.org/jira/browse/YARN-11301 > Project: Hadoop YARN > Issue Type: Bug > Components: test, timelineserver >Affects Versions: 3.4.0 >Reporter: fanshilun >Assignee: fanshilun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > After the Apache Hadoop YARN Timeline Plugin Storage module was upgraded from > junit 4 to 5, a compilation error occurred. > {code:java} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1: > test (default-test) on project hadoop-yarn-server-timeline-pluginstorage: > Execution default-test of goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: > java.lang.NoClassDefFoundError: > org/junit/platform/launcher/core/LauncherFactory: > org.junit.platform.launcher.core.LauncherFactory -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11301) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-11269
[ https://issues.apache.org/jira/browse/YARN-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11301. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Committed to trunk. Thank you [~slfan1989] for your fix. > Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory > after YARN-11269 > --- > > Key: YARN-11301 > URL: https://issues.apache.org/jira/browse/YARN-11301 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 3.4.0 >Reporter: fanshilun >Assignee: fanshilun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > After the Apache Hadoop YARN Timeline Plugin Storage module was upgraded from > junit 4 to 5, a compilation error occurred. > {code:java} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1: > test (default-test) on project hadoop-yarn-server-timeline-pluginstorage: > Execution default-test of goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: > java.lang.NoClassDefFoundError: > org/junit/platform/launcher/core/LauncherFactory: > org.junit.platform.launcher.core.LauncherFactory -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11241) Add uncleaning option for local app log file with log-aggregation enabled
[ https://issues.apache.org/jira/browse/YARN-11241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11241. -- Fix Version/s: 3.4.0 3.3.9 Resolution: Fixed Committed to trunk and branch-3.3. Thank you [~groot] for your contribution! > Add uncleaning option for local app log file with log-aggregation enabled > - > > Key: YARN-11241 > URL: https://issues.apache.org/jira/browse/YARN-11241 > Project: Hadoop YARN > Issue Type: New Feature > Components: log-aggregation >Reporter: groot >Assignee: groot >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.9 > > Time Spent: 20m > Remaining Estimate: 0h > > Add uncleaning option for local app log file with log-aggregation enabled > This will be helpful for debugging purpose. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6939) Upgrade JUnit from 4 to 5 in hadoop-yarn
[ https://issues.apache.org/jira/browse/YARN-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597840#comment-17597840 ] Akira Ajisaka commented on YARN-6939: - After discussing with groot offline, now I'm okay with adding the junit-platform-launcher dependency for each PR. It will take long time to test locally. > Upgrade JUnit from 4 to 5 in hadoop-yarn > > > Key: YARN-6939 > URL: https://issues.apache.org/jira/browse/YARN-6939 > Project: Hadoop YARN > Issue Type: Test >Reporter: Akira Ajisaka >Assignee: groot >Priority: Major > > Feel free to create sub-tasks for each module. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
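For reference, a sketch of what that per-module addition typically looks like, assuming the standard JUnit 5 coordinates and a version managed by the parent pom; the exact entries in each PR may differ.
{code:xml}
<!-- Sketch only: makes org.junit.platform.launcher.core.LauncherFactory
     available to surefire at test time. Version is assumed to be managed
     by the parent pom's dependencyManagement. -->
<dependency>
  <groupId>org.junit.platform</groupId>
  <artifactId>junit-platform-launcher</artifactId>
  <scope>test</scope>
</dependency>
{code}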
[jira] [Commented] (YARN-6939) Upgrade JUnit from 4 to 5 in hadoop-yarn
[ https://issues.apache.org/jira/browse/YARN-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597790#comment-17597790 ] Akira Ajisaka commented on YARN-6939: - Thank you [~groot]. For the contributors, in addition, I want to make sure we won't make any regressions like YARN-11287. Could you run {{mvn test}} under hadoop-yarn-project/hadoop-yarn directory and paste the result in the PR? > Upgrade JUnit from 4 to 5 in hadoop-yarn > > > Key: YARN-6939 > URL: https://issues.apache.org/jira/browse/YARN-6939 > Project: Hadoop YARN > Issue Type: Test >Reporter: Akira Ajisaka >Assignee: groot >Priority: Major > > Feel free to create sub-tasks for each module. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11287) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-10793
[ https://issues.apache.org/jira/browse/YARN-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11287: - Summary: Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-10793 (was: Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory After YARN-10793) > Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory > after YARN-10793 > --- > > Key: YARN-11287 > URL: https://issues.apache.org/jira/browse/YARN-11287 > Project: Hadoop YARN > Issue Type: Bug > Components: build, test >Affects Versions: 3.4.0 >Reporter: fanshilun >Assignee: fanshilun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > After executing the yarn-project global unit test, I found the following > error: > {code:java} > ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) > on project hadoop-yarn-server-applicationhistoryservice: Execution > default-test of goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: > java.lang.NoClassDefFoundError: > org/junit/platform/launcher/core/LauncherFactory: > org.junit.platform.launcher.core.LauncherFactory -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :hadoop-yarn-server-applicationhistoryservice {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11287) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory After YARN-10793
[ https://issues.apache.org/jira/browse/YARN-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11287: - Component/s: build test > Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory > After YARN-10793 > --- > > Key: YARN-11287 > URL: https://issues.apache.org/jira/browse/YARN-11287 > Project: Hadoop YARN > Issue Type: Bug > Components: build, test >Affects Versions: 3.4.0 >Reporter: fanshilun >Assignee: fanshilun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > After executing the yarn-project global unit test, I found the following > error: > {code:java} > ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) > on project hadoop-yarn-server-applicationhistoryservice: Execution > default-test of goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: > java.lang.NoClassDefFoundError: > org/junit/platform/launcher/core/LauncherFactory: > org.junit.platform.launcher.core.LauncherFactory -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :hadoop-yarn-server-applicationhistoryservice {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11287) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory After YARN-10793
[ https://issues.apache.org/jira/browse/YARN-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11287. -- Fix Version/s: 3.4.0 Resolution: Fixed Committed to trunk. Thank you [~slfan1989] for reporting and fixing this issue. > Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory > After YARN-10793 > --- > > Key: YARN-11287 > URL: https://issues.apache.org/jira/browse/YARN-11287 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: fanshilun >Assignee: fanshilun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > After executing the yarn-project global unit test, I found the following > error: > {code:java} > ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) > on project hadoop-yarn-server-applicationhistoryservice: Execution > default-test of goal > org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: > java.lang.NoClassDefFoundError: > org/junit/platform/launcher/core/LauncherFactory: > org.junit.platform.launcher.core.LauncherFactory -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :hadoop-yarn-server-applicationhistoryservice {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6939) Upgrade JUnit from 4 to 5 in hadoop-yarn
[ https://issues.apache.org/jira/browse/YARN-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597689#comment-17597689 ] Akira Ajisaka commented on YARN-6939: - If you have some PRs for upgrading JUnit version, please ping or mention me in the PR. I can review later when I have time. > Upgrade JUnit from 4 to 5 in hadoop-yarn > > > Key: YARN-6939 > URL: https://issues.apache.org/jira/browse/YARN-6939 > Project: Hadoop YARN > Issue Type: Test >Reporter: Akira Ajisaka >Assignee: groot >Priority: Major > > Feel free to create sub-tasks for each module. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11254) hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml
[ https://issues.apache.org/jira/browse/YARN-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11254. -- Fix Version/s: 3.4.0 Resolution: Fixed Committed to trunk. Thank you [~clara0]! > hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml > -- > > Key: YARN-11254 > URL: https://issues.apache.org/jira/browse/YARN-11254 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Clara Fang >Assignee: Clara Fang >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > The dependency hadoop-minikdc is defined twice in > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml > {code:xml} > <dependency> > <groupId>org.apache.hadoop</groupId> > <artifactId>hadoop-minikdc</artifactId> > <scope>test</scope> > </dependency> > <dependency> > <groupId>org.apache.hadoop</groupId> > <artifactId>hadoop-minikdc</artifactId> > <scope>test</scope> > </dependency> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11254) hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager
[ https://issues.apache.org/jira/browse/YARN-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11254: - Summary: hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager (was: hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml) > hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager > -- > > Key: YARN-11254 > URL: https://issues.apache.org/jira/browse/YARN-11254 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Clara Fang >Assignee: Clara Fang >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > The dependency hadoop-minikdc is defined twice in > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml > {code:xml} > <dependency> > <groupId>org.apache.hadoop</groupId> > <artifactId>hadoop-minikdc</artifactId> > <scope>test</scope> > </dependency> > <dependency> > <groupId>org.apache.hadoop</groupId> > <artifactId>hadoop-minikdc</artifactId> > <scope>test</scope> > </dependency> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11254) hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml
[ https://issues.apache.org/jira/browse/YARN-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-11254: Assignee: Clara Fang > hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml > -- > > Key: YARN-11254 > URL: https://issues.apache.org/jira/browse/YARN-11254 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Clara Fang >Assignee: Clara Fang >Priority: Minor > Labels: pull-request-available > > The dependency hadoop-minikdc is defined twice in > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml > {code:xml} > <dependency> > <groupId>org.apache.hadoop</groupId> > <artifactId>hadoop-minikdc</artifactId> > <scope>test</scope> > </dependency> > <dependency> > <groupId>org.apache.hadoop</groupId> > <artifactId>hadoop-minikdc</artifactId> > <scope>test</scope> > </dependency> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11257) Add junit5 dependency to hadoop-yarn-server-timeline-pluginstorage to fix few unit test failure
[ https://issues.apache.org/jira/browse/YARN-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11257. -- Resolution: Duplicate This issue has been fixed by YARN-11269. Closing. > Add junit5 dependency to hadoop-yarn-server-timeline-pluginstorage to fix few > unit test failure > --- > > Key: YARN-11257 > URL: https://issues.apache.org/jira/browse/YARN-11257 > Project: Hadoop YARN > Issue Type: Bug >Reporter: groot >Assignee: groot >Priority: Major > Labels: pull-request-available > > We need to add Junit 5 dependency in > {code:java} > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/pom.xml{code} > as the TestLevelDBCacheTimelineStore is extending TimelineStoreTestUtils and > we have already upgraded from Junit 4 to 5 in TimelineStoreTestUtils. > > Failing UTs > -[https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/957/testReport/junit/org.apache.hadoop.yarn.server.timeline/TestLevelDBCacheTimelineStore/testGetDomains/] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10793) Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice
[ https://issues.apache.org/jira/browse/YARN-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582217#comment-17582217 ] Akira Ajisaka commented on YARN-10793: -- Thank you [~ayushtkn] and [~groot]. I'm sorry for that. Hi [~groot], could you file a jira and add the junit5 dependency to hadoop-yarn-server-timeline-pluginstorage module to quickly fix this issue? I can review and commit it. Upgrading from junit 4 to 5 in the module can be done separately. > Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice > - > > Key: YARN-10793 > URL: https://issues.apache.org/jira/browse/YARN-10793 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: ANANDA G B >Assignee: groot >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11248) Add unit test for FINISHED_CONTAINERS_PULLED_BY_AM event on DECOMMISSIONING
[ https://issues.apache.org/jira/browse/YARN-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11248. -- Fix Version/s: 3.4.0 3.3.9 Resolution: Fixed Committed to trunk and branch-3.3. Thank you [~groot] for your contribution. > Add unit test for FINISHED_CONTAINERS_PULLED_BY_AM event on DECOMMISSIONING > --- > > Key: YARN-11248 > URL: https://issues.apache.org/jira/browse/YARN-11248 > Project: Hadoop YARN > Issue Type: Test > Components: test >Affects Versions: 3.3.3 >Reporter: groot >Assignee: groot >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.9 > > > Add unit test for FINISHED_CONTAINERS_PULLED_BY_AM event on DECOMMISSIONING -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10793) Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice
[ https://issues.apache.org/jira/browse/YARN-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-10793. -- Fix Version/s: 3.4.0 Resolution: Fixed Committed to trunk. Thank you [~groot] for your contribution. > Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice > - > > Key: YARN-10793 > URL: https://issues.apache.org/jira/browse/YARN-10793 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: ANANDA G B >Assignee: groot >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception
[ https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-11210: Assignee: Kevin Wikant > Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration > exception > -- > > Key: YARN-11210 > URL: https://issues.apache.org/jira/browse/YARN-11210 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > h2. Description of Problem > Applications which call YARN RMAdminCLI (i.e. YARN ResourceManager client) > synchronously can be blocked for up to 15 minutes with the default > configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an > issue in of itself, but there is a non-retryable IllegalArgumentException > exception thrown within the YARN ResourceManager client that is getting > swallowed & treated as a retryable "connection exception" meaning that it > gets retried for 15 minutes. > The purpose of this JIRA (and PR) is to modify the YARN client so that it > does not retry on this non-retryable exception. > h2. Background Information > YARN ResourceManager client treats connection exceptions as retryable & with > the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt > to connect to the ResourceManager for up to 15 minutes when facing > "connection exceptions". This arguably makes sense because connection > exceptions are in some cases transient & can be recovered from without any > action needed from the client. See example below where YARN ResourceManager > client was able to recover from connection issues that resulted from the > ResourceManager process being down. > {quote}> yarn rmadmin -refreshNodes > 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. 
Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over > null after 1 failover attempts. Trying to failover after sleeping for 41061ms. > 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtoco
[jira] [Commented] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception
[ https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571183#comment-17571183 ] Akira Ajisaka commented on YARN-11210: -- [~prabhujoseph] Yes. Done. I think all the Hadoop committers have the privilege to make someone to contributor. You can go to https://issues.apache.org/jira/plugins/servlet/project-config/YARN/roles and add Kevin as contributor. > Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration > exception > -- > > Key: YARN-11210 > URL: https://issues.apache.org/jira/browse/YARN-11210 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > h2. Description of Problem > Applications which call YARN RMAdminCLI (i.e. YARN ResourceManager client) > synchronously can be blocked for up to 15 minutes with the default > configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an > issue in of itself, but there is a non-retryable IllegalArgumentException > exception thrown within the YARN ResourceManager client that is getting > swallowed & treated as a retryable "connection exception" meaning that it > gets retried for 15 minutes. > The purpose of this JIRA (and PR) is to modify the YARN client so that it > does not retry on this non-retryable exception. > h2. Background Information > YARN ResourceManager client treats connection exceptions as retryable & with > the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt > to connect to the ResourceManager for up to 15 minutes when facing > "connection exceptions". This arguably makes sense because connection > exceptions are in some cases transient & can be recovered from without any > action needed from the client. See example below where YARN ResourceManager > client was able to recover from connection issues that resulted from the > ResourceManager process being down. > {quote}> yarn rmadmin -refreshNodes > 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. 
Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over > null after 1 failover attempts. Trying to failover after sleeping for 41061ms. > 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:28 INFO retry.Retr
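The general approach described in YARN-11210 above, failing fast on an exception that retrying cannot fix, can be expressed with Hadoop's {{RetryPolicies}} helpers. The sketch below is an illustration of that idea under assumed wiring (the wait and sleep values are illustrative defaults), not the actual RMAdminCLI/RMProxy change in the PR.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Hedged sketch: keep the long connection-retry behaviour as the default, but
// map a non-retryable IllegalArgumentException (e.g. a bad kerberos principal
// pattern) to TRY_ONCE_THEN_FAIL so the client fails immediately instead of
// retrying for up to 15 minutes.
public class FailFastRetryPolicySketch {
  public static RetryPolicy create() {
    // Default: retry connection problems for up to 15 minutes, every 30 seconds.
    RetryPolicy connectionRetry = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        15 * 60 * 1000L, 30 * 1000L, TimeUnit.MILLISECONDS);
    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicy = new HashMap<>();
    exceptionToPolicy.put(IllegalArgumentException.class,
        RetryPolicies.TRY_ONCE_THEN_FAIL);
    return RetryPolicies.retryByException(connectionRetry, exceptionToPolicy);
  }
}
{code}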
[jira] [Assigned] (YARN-6946) Upgrade JUnit from 4 to 5 in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-6946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-6946: --- Assignee: (was: Akira Ajisaka) > Upgrade JUnit from 4 to 5 in hadoop-yarn-common > --- > > Key: YARN-6946 > URL: https://issues.apache.org/jira/browse/YARN-6946 > Project: Hadoop YARN > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Priority: Major > Attachments: YARN-6946.wip001.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11119) Backport YARN-10538 to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-11119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11119. -- Fix Version/s: 2.10.3 Resolution: Fixed Committed to branch-2.10. Thank you [~groot] for your contribution. > Backport YARN-10538 to branch-2.10 > -- > > Key: YARN-11119 > URL: https://issues.apache.org/jira/browse/YARN-11119 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0 >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 2.10.3 > > Time Spent: 2h > Remaining Estimate: 0h > > Backport YARN-10538 to branch-2.10 > > Add recommissioning nodes to the list of updated nodes returned to the AM -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10303) One yarn rest api example of yarn document is error
[ https://issues.apache.org/jira/browse/YARN-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-10303. -- Resolution: Fixed > One yarn rest api example of yarn document is error > --- > > Key: YARN-10303 > URL: https://issues.apache.org/jira/browse/YARN-10303 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.1.1, 3.2.1 >Reporter: bright.zhou >Assignee: Ashutosh Gupta >Priority: Minor > Labels: documentation, newbie, pull-request-available > Fix For: 3.4.0 > > Attachments: image-2020-06-02-10-27-35-020.png > > Time Spent: 2h 20m > Remaining Estimate: 0h > > deSelects value should be resourceRequests > !image-2020-06-02-10-27-35-020.png! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
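For reference, the corrected documentation example discussed in this issue uses deSelects=resourceRequests on the Cluster Applications API; the host and port below are placeholders.
{noformat}
GET http://<rm-http-address>:<port>/ws/v1/cluster/apps?deSelects=resourceRequests
{noformat}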
[jira] [Reopened] (YARN-10303) One yarn rest api example of yarn document is error
[ https://issues.apache.org/jira/browse/YARN-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reopened YARN-10303: -- > One yarn rest api example of yarn document is error > --- > > Key: YARN-10303 > URL: https://issues.apache.org/jira/browse/YARN-10303 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.1.1, 3.2.1 >Reporter: bright.zhou >Assignee: Ashutosh Gupta >Priority: Minor > Labels: documentation, newbie, pull-request-available > Fix For: 3.4.0 > > Attachments: image-2020-06-02-10-27-35-020.png > > Time Spent: 2h 20m > Remaining Estimate: 0h > > deSelects value should be resourceRequests > !image-2020-06-02-10-27-35-020.png! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11128) Fix comments in TestProportionalCapacityPreemptionPolicy*
[ https://issues.apache.org/jira/browse/YARN-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11128. -- Fix Version/s: 3.4.0 3.3.4 Resolution: Fixed Committed to trunk and branch-3.3. Thanks! > Fix comments in TestProportionalCapacityPreemptionPolicy* > - > > Key: YARN-11128 > URL: https://issues.apache.org/jira/browse/YARN-11128 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, documentation >Reporter: groot >Assignee: groot >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.3.4 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > At various places, comment for appsConfig is > {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}} > but should be > {{// > queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10080: - Fix Version/s: 3.2.4 Backported to branch-3.2. > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Attachments: YARN-10080-001.patch, YARN-10080.002.patch > > Time Spent: 40m > Remaining Estimate: 0h > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
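A minimal sketch of the idea, not the committed patch: name each localizer pool thread after the application it serves so a jstack shows directly which app a busy thread belongs to. It assumes Guava's ThreadFactoryBuilder (already on Hadoop's classpath); the class name, pool shape, and name format below are illustrative.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.google.common.util.concurrent.ThreadFactoryBuilder;

// Hedged sketch: build a downloader pool whose thread names carry the app id.
public class LocalizerThreadNamingSketch {
  public static ExecutorService newLocalizerPool(String appId, int poolSize) {
    return Executors.newFixedThreadPool(poolSize,
        new ThreadFactoryBuilder()
            .setNameFormat("ContainerLocalizer Downloader " + appId + " #%d")
            .build());
  }
}
{code}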
[jira] [Updated] (YARN-11133) YarnClient gets the wrong EffectiveMinCapacity value
[ https://issues.apache.org/jira/browse/YARN-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11133: - Fix Version/s: 3.2.4 3.3.4 Backported to branch-3.3 and branch-3.2. > YarnClient gets the wrong EffectiveMinCapacity value > > > Key: YARN-11133 > URL: https://issues.apache.org/jira/browse/YARN-11133 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 3.2.3, 3.3.2 >Reporter: Zilong Zhu >Assignee: Zilong Zhu >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 2h > Remaining Estimate: 0h > > It calls the QueueConfigurations#getEffectiveMinCapacity to get the wrong > value when I use the YarnClient. I found some bugs with > QueueConfigurationsPBImpl#mergeLocalToBuilder. > {code:java} > private void mergeLocalToBuilder() { > if (this.effMinResource != null) { > builder > .setEffectiveMinCapacity(convertToProtoFormat(this.effMinResource)); > } > if (this.effMaxResource != null) { > builder > .setEffectiveMaxCapacity(convertToProtoFormat(this.effMaxResource)); > } > if (this.configuredMinResource != null) { > builder.setEffectiveMinCapacity( > convertToProtoFormat(this.configuredMinResource)); > } > if (this.configuredMaxResource != null) { > builder.setEffectiveMaxCapacity( > convertToProtoFormat(this.configuredMaxResource)); > } > } {code} > configuredMinResource was incorrectly assigned to effMinResource. This causes > the real effMinResource to be overwritten and configuredMinResource is null. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
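A sketch of the likely shape of the fix, assuming the generated QueueConfigurationsProto builder exposes matching setConfiguredMinCapacity/setConfiguredMaxCapacity setters; the merged patch is authoritative.
{code:java}
// Sketch only: write the configured resources to their own proto fields
// instead of overwriting the effective ones.
private void mergeLocalToBuilder() {
  if (this.effMinResource != null) {
    builder.setEffectiveMinCapacity(convertToProtoFormat(this.effMinResource));
  }
  if (this.effMaxResource != null) {
    builder.setEffectiveMaxCapacity(convertToProtoFormat(this.effMaxResource));
  }
  if (this.configuredMinResource != null) {
    builder.setConfiguredMinCapacity(convertToProtoFormat(this.configuredMinResource));
  }
  if (this.configuredMaxResource != null) {
    builder.setConfiguredMaxCapacity(convertToProtoFormat(this.configuredMaxResource));
  }
}
{code}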
[jira] [Assigned] (YARN-11133) YarnClient gets the wrong EffectiveMinCapacity value
[ https://issues.apache.org/jira/browse/YARN-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-11133: Assignee: Zilong Zhu > YarnClient gets the wrong EffectiveMinCapacity value > > > Key: YARN-11133 > URL: https://issues.apache.org/jira/browse/YARN-11133 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 3.2.3, 3.3.2 >Reporter: Zilong Zhu >Assignee: Zilong Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > It calls the QueueConfigurations#getEffectiveMinCapacity to get the wrong > value when I use the YarnClient. I found some bugs with > QueueConfigurationsPBImpl#mergeLocalToBuilder. > {code:java} > private void mergeLocalToBuilder() { > if (this.effMinResource != null) { > builder > .setEffectiveMinCapacity(convertToProtoFormat(this.effMinResource)); > } > if (this.effMaxResource != null) { > builder > .setEffectiveMaxCapacity(convertToProtoFormat(this.effMaxResource)); > } > if (this.configuredMinResource != null) { > builder.setEffectiveMinCapacity( > convertToProtoFormat(this.configuredMinResource)); > } > if (this.configuredMaxResource != null) { > builder.setEffectiveMaxCapacity( > convertToProtoFormat(this.configuredMaxResource)); > } > } {code} > configuredMinResource was incorrectly assigned to effMinResource. This causes > the real effMinResource to be overwritten and configuredMinResource is null. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11092) Upgrade jquery ui to 1.13.1
[ https://issues.apache.org/jira/browse/YARN-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11092: - Issue Type: Bug (was: Improvement) > Upgrade jquery ui to 1.13.1 > --- > > Key: YARN-11092 > URL: https://issues.apache.org/jira/browse/YARN-11092 > Project: Hadoop YARN > Issue Type: Bug >Reporter: D M Murali Krishna Reddy >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current jquery-ui version used(1.12.1) in the trunk has the following > vulnerabilities CVE-2021-41182, CVE-2021-41183, CVE-2021-41184, so we need to > upgrade to at least 1.13.0. > > Also currently for the UI2 we are using the shims repo which is not being > maintained as per the discussion > [https://github.com/components/jqueryui/issues/70] , so if possible we should > move to the main jquery repo [https://github.com/jquery/jquery-ui] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11092) Upgrade jquery ui to 1.13.1
[ https://issues.apache.org/jira/browse/YARN-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11092. -- Fix Version/s: 3.4.0 3.2.4 3.3.4 Resolution: Fixed Committed to trunk, branch-3.3, and branch-3.2. Thank you [~dmmkr] for the report and thank you [~groot] for your contribution. > Upgrade jquery ui to 1.13.1 > --- > > Key: YARN-11092 > URL: https://issues.apache.org/jira/browse/YARN-11092 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current jquery-ui version used(1.12.1) in the trunk has the following > vulnerabilities CVE-2021-41182, CVE-2021-41183, CVE-2021-41184, so we need to > upgrade to at least 1.13.0. > > Also currently for the UI2 we are using the shims repo which is not being > maintained as per the discussion > [https://github.com/components/jqueryui/issues/70] , so if possible we should > move to the main jquery repo [https://github.com/jquery/jquery-ui] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536774#comment-17536774 ] Akira Ajisaka commented on YARN-10080: -- I want to backport to branch-3.2, but the build is now broken by HDFS-16552. I'll backport this after the build is fixed. > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.4 > > Attachments: YARN-10080-001.patch, YARN-10080.002.patch > > Time Spent: 40m > Remaining Estimate: 0h > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11125) Backport YARN-6483 to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11125. -- Fix Version/s: 2.10.2 Resolution: Fixed Merged the PR into branch-2.10. Thank you [~groot] for your contribution. > Backport YARN-6483 to branch-2.10 > - > > Key: YARN-11125 > URL: https://issues.apache.org/jira/browse/YARN-11125 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 2.10.2 > > Time Spent: 1h > Remaining Estimate: 0h > > Backport YARN-6483 to branch-2.10 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11125) Backport YARN-6483 to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11125: - Issue Type: Improvement (was: Bug) > Backport YARN-6483 to branch-2.10 > - > > Key: YARN-11125 > URL: https://issues.apache.org/jira/browse/YARN-11125 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > Backport YARN-6483 to branch-2.10 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11125) Backport YARN-6483 to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11125: - Component/s: resourcemanager > Backport YARN-6483 to branch-2.10 > - > > Key: YARN-11125 > URL: https://issues.apache.org/jira/browse/YARN-11125 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > Backport YARN-6483 to branch-2.10 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) Avoid unnecessary preemption for tiny queues under certain corner cases
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536750#comment-17536750 ] Akira Ajisaka commented on YARN-11073: -- Cut YARN-11469 for regression tests. > Avoid unnecessary preemption for tiny queues under certain corner cases > --- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Assignee: Jian Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11073.tmp-1.patch > > Time Spent: 50m > Remaining Estimate: 0h > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11073) Avoid unnecessary preemption for tiny queues under certain corner cases
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536750#comment-17536750 ] Akira Ajisaka edited comment on YARN-11073 at 5/13/22 4:16 PM: --- Filed YARN-11149 for regression tests. was (Author: ajisakaa): Cut YARN-11469 for regression tests. > Avoid unnecessary preemption for tiny queues under certain corner cases > --- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Assignee: Jian Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11073.tmp-1.patch > > Time Spent: 50m > Remaining Estimate: 0h > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11149) Add regression test cases for YARN-11073
Akira Ajisaka created YARN-11149: Summary: Add regression test cases for YARN-11073 Key: YARN-11149 URL: https://issues.apache.org/jira/browse/YARN-11149 Project: Hadoop YARN Issue Type: Test Components: test Reporter: Akira Ajisaka Add regression test cases for YARN-11073 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11073) Avoid unnecessary preemption for tiny queues under certain corner cases
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11073: - Summary: Avoid unnecessary preemption for tiny queues under certain corner cases (was: CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues) > Avoid unnecessary preemption for tiny queues under certain corner cases > --- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Assignee: Jian Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11073.tmp-1.patch > > Time Spent: 50m > Remaining Estimate: 0h > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11073. -- Fix Version/s: 3.4.0 Resolution: Fixed Merged the PR into trunk. Let's add test cases in a separate JIRA. > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Assignee: Jian Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11073.tmp-1.patch > > Time Spent: 50m > Remaining Estimate: 0h > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11128) Fix comments in TestProportionalCapacityPreemptionPolicy*
[ https://issues.apache.org/jira/browse/YARN-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11128: - Component/s: documentation Issue Type: Bug (was: New Feature) > Fix comments in TestProportionalCapacityPreemptionPolicy* > - > > Key: YARN-11128 > URL: https://issues.apache.org/jira/browse/YARN-11128 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, documentation >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Minor > > At various places, comment for appsConfig is > {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}} > but should be > {{// > queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11128) Fix comments in TestProportionalCapacityPreemptionPolicy*
[ https://issues.apache.org/jira/browse/YARN-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11128: - Description: At various places, comment for appsConfig is {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}} but should be {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}} was: At various places, comment for appsConfig is `// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)` but should be `// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)` > Fix comments in TestProportionalCapacityPreemptionPolicy* > - > > Key: YARN-11128 > URL: https://issues.apache.org/jira/browse/YARN-11128 > Project: Hadoop YARN > Issue Type: New Feature > Components: capacityscheduler >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Minor > > At various places, comment for appsConfig is > {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}} > but should be > {{// > queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11116) Migrate Times util from SimpleDateFormat to thread-safe DateTimeFormatter class
[ https://issues.apache.org/jira/browse/YARN-11116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11116. -- Fix Version/s: 3.4.0 3.2.4 3.3.4 Resolution: Fixed Committed to trunk, branch-3.3, and branch-3.2. Thank you [~jeagles] for your contribution! > Migrate Times util from SimpleDateFormat to thread-safe DateTimeFormatter > class > --- > > Key: YARN-11116 > URL: https://issues.apache.org/jira/browse/YARN-11116 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Turner Eagles >Assignee: Jonathan Turner Eagles >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Attachments: YARN-11116.001.perftest.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Came across a stack trace with SimpleDateFormatter in it which led me to > investigate current practices > > {noformat} > 6578 "IPC Server handler 29 on 8032" #797 daemon prio=5 os_prio=0 > tid=0x7fb6527d nid=0x953b runnable [0x7fb5ba034000] > 6579 java.lang.Thread.State: RUNNABLE > 6580 at org.apache.hadoop.yarn.util.Times.formatISO8601(Times.java:95) > 6581 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:810) > 6582 at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:396) > 6583 at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:224) > 6584 at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:529) > 6585 at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > 6586 at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:500) > 6587 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1069) > 6588 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003) > 6589 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:936) > 6590 at java.security.AccessController.doPrivileged(Native Method) > 6591 at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2135) > 6592 at > org.apache.hadoop.security.UserGroupInformation.doAsPrivileged(UserGroupInformation.java:2123) > 6593 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2875) > 6594 > {noformat} > > DateTimeFormatter is thread-safe meaning no need to wrap the class in Thread > local as they can be reused safely across threads. In addition, the new > classes are slightly more performant. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
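The point of the YARN-11116 migration above is that java.text.SimpleDateFormat is mutable and therefore not thread-safe, while java.time.format.DateTimeFormatter is immutable and can be shared across IPC handler threads. A minimal before/after sketch, not the actual Times.java change; the ISO-8601 pattern shown is only assumed for illustration:
{code:java}
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class Iso8601FormatExample {

  // SimpleDateFormat keeps mutable parse/format state, so sharing one instance
  // across threads is unsafe; the usual workaround is one instance per thread.
  private static final ThreadLocal<SimpleDateFormat> LEGACY =
      ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ"));

  // DateTimeFormatter is immutable and thread-safe, so a single shared constant
  // is enough and avoids the per-thread allocation.
  private static final DateTimeFormatter ISO8601 =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ")
          .withZone(ZoneId.systemDefault());

  public static String formatLegacy(long ts) {
    return LEGACY.get().format(new java.util.Date(ts));
  }

  public static String formatNew(long ts) {
    return ISO8601.format(Instant.ofEpochMilli(ts));
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(formatLegacy(now));
    System.out.println(formatNew(now));
  }
}
{code}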
[jira] [Resolved] (YARN-10187) Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.
[ https://issues.apache.org/jira/browse/YARN-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-10187. -- Fix Version/s: 3.2.4 3.3.4 Resolution: Fixed Backported to branch-3.3 and branch-3.2. > Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained. > -- > > Key: YARN-10187 > URL: https://issues.apache.org/jira/browse/YARN-10187 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: N Sanketh Reddy >Assignee: Ashutosh Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Original Estimate: 1h > Time Spent: 50m > Remaining Estimate: 10m > > hadoop-yarn-project/hadoop-yarn/README is not maintained and can be removed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10187) Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.
[ https://issues.apache.org/jira/browse/YARN-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10187: - Description: hadoop-yarn-project/hadoop-yarn/README is not maintained and can be removed. (was: Converting a README to README.md for showcasing the markdown and for better readablity) > Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained. > -- > > Key: YARN-10187 > URL: https://issues.apache.org/jira/browse/YARN-10187 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: N Sanketh Reddy >Assignee: Ashutosh Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Original Estimate: 1h > Time Spent: 50m > Remaining Estimate: 10m > > hadoop-yarn-project/hadoop-yarn/README is not maintained and can be removed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10187) Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.
[ https://issues.apache.org/jira/browse/YARN-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10187: - Fix Version/s: 3.4.0 Issue Type: Bug (was: Improvement) Priority: Minor (was: Major) Summary: Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained. (was: converting README to README.md) Committed to trunk. Thank you [~groot] for your contribution! > Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained. > -- > > Key: YARN-10187 > URL: https://issues.apache.org/jira/browse/YARN-10187 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: N Sanketh Reddy >Assignee: Ashutosh Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Original Estimate: 1h > Time Spent: 50m > Remaining Estimate: 10m > > Converting a README to README.md for showcasing the markdown and for better > readablity -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.
[ https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-3224. - Assignee: (was: Sunil G) Resolution: Fixed Duplicate of YARN-6483. Closing > Notify AM with containers (on decommissioning node) could be preempted after > timeout. > - > > Key: YARN-3224 > URL: https://issues.apache.org/jira/browse/YARN-3224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful >Reporter: Junping Du >Priority: Major > Attachments: 0001-YARN-3224.patch, 0002-YARN-3224.patch > > > We should leverage YARN preemption framework to notify AM that some > containers will be preempted after a timeout. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10538) Add recommissioning nodes to the list of updated nodes returned to the AM
[ https://issues.apache.org/jira/browse/YARN-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529215#comment-17529215 ] Akira Ajisaka commented on YARN-10538: -- Hi [~groot], I tried to backport it to branch-2.10 but I faced some conflicts. Would you try backporting this and create a PR? I can review it. > Add recommissioning nodes to the list of updated nodes returned to the AM > - > > Key: YARN-10538 > URL: https://issues.apache.org/jira/browse/YARN-10538 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.9.1, 3.1.1 >Reporter: Srinivas S T >Assignee: Srinivas S T >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Time Spent: 2h > Remaining Estimate: 0h > > YARN-6483 introduced nodes that transitioned to DECOMMISSIONING state to the > list of updated nodes returned to the AM. This allows the Spark application > master to gracefully decommission its containers on the decommissioning node. > But if the node were to be recommissioned, the Spark application master would > not be aware of this. We propose to add recommissioned node to the list of > updated nodes sent to the AM when a recommission node transition occurs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11111: - Fix Version/s: 3.4.0 (was: 3.3.4) Changed the fix version to 3.4.0 because it is now only in trunk. > Recovery failure when node-label configure-type transit from > delegated-centralized to centralized > - > > Key: YARN-11111 > URL: https://issues.apache.org/jira/browse/YARN-11111 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h > Remaining Estimate: 0h > > When i make configure-type from delegated-centralized to centralized in > yarn-site.xml and restart the RM, it failed. > The error stacktrace is as follows > > {code:txt} > 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) > ... 
4 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) > at > org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) > at > org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) > at > org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) > at > org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) > at > org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) > at > org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) > ... 5 more > 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: > Trying to re-establish ZK session > {code} > When i digging into the codebase, found that the node and labels mapping is > stored in the nodelabel.mirror file when configured the type of c
[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10720: - Fix Version/s: 3.4.0 > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 2.10.2, 3.2.4, 3.3.3 > > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > Time Spent: 1h > Remaining Estimate: 0h > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
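The YARN-10720 fix above is about bounding how long the proxy waits on an unresponsive AM. The committed patch changes WebAppProxyServlet's HTTP client configuration and is not reproduced here; purely as a generic illustration of why bounded connect/read timeouts keep a hung peer from pinning proxy threads, plain java.net.HttpURLConnection can be configured like this (the 60-second values are arbitrary):
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutFetchExample {

  /**
   * Fetches a URL with bounded connect/read timeouts so a hung peer cannot
   * hold the calling thread (and its connection) open forever.
   */
  public static int fetchWithTimeouts(String url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setConnectTimeout(60_000); // fail fast if the peer never accepts the connection
    conn.setReadTimeout(60_000);    // fail fast if the peer accepts but never responds
    try (InputStream in = conn.getInputStream()) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // Drain the response; a real proxy would stream it back to the client.
      }
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }
}
{code}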
[jira] [Updated] (YARN-10553) Refactor TestDistributedShell
[ https://issues.apache.org/jira/browse/YARN-10553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10553: - Fix Version/s: 3.3.4 Backported to branch-3.3. > Refactor TestDistributedShell > - > > Key: YARN-10553 > URL: https://issues.apache.org/jira/browse/YARN-10553 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell, test >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Labels: pull-request-available, refactoring, test > Fix For: 3.4.0, 3.3.4 > > Time Spent: 8h 50m > Remaining Estimate: 0h > > TestDistributedShell has grown so large over time. It has 29 tests. > This is running the risk of exceeding 30 minutes limit for a single unit > class. > * The implementation has lots of code redundancy. > * The Jira splits TestDistributedShell into three different unitTest for > each TimeLineVersion: V1.0, 1.5, and 2.0 > * Fixes the broken test {{testDSShellWithEnforceExecutionType}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10553) Refactor TestDistributedShell
[ https://issues.apache.org/jira/browse/YARN-10553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520365#comment-17520365 ] Akira Ajisaka commented on YARN-10553: -- Opened [https://github.com/apache/hadoop/pull/4159] for branch-3.3 > Refactor TestDistributedShell > - > > Key: YARN-10553 > URL: https://issues.apache.org/jira/browse/YARN-10553 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell, test >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Labels: pull-request-available, refactoring, test > Fix For: 3.4.0 > > Time Spent: 8h 20m > Remaining Estimate: 0h > > TestDistributedShell has grown so large over time. It has 29 tests. > This is running the risk of exceeding 30 minutes limit for a single unit > class. > * The implementation has lots of code redundancy. > * The Jira splits TestDistributedShell into three different unitTest for > each TimeLineVersion: V1.0, 1.5, and 2.0 > * Fixes the broken test {{testDSShellWithEnforceExecutionType}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11101) Fix TestYarnConfigurationFields
[ https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11101. -- Resolution: Duplicate > Fix TestYarnConfigurationFields > --- > > Key: YARN-11101 > URL: https://issues.apache.org/jira/browse/YARN-11101 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, newbie >Reporter: Akira Ajisaka >Priority: Major > > yarn.resourcemanager.node-labels.am.default-node-label-expression is missing > in yarn-default.xml. > {noformat} > [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 > s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] testCompareConfigurationClassAgainstXml Time elapsed: 0.082 s <<< > FAILURE! > java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration > has 1 variables missing in yarn-default.xml Entries: > yarn.resourcemanager.node-labels.am.default-node-label-expression > expected:<0> but was:<1> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at > org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11101) Fix TestYarnConfigurationFields
[ https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517924#comment-17517924 ] Akira Ajisaka commented on YARN-11101: -- Thank you [~zuston] for the information. I'll close this as duplicate. > Fix TestYarnConfigurationFields > --- > > Key: YARN-11101 > URL: https://issues.apache.org/jira/browse/YARN-11101 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, newbie >Reporter: Akira Ajisaka >Priority: Major > > yarn.resourcemanager.node-labels.am.default-node-label-expression is missing > in yarn-default.xml. > {noformat} > [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 > s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] testCompareConfigurationClassAgainstXml Time elapsed: 0.082 s <<< > FAILURE! > java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration > has 1 variables missing in yarn-default.xml Entries: > yarn.resourcemanager.node-labels.am.default-node-label-expression > expected:<0> but was:<1> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at > org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-11073: Assignee: Jian Chen > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Assignee: Jian Chen >Priority: Major > Labels: pull-request-available > Attachments: YARN-11073.tmp-1.patch > > Time Spent: 40m > Remaining Estimate: 0h > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513261#comment-17513261 ] Akira Ajisaka commented on YARN-11073: -- Thank you [~jchenjc22]. Yes, I agree that testing the preemption behavior is difficult. For unit testing, I think it's sufficient to verify the behavior of the resetCapacity method that you have changed. > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Labels: pull-request-available > Attachments: YARN-11073.tmp-1.patch > > Time Spent: 40m > Remaining Estimate: 0h > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. 
Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10548) Decouple AM runner logic from SLSRunner
[ https://issues.apache.org/jira/browse/YARN-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513012#comment-17513012 ] Akira Ajisaka commented on YARN-10548: -- Hi all the reviewers, please do not +1 when spotbugs says -1. Hi [~snemeth] - please fix the spotbugs error in YARN-11102 as a follow-up. Otherwise I'll revert it because it not only raised a spotbugs error but also increased the javac, whitespace, and javadoc warnings. > Decouple AM runner logic from SLSRunner > --- > > Key: YARN-10548 > URL: https://issues.apache.org/jira/browse/YARN-10548 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10548.001.patch, YARN-10548.002.patch, > YARN-10548.003.patch > > > SLSRunner has too many responsibilities. > One of them is to parse the job details from the SLS input formats and > launch the AMs and task containers. > The AM runner logic could be decoupled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11102) Fix spotbugs error in hadoop-sls module
Akira Ajisaka created YARN-11102: Summary: Fix spotbugs error in hadoop-sls module Key: YARN-11102 URL: https://issues.apache.org/jira/browse/YARN-11102 Project: Hadoop YARN Issue Type: Bug Reporter: Akira Ajisaka Fix the following Spotbugs error: - org.apache.hadoop.yarn.sls.AMRunner.setInputTraces(String[]) may expose internal representation by storing an externally mutable object into AMRunner.inputTraces At AMRunner.java:by storing an externally mutable object into AMRunner.inputTraces At AMRunner.java:[line 267] - Write to static field org.apache.hadoop.yarn.sls.AMRunner.REMAINING_APPS from instance method org.apache.hadoop.yarn.sls.AMRunner.startAM() At AMRunner.java:from instance method org.apache.hadoop.yarn.sls.AMRunner.startAM() At AMRunner.java:[line 116] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
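The two findings listed in YARN-11102 above correspond to familiar spotbugs patterns (EI_EXPOSE_REP2 for storing an externally mutable array, and ST_WRITE_TO_STATIC_FROM_INSTANCE_METHOD for the static write). Purely as an illustration of the usual remedies, not the actual AMRunner change: store a defensive copy of the array instead of the caller's reference, and keep per-run state in an instance field rather than writing to a static from an instance method.
{code:java}
import java.util.Arrays;

public class ExposedRepExample {

  private String[] inputTraces; // keep a private copy, never the caller's array
  private int remainingApps;    // instance state instead of a static field

  /** Stores a copy so later mutation of the caller's array cannot leak in (EI_EXPOSE_REP2). */
  public void setInputTraces(String[] traces) {
    this.inputTraces = traces == null ? null : Arrays.copyOf(traces, traces.length);
  }

  /** Returns a copy for the symmetric EI_EXPOSE_REP case. */
  public String[] getInputTraces() {
    return inputTraces == null ? null : inputTraces.clone();
  }

  public void startApps(int appCount) {
    // Writing to a static field from an instance method trips spotbugs because
    // several instances would silently share (and race on) the same counter.
    this.remainingApps = appCount;
  }
}
{code}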
[jira] [Updated] (YARN-11101) Fix TestYarnConfigurationFields
[ https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11101: - Component/s: newbie > Fix TestYarnConfigurationFields > --- > > Key: YARN-11101 > URL: https://issues.apache.org/jira/browse/YARN-11101 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, newbie >Reporter: Akira Ajisaka >Priority: Major > > yarn.resourcemanager.node-labels.am.default-node-label-expression is missing > in yarn-default.xml. > {noformat} > [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 > s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] testCompareConfigurationClassAgainstXml Time elapsed: 0.082 s <<< > FAILURE! > java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration > has 1 variables missing in yarn-default.xml Entries: > yarn.resourcemanager.node-labels.am.default-node-label-expression > expected:<0> but was:<1> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at > org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11101) Fix TestYarnConfigurationFields
Akira Ajisaka created YARN-11101: Summary: Fix TestYarnConfigurationFields Key: YARN-11101 URL: https://issues.apache.org/jira/browse/YARN-11101 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Akira Ajisaka yarn.resourcemanager.node-labels.am.default-node-label-expression is missing in yarn-default.xml. {noformat} [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields [ERROR] testCompareConfigurationClassAgainstXml Time elapsed: 0.082 s <<< FAILURE! java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in yarn-default.xml Entries: yarn.resourcemanager.node-labels.am.default-node-label-expression expected:<0> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10720: - Fix Version/s: 2.10.2 3.2.4 Backported to branch-3.2 and branch-2.10. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 2.10.2, 3.2.4, 3.3.3 > > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > Time Spent: 1h > Remaining Estimate: 0h > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511676#comment-17511676 ] Akira Ajisaka commented on YARN-11073: -- Hello [~jchenjc22], do you want to continue the work to merge the fix into Apache Hadoop? If not, I'll let one of my colleagues take it over. > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Attachments: YARN-11073.tmp-1.patch > > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511652#comment-17511652 ] Akira Ajisaka commented on YARN-10720: -- Opened https://github.com/apache/hadoop/pull/4103 for branch-2.10. I'm seeing this issue in a prod cluster, so I want to backport the fix to all the release branches. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > Time Spent: 20m > Remaining Estimate: 0h > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511610#comment-17511610 ] Akira Ajisaka commented on YARN-10720: -- When cherry-picking to branch-3.2, I had to fix some conflicts. Opened https://github.com/apache/hadoop/pull/4102 for testing. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > Time Spent: 10m > Remaining Estimate: 0h > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10720: - Fix Version/s: 3.3.3 Cherry-picked to branch-3.3. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Fix For: 3.4.0, 3.3.3 > > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10747) Bump YARN CSI protobuf version to 3.7.1
[ https://issues.apache.org/jira/browse/YARN-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10747: - Fix Version/s: 3.3.3 Backported to branch-3.3. > Bump YARN CSI protobuf version to 3.7.1 > --- > > Key: YARN-10747 > URL: https://issues.apache.org/jira/browse/YARN-10747 > Project: Hadoop YARN > Issue Type: Task >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Bumping YARN CSI protobuf version to 3.7.1 to keep it consistent with > hadoop's protobuf version. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10538) Add recommissioning nodes to the list of updated nodes returned to the AM
[ https://issues.apache.org/jira/browse/YARN-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10538: - Fix Version/s: 3.2.3 Backported to branch-3.2 and branch-3.2.3. > Add recommissioning nodes to the list of updated nodes returned to the AM > - > > Key: YARN-10538 > URL: https://issues.apache.org/jira/browse/YARN-10538 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.9.1, 3.1.1 >Reporter: Srinivas S T >Assignee: Srinivas S T >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Time Spent: 2h > Remaining Estimate: 0h > > YARN-6483 introduced nodes that transitioned to DECOMMISSIONING state to the > list of updated nodes returned to the AM. This allows the Spark application > master to gracefully decommission its containers on the decommissioning node. > But if the node were to be recommissioned, the Spark application master would > not be aware of this. We propose to add recommissioned node to the list of > updated nodes sent to the AM when a recommission node transition occurs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
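For context on how an AM consumes this list, the sketch below iterates the updated-node reports returned on the allocate heartbeat. The class and method names are taken from the public hadoop-yarn-api records, but the handling logic is illustrative rather than the Spark AM's actual code.

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;

public class UpdatedNodesSketch {
  static void handleUpdatedNodes(AllocateResponse response) {
    List<NodeReport> updatedNodes = response.getUpdatedNodes();
    for (NodeReport node : updatedNodes) {
      if (node.getNodeState() == NodeState.DECOMMISSIONING) {
        // Start gracefully migrating work off this node.
        System.out.println(node.getNodeId() + " is decommissioning");
      } else if (node.getNodeState() == NodeState.RUNNING) {
        // With this change, a recommissioned node shows up here again,
        // so the AM can cancel its graceful-decommission handling.
        System.out.println(node.getNodeId() + " is back to RUNNING");
      }
    }
  }
}
{code}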
[jira] [Resolved] (YARN-11081) TestYarnConfigurationFields consistently keeps failing
[ https://issues.apache.org/jira/browse/YARN-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11081. -- Fix Version/s: 3.4.0 Resolution: Fixed Merged the PR into trunk. Thank you [~vjasani] for your contribution. > TestYarnConfigurationFields consistently keeps failing > -- > > Key: YARN-11081 > URL: https://issues.apache.org/jira/browse/YARN-11081 > Project: Hadoop YARN > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > TestYarnConfigurationFields consistently keeps failing with error: > {code:java} > Error Messageclass org.apache.hadoop.yarn.conf.YarnConfiguration has 1 > variables missing in yarn-default.xml Entries: > yarn.scheduler.app-placement-allocator.class expected:<0> but > was:<1>Stacktracejava.lang.AssertionError: class > org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in > yarn-default.xml Entries: yarn.scheduler.app-placement-allocator.class > expected:<0> but was:<1> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at > org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500616#comment-17500616 ] Akira Ajisaka commented on YARN-11073: -- Hi [~jchenjc22], the fix looks good to me. There are some comments from my side: - Need to add regression tests - It's better to print debug log when computeNormGuarFromAbsCapacity or computeNormGuarEvenly is used - The patch should be targeted to trunk > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Attachments: YARN-11073.tmp-1.patch > > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500566#comment-17500566 ] Akira Ajisaka commented on YARN-11073: -- {quote}IMO, it's okay to round up the queue's guaranteedCapacity. {quote} Rethinking this, it's not okay to round up the guaranteedCapacity because the round up can allocate more resources than the configured overall capacity. I'll look into your patch. > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Attachments: YARN-11073.tmp-1.patch > > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
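The over-allocation concern in the comment above can be checked with a quick arithmetic sketch (hypothetical numbers, not from the report): rounding each tiny queue's share up looks harmless per queue, but summed across many queues it promises more than the cluster has.

{code:java}
public class RoundUpOvercommitSketch {
  public static void main(String[] args) {
    long totalVcores = 150;
    int numQueues = 100;          // hypothetical: 100 queues, each min_capacity 1%
    double absCapacity = 0.01;

    long perQueueFloor = (long) (totalVcores * absCapacity);           // (long) 1.5 = 1
    long perQueueCeil = (long) Math.ceil(totalVcores * absCapacity);   // ceil(1.5) = 2

    System.out.println("sum with floor = " + numQueues * perQueueFloor); // 100 <= 150
    System.out.println("sum with ceil  = " + numQueues * perQueueCeil);  // 200 >  150
  }
}
{code}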
[jira] [Updated] (YARN-10472) Backport YARN-10314 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10472: - Fix Version/s: (was: 3.2.3) > Backport YARN-10314 to branch-3.2 > - > > Key: YARN-10472 > URL: https://issues.apache.org/jira/browse/YARN-10472 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.2.2 >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Blocker > Fix For: 3.2.2 > > > Filing this jira to raise the following concern: > YARN-10314 fixes a problem with the shaded jars in 3.3.0. But it is not > backported to branch-3.2 yet. [~weichiu] and I ([~smeng]) are looking into > this. > I have submitted a PR on branch-3.2: > https://github.com/apache/hadoop/pull/2412 > CC [~hexiaoqiao] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7266) Timeline Server event handler threads locked
[ https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-7266: Fix Version/s: 3.2.3 (was: 3.2.4) Cherry-picked to branch-3.2.3. > Timeline Server event handler threads locked > > > Key: YARN-7266 > URL: https://issues.apache.org/jira/browse/YARN-7266 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2, timelineserver >Affects Versions: 2.7.3 >Reporter: Venkata Puneet Ravuri >Assignee: Prabhu Joseph >Priority: Major > Fix For: 2.7.8, 3.3.0, 2.8.6, 2.9.3, 2.10.2, 3.2.3 > > Attachments: YARN-7266-0005.patch, YARN-7266-001.patch, > YARN-7266-002.patch, YARN-7266-003.patch, YARN-7266-004.patch, > YARN-7266-006.patch, YARN-7266-007.patch, YARN-7266-008.patch, > YARN-7266-branch-2.7.001.patch, YARN-7266-branch-2.8.001.patch > > > Event handlers for Timeline Server seem to take a lock while parsing HTTP > headers of the request. This is causing all other threads to wait and slowing > down the overall performance of Timeline server. We have resourcemanager > metrics enabled to send to timeline server. Because of the high load on > ResourceManager, the metrics to be sent are getting backlogged and in turn > increasing heap footprint of Resource Manager (due to pending metrics). > This is the complete stack trace of a blocked thread on timeline server:- > "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 > tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry > [0x7f6b9142c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82) > - waiting to lock <0x0005c0621860> (a java.lang.Class for > com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector) > at > com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168) > at > com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282) > at > com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.(SingleElementNodeProperty.java:94) > at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128) > at > com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551) > at > com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:112) > at > com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62) > at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128) > at > com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170) > at > 
com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145) > at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at javax.xml.bind.ContextFinder.newInstance(Unknown Source) > at javax.xml.bind.ContextFinder.newInstance(Unknown Source) > at javax.xml.bind.ContextFinder.find(Unknown Source) > at javax.xml.bind.JAXBContext.newInstance(Unknown Source) > at javax.xml.bind.JAXBContext.newInstance(Unknown Source) > at > com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412) > at > com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352) > at > com.sun.jersey.server.wadl.WadlBuilder.generat
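The stack trace above shows each handler thread building a fresh JAXBContext (via Jersey's WADL generation) and queuing on the AccessorInjector class lock. A common mitigation pattern is to build the context once and share it, since JAXBContext instances are thread-safe while their construction is expensive; the sketch below illustrates that general pattern only and is not necessarily the fix that was committed for YARN-7266.

{code:java}
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbContextCacheSketch {

  @XmlRootElement
  public static class Event {        // hypothetical payload class
    public String name;
  }

  // Built once and reused by every handler thread; JAXBContext is thread-safe.
  private static final JAXBContext CONTEXT;
  static {
    try {
      CONTEXT = JAXBContext.newInstance(Event.class);
    } catch (Exception e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  static String marshal(Event event) throws Exception {
    // Marshallers are cheap to create and not thread-safe, so make one per call.
    Marshaller marshaller = CONTEXT.createMarshaller();
    StringWriter out = new StringWriter();
    marshaller.marshal(event, out);
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    Event e = new Event();
    e.name = "container_started";
    System.out.println(marshal(e));
  }
}
{code}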
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495860#comment-17495860 ] Akira Ajisaka commented on YARN-11073: -- Thank you [~jchenjc22] for your comment. IMO, it's okay to round up the queue's guaranteedCapacity. Hi [~snemeth], do you have any suggestion? > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Attachments: YARN-11073.tmp-1.patch > > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11068) Update transitive log4j2 dependency to 2.17.1
[ https://issues.apache.org/jira/browse/YARN-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11068. -- Fix Version/s: 3.4.0 Resolution: Fixed Merged the PR#3963 into trunk. > Update transitive log4j2 dependency to 2.17.1 > - > > Key: YARN-11068 > URL: https://issues.apache.org/jira/browse/YARN-11068 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Similar to HADOOP-18092, we have transitive log4j2 dependency coming from > solr-core 8 that must be excluded. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492307#comment-17492307 ] Akira Ajisaka commented on YARN-11073: -- Note that you will need to rename the patch to something like "YARN-11073-branch-2.10-001.patch" so that it is tested against branch-2.10. Creating a GitHub pull request at https://github.com/apache/hadoop is the preferred approach rather than attaching a patch here. In addition, the target branch must be trunk. All patches first go to trunk and are then backported to other release branches. > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Attachments: YARN-11073.tmp-1.patch > > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. 
Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492305#comment-17492305 ] Akira Ajisaka commented on YARN-11073: -- Thank you [~jchenjc22] for your report. I'm +1 to always roundUp when calculating a queue's guaranteed capacity because it is very simple fix. Hi [~wangda], do you have any suggestion? > CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues > -- > > Key: YARN-11073 > URL: https://issues.apache.org/jira/browse/YARN-11073 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.10.1 >Reporter: Jian Chen >Priority: Major > Attachments: YARN-11073.tmp-1.patch > > > When running a Hive job in a low-capacity queue on an idle cluster, > preemption kicked in to preempt job containers even though there's no other > job running and competing for resources. > Let's take this scenario as an example: > * cluster resource : > ** {_}*queue_low*{_}: min_capacity 1% > ** queue_mid: min_capacity 19% > ** queue_high: min_capacity 80% > * CapacityScheduler with DRF > During the fifo preemption candidates selection process, the > _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" > which depends on each queue's guaranteed/min capacity. A queue's guaranteed > capacity is currently calculated as > "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed > capacity of queue_low is: > * {_}*queue_low*{_}: = > , but since the Resource object takes only Long > values, these Doubles values get casted into Long, and then the final result > becomes ** > Because the guaranteed capacity of queue_low is 0, its normalized guaranteed > capacity based on active queues is also 0 based on the current algorithm in > "{_}resetCapacity{_}". This eventually leads to the continuous preemption of > job containers running in {_}*queue_low*{_}. > In order to work around this corner case, I made a small patch (for my own > use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: > * if the sum of absoluteCapacity/minCapacity of all active queues is zero, > we should normalize their guaranteed capacity evenly > {code:java} > 1.0f / num_of_queues{code} > * if the sum of pre-normalized guaranteed capacity values ({_}MB or > VCores{_}) of all active queues is zero, meaning we might have several queues > like queue_low whose capacity value got casted into 0, we should normalize > evenly as well like the first scenario (if they are all tiny, it really makes > no big difference, for example, 1% vs 1.2%). > * if one of the active queues has a zero pre-normalized guaranteed capacity > value but its absoluteCapacity/minCapacity is *not* zero, then we should > normalize based on the weight of their configured queue > absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small > but fair normalized value when _*queue_mid*_ is also active. > {code:java} > minCapacity / (sum_of_min_capacity_of_active_queues) > {code} > > This is how I currently work around this issue, it might need someone who's > more familiar in this component to do a systematic review of the entire > preemption process to fix it properly. Maybe we can always apply the > weight-based approach using absoluteCapacity, or rewrite the code of Resource > to remove the casting, or always roundUp when calculating a queue's > guaranteed capacity, etc. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10788) TestCsiClient fails
[ https://issues.apache.org/jira/browse/YARN-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491702#comment-17491702 ] Akira Ajisaka commented on YARN-10788: -- Thank you [~ayushtkn] for the investigation. I created a PR to reduce the path length. > TestCsiClient fails > --- > > Key: YARN-10788 > URL: https://issues.apache.org/jira/browse/YARN-10788 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Major > Labels: pull-request-available > Attachments: > patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt > > Time Spent: 10m > Remaining Estimate: 0h > > TestCsiClient fails to bind to unix domain socket. > https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/518/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt > {noformat} > [INFO] Running org.apache.hadoop.yarn.csi.client.TestCsiClient > [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 0.67 > s <<< FAILURE! - in org.apache.hadoop.yarn.csi.client.TestCsiClient > [ERROR] testIdentityService(org.apache.hadoop.yarn.csi.client.TestCsiClient) > Time elapsed: 0.457 s <<< ERROR! > java.io.IOException: Failed to bind > at io.grpc.netty.NettyServer.start(NettyServer.java:257) > at io.grpc.internal.ServerImpl.start(ServerImpl.java:184) > at io.grpc.internal.ServerImpl.start(ServerImpl.java:90) > at > org.apache.hadoop.yarn.csi.client.FakeCsiDriver.start(FakeCsiDriver.java:56) > at > org.apache.hadoop.yarn.csi.client.TestCsiClient.testIdentityService(TestCsiClient.java:72) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
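For readers hitting the same failure: the bind error is typically tied to the unix domain socket path limit (around 108 bytes for sun_path on Linux), which a deeply nested Maven target directory can exceed; shortening the path is the workaround referenced above. The check below is an illustrative sketch, not part of the actual test code, and the exact limit should be treated as an assumption.

{code:java}
import java.nio.charset.StandardCharsets;

public class SocketPathLengthCheck {
  // Conventional Linux sun_path limit; treat the exact number as an assumption.
  private static final int SUN_PATH_MAX = 108;

  static void checkSocketPath(String path) {
    int lengthInBytes = path.getBytes(StandardCharsets.UTF_8).length;
    if (lengthInBytes >= SUN_PATH_MAX) {
      throw new IllegalArgumentException("Unix domain socket path is " + lengthInBytes
          + " bytes, which exceeds the ~" + SUN_PATH_MAX + "-byte limit: " + path);
    }
  }

  public static void main(String[] args) {
    checkSocketPath("/tmp/csi.sock"); // a short path binds fine
  }
}
{code}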
[jira] [Assigned] (YARN-10788) TestCsiClient fails
[ https://issues.apache.org/jira/browse/YARN-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-10788: Assignee: Akira Ajisaka > TestCsiClient fails > --- > > Key: YARN-10788 > URL: https://issues.apache.org/jira/browse/YARN-10788 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Major > Attachments: > patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt > > > TestCsiClient fails to bind to unix domain socket. > https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/518/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt > {noformat} > [INFO] Running org.apache.hadoop.yarn.csi.client.TestCsiClient > [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 0.67 > s <<< FAILURE! - in org.apache.hadoop.yarn.csi.client.TestCsiClient > [ERROR] testIdentityService(org.apache.hadoop.yarn.csi.client.TestCsiClient) > Time elapsed: 0.457 s <<< ERROR! > java.io.IOException: Failed to bind > at io.grpc.netty.NettyServer.start(NettyServer.java:257) > at io.grpc.internal.ServerImpl.start(ServerImpl.java:184) > at io.grpc.internal.ServerImpl.start(ServerImpl.java:90) > at > org.apache.hadoop.yarn.csi.client.FakeCsiDriver.start(FakeCsiDriver.java:56) > at > org.apache.hadoop.yarn.csi.client.TestCsiClient.testIdentityService(TestCsiClient.java:72) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10561) Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog webapp
[ https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10561: - Summary: Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog webapp (was: Upgrade node.js to at least 12.x in YARN application catalog webapp) > Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog > webapp > > > Key: YARN-10561 > URL: https://issues.apache.org/jira/browse/YARN-10561 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > YARN application catalog webapp is using node.js 8.11.3, and 8.x are already > EoL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10561) Upgrade node.js to at least 12.x in YARN application catalog webapp
[ https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10561: - Target Version/s: 3.4.0, 3.3.2 (was: 3.4.0) > Upgrade node.js to at least 12.x in YARN application catalog webapp > --- > > Key: YARN-10561 > URL: https://issues.apache.org/jira/browse/YARN-10561 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Critical > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > YARN application catalog webapp is using node.js 8.11.3, and 8.x are already > EoL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10561) Upgrade node.js to at least 12.x in YARN application catalog webapp
[ https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-10561: - Priority: Critical (was: Major) > Upgrade node.js to at least 12.x in YARN application catalog webapp > --- > > Key: YARN-10561 > URL: https://issues.apache.org/jira/browse/YARN-10561 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Critical > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > YARN application catalog webapp is using node.js 8.11.3, and 8.x are already > EoL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10561) Upgrade node.js to at least 12.x in YARN application catalog webapp
[ https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483550#comment-17483550 ] Akira Ajisaka commented on YARN-10561: -- I'm facing the error when building. Raising the priority: {code} [INFO] --- frontend-maven-plugin:1.11.2:yarn (yarn install) @ hadoop-yarn-applications-catalog-webapp --- [INFO] testFailureIgnore property is ignored in non test phases [INFO] Running 'yarn ' in /home/runner/work/hadoop-document/hadoop-document/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-webapp/target [INFO] yarn install v1.7.0 [INFO] info No lockfile found. [INFO] [1/4] Resolving packages... [INFO] [2/4] Fetching packages... [INFO] error safe-stable-stringify@2.3.1: The engine "node" is incompatible with this module. Expected version ">=10". [INFO] error Found incompatible module [INFO] info Visit https://yarnpkg.com/en/docs/cli/install for documentation about this command. {code} > Upgrade node.js to at least 12.x in YARN application catalog webapp > --- > > Key: YARN-10561 > URL: https://issues.apache.org/jira/browse/YARN-10561 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > YARN application catalog webapp is using node.js 8.11.3, and 8.x are already > EoL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11068) Exclude transitive log4j2 dependency coming from solr 8
[ https://issues.apache.org/jira/browse/YARN-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483223#comment-17483223 ] Akira Ajisaka commented on YARN-11068: -- Hadoop 3.3 has the same issue. {code} [INFO] +- org.apache.solr:solr-core:jar:7.7.0:test [INFO] | +- org.apache.logging.log4j:log4j-1.2-api:jar:2.11.0:test [INFO] | +- org.apache.logging.log4j:log4j-api:jar:2.11.0:test [INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.11.0:test [INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.11.0:test {code} BTW, I don't think this issue is blocker because the scope is test and the jar files are not in the binary tarball. > Exclude transitive log4j2 dependency coming from solr 8 > --- > > Key: YARN-11068 > URL: https://issues.apache.org/jira/browse/YARN-11068 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Blocker > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > Similar to HADOOP-18092, we have transitive log4j2 dependency coming from > solr-core 8 that must be excluded. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
[ https://issues.apache.org/jira/browse/YARN-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11065: - Fix Version/s: 3.3.3 Backported to branch-3.3. > Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui > - > > Key: YARN-11065 > URL: https://issues.apache.org/jira/browse/YARN-11065 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Akira Ajisaka >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 10m > Remaining Estimate: 0h > > Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
[ https://issues.apache.org/jira/browse/YARN-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11065. -- Fix Version/s: 3.4.0 Resolution: Fixed > Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui > - > > Key: YARN-11065 > URL: https://issues.apache.org/jira/browse/YARN-11065 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Akira Ajisaka >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
[ https://issues.apache.org/jira/browse/YARN-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479336#comment-17479336 ] Akira Ajisaka commented on YARN-11065: -- Merged [https://github.com/apache/hadoop/pull/3890] into trunk. The PR was created by dependabot, so the assignee is left empty. > Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui > - > > Key: YARN-11065 > URL: https://issues.apache.org/jira/browse/YARN-11065 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Akira Ajisaka >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
Akira Ajisaka created YARN-11065: Summary: Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui Key: YARN-11065 URL: https://issues.apache.org/jira/browse/YARN-11065 Project: Hadoop YARN Issue Type: Bug Components: yarn-ui-v2 Reporter: Akira Ajisaka Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11055) In cgroups-operations.c some fprintf format strings don't end with "\n"
[ https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11055. -- Fix Version/s: 3.4.0 3.2.4 3.3.3 Resolution: Fixed Committed to trunk, branch-3.3, and branch-3.2. Thank you [~jira.shegalov] for your contribution. > In cgroups-operations.c some fprintf format strings don't end with "\n" > > > Key: YARN-11055 > URL: https://issues.apache.org/jira/browse/YARN-11055 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1 >Reporter: Gera Shegalov >Assignee: Gera Shegalov >Priority: Minor > Labels: cgroups, easyfix, pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.3 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In cgroup-operations.c some {{{}fprintf{}}}s are missing a newline char at > the end leading to a hard-to-parse error message output > example: > https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11053) AuxService should not use class name as default system classes
[ https://issues.apache.org/jira/browse/YARN-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-11053: - Fix Version/s: 3.3.2 (was: 3.3.3) Cherry-picked to branch-3.3.2. > AuxService should not use class name as default system classes > -- > > Key: YARN-11053 > URL: https://issues.apache.org/jira/browse/YARN-11053 > Project: Hadoop YARN > Issue Type: Bug > Components: auxservices >Affects Versions: 3.3.1 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.2 > > Time Spent: 50m > Remaining Estimate: 0h > > Following Apache Spark document to configure Spark Shuffle Service as YARN > AuxService, > [https://spark.apache.org/docs/3.2.0/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service] > > {code:java} > > yarn.nodemanager.aux-services > spark_shuffle > > > yarn.nodemanager.aux-services.spark_shuffle.classpath > /opt/apache/spark/yarn/* > > > > yarn.nodemanager.aux-services.spark_shuffle.class&lt;/name> > org.apache.spark.network.yarn.YarnShuffleService >{code} > but failed with exception > {code:java} > 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: classpath: > [file:/opt/apache/spark/yarn/spark-3.2.0-yarn-shuffle.jar] > 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: system classes: > [org.apache.spark.network.yarn.YarnShuffleService] > 2021-12-02 15:34:00,887 INFO service.AbstractService: Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:165) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452) > ... 10 more > {code} > A workaround is adding > {code:java} > > yarn.nodemanager.aux-services.spark_shuffle.system-classes > not.existed.class > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
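The failure mode in this report comes down to classloader delegation: anything matching the "system classes" list is loaded by the NodeManager's own classloader, which does not see the aux service's custom classpath jar, so defaulting the system classes to the aux service class name itself guarantees a ClassNotFoundException. The sketch below only mirrors that delegation idea and is not the actual ApplicationClassLoader implementation.

{code:java}
import java.util.List;

public class SystemClassDelegationSketch {

  // Simplified prefix matching; the real implementation supports richer patterns.
  static boolean isSystemClass(String className, List<String> systemClasses) {
    return systemClasses.stream().anyMatch(className::startsWith);
  }

  public static void main(String[] args) {
    // The buggy default: the aux service class itself listed as a system class.
    List<String> systemClasses =
        List.of("org.apache.spark.network.yarn.YarnShuffleService");
    String toLoad = "org.apache.spark.network.yarn.YarnShuffleService";

    if (isSystemClass(toLoad, systemClasses)) {
      // Delegated to the parent (NodeManager) classloader, which does not have
      // spark-3.2.0-yarn-shuffle.jar on its classpath -> ClassNotFoundException.
      System.out.println(toLoad + " is delegated to the parent classloader");
    }
  }
}
{code}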
[jira] [Resolved] (YARN-9967) Fix NodeManager failing to start when Hdfs Auxillary Jar is set
[ https://issues.apache.org/jira/browse/YARN-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-9967. - Resolution: Duplicate Fixed by YARN-11053. Closing as duplicate. > Fix NodeManager failing to start when Hdfs Auxillary Jar is set > --- > > Key: YARN-9967 > URL: https://issues.apache.org/jira/browse/YARN-9967 > Project: Hadoop YARN > Issue Type: Bug > Components: auxservices, nodemanager >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > Loading an auxiliary jar from a Hdfs location on a node manager fails with > ClassNotFound Exception > {code:java} > 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [] > 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: > system classes: [java., javax.accessibility., javax.activation., > javax.activity., javax.annotation., javax.annotation.processing., > javax.crypto., javax.imageio., javax.jws., javax.lang.model., > -javax.management.j2ee., javax.management., javax.naming., javax.net., > javax.print., javax.rmi., javax.script., -javax.security.auth.message., > javax.security.auth., javax.security.cert., javax.security.sasl., > javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., > -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., > org.xml.sax., org.apache.commons.logging., org.apache.log4j., > -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, > hdfs-default.xml, mapred-default.xml, yarn-default.xml] > 2019-11-08 03:59:49,257 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromHDFS > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:270) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:321) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016) > {code} > *Repro:* > {code:java} > 1. 
Prepare a custom auxiliary service jar and place it on hdfs > [hdfs@yarndocker-1 yarn]$ cat TestShuffleHandler2.java > package org; > import org.apache.hadoop.yarn.server.api.AuxiliaryService; > import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext; > import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext; > import java.nio.ByteBuffer; > public class TestShuffleHandler2 extends AuxiliaryService { > public static final String MAPREDUCE_TEST_SHUFFLE_SERVICEID = > "test_shuffle2"; > public TestShuffleHandler2() { > super("testshuffle2"); > } > @Override > public void initializeApplication(ApplicationInitializationContext > context) { > } > @Override > public void stopApplication(ApplicationTerminationContext context) { > } > @Override > public synchronized ByteBuffer getMetaData() { > return ByteBuffer.allocate(0); > } > } > > [hdfs@yarndocker-1 yarn]$ javac -d . -cp `hadoop classpath` > TestShuffleHandler2.java > [hdfs@yarndocker-1 yarn]$ jar cvf auxhdfs.jar org/ > [hdfs@yarndocker-1 mapreduce]$ hadoop fs -mkdir
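The quoted repro is cut off above, so for orientation only: a NodeManager configuration that loads an auxiliary service jar from HDFS generally takes the shape sketched below. The service name and class come from the sample code in the repro, while the remote-classpath property and the hdfs:// path are assumptions added for illustration and are not taken from the original report.

{code:xml}
<!-- Sketch only: the remote-classpath property and the hdfs:// path are
     assumptions for illustration; names/class come from the repro code above. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>test_shuffle2</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.test_shuffle2.class</name>
  <value>org.TestShuffleHandler2</value>
</property>
<property>
  <!-- points the NodeManager at a jar to localize from HDFS -->
  <name>yarn.nodemanager.aux-services.test_shuffle2.remote-classpath</name>
  <value>hdfs:///tmp/aux/auxhdfs.jar</value>
</property>
{code}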
[jira] [Resolved] (YARN-11053) AuxService should not use class name as default system classes
[ https://issues.apache.org/jira/browse/YARN-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved YARN-11053. -- Fix Version/s: 3.4.0 3.3.3 Resolution: Fixed Committed to trunk and branch-3.3. > AuxService should not use class name as default system classes > -- > > Key: YARN-11053 > URL: https://issues.apache.org/jira/browse/YARN-11053 > Project: Hadoop YARN > Issue Type: Bug > Components: auxservices >Affects Versions: 3.3.1 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 40m > Remaining Estimate: 0h > > Following Apache Spark document to configure Spark Shuffle Service as YARN > AuxService, > [https://spark.apache.org/docs/3.2.0/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service] > > {code:java} > > yarn.nodemanager.aux-services > spark_shuffle > > > yarn.nodemanager.aux-services.spark_shuffle.classpath > /opt/apache/spark/yarn/* > > > > yarn.nodemanager.aux-services.spark_shuffle.class&lt;/name> > org.apache.spark.network.yarn.YarnShuffleService >{code} > but failed with exception > {code:java} > 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: classpath: > [file:/opt/apache/spark/yarn/spark-3.2.0-yarn-shuffle.jar] > 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: system classes: > [org.apache.spark.network.yarn.YarnShuffleService] > 2021-12-02 15:34:00,887 INFO service.AbstractService: Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:165) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452) > ... 10 more > {code} > A workaround is to add > {code:java} > <property> > <name>yarn.nodemanager.aux-services.spark_shuffle.system-classes</name> > <value>not.existed.class</value> > </property> > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
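The XML tags in the configuration snippet quoted in the description above were stripped in transit, leaving only property names and values. Based on those surviving names and values (and the Spark documentation linked in the description), the intended yarn-site.xml configuration is approximately the following sketch; the tag structure is inferred, not quoted.

{code:xml}
<!-- Reconstruction of the stripped snippet: names and values are from the
     quoted description, the <property>/<name>/<value> structure is inferred. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
  <value>/opt/apache/spark/yarn/*</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
{code}

As the description explains, the failure arises because the service class name itself ends up in the default system classes, so the custom classpath is never consulted; the system-classes override shown in the workaround forces the class to be loaded from the configured classpath instead.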
[jira] [Commented] (YARN-11030) ClassNotFoundException when aux service class is loaded from customized classpath
[ https://issues.apache.org/jira/browse/YARN-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464861#comment-17464861 ] Akira Ajisaka commented on YARN-11030: -- Thank you for your response. I'll merge the PR shortly. > ClassNotFoundException when aux service class is loaded from customized > classpath > - > > Key: YARN-11030 > URL: https://issues.apache.org/jira/browse/YARN-11030 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.3.1 >Reporter: Hiroyuki Adachi >Priority: Minor > > NodeManager failed to load the aux service with ClassNotFoundException while > loading the class from the customized classpath. > {noformat} > > > value="org.apache.spark.network.yarn.YarnShuffleService"/> > value="/tmp/spark-3.1.2-yarn-shuffle.jar"/> > > {noformat} > {noformat} > 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar] > 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: > system classes: [org.apache.spark.network.yarn.YarnShuffleService] > 2021-12-06 15:32:09,169 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in > state INITED > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.ja > va:165) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271) > 
at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452) > ... 10 more > 2021-12-06 15:32:09,172 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl > > failed in state INITED{noformat} > > YARN-9075 may cause this problem. The default system classes were changed by > this patch. > Before YARN-9075: isSystemClass() returns false since the system classes do > not contain the aux service class itself, and the class will be loaded from > the customized classpath. > [https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ApplicationClassLoader.java#L176] > {noformat} > 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar] > 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClas
[jira] [Assigned] (YARN-11053) AuxService should not use class name as default system classes
[ https://issues.apache.org/jira/browse/YARN-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka reassigned YARN-11053: Assignee: Cheng Pan > AuxService should not use class name as default system classes > -- > > Key: YARN-11053 > URL: https://issues.apache.org/jira/browse/YARN-11053 > Project: Hadoop YARN > Issue Type: Bug > Components: auxservices >Affects Versions: 3.3.1 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Following Apache Spark document to configure Spark Shuffle Service as YARN > AuxService, > [https://spark.apache.org/docs/3.2.0/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service] > > {code:java} > > yarn.nodemanager.aux-services > spark_shuffle > > > yarn.nodemanager.aux-services.spark_shuffle.classpath > /opt/apache/spark/yarn/* > > > yarn.nodemanager.aux-services.spark_shuffle.class</name> > org.apache.spark.network.yarn.YarnShuffleService >{code} > but failed with exception > {code:java} > 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: classpath: > [file:/opt/apache/spark/yarn/spark-3.2.0-yarn-shuffle.jar] > 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: system classes: > [org.apache.spark.network.yarn.YarnShuffleService] > 2021-12-02 15:34:00,887 INFO service.AbstractService: Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:165) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452) > ... 10 more > {code} > A workaround is to add > {code:java} > <property> > <name>yarn.nodemanager.aux-services.spark_shuffle.system-classes</name> > <value>not.existed.class</value> > </property> > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9967) Fix NodeManager failing to start when Hdfs Auxillary Jar is set
[ https://issues.apache.org/jira/browse/YARN-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464429#comment-17464429 ] Akira Ajisaka commented on YARN-9967: - Hi [~minni31], there is already a PR: https://github.com/apache/hadoop/pull/3816 Would you check this? > Fix NodeManager failing to start when Hdfs Auxillary Jar is set > --- > > Key: YARN-9967 > URL: https://issues.apache.org/jira/browse/YARN-9967 > Project: Hadoop YARN > Issue Type: Bug > Components: auxservices, nodemanager >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > Loading an auxiliary jar from a Hdfs location on a node manager fails with > ClassNotFound Exception > {code:java} > 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [] > 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: > system classes: [java., javax.accessibility., javax.activation., > javax.activity., javax.annotation., javax.annotation.processing., > javax.crypto., javax.imageio., javax.jws., javax.lang.model., > -javax.management.j2ee., javax.management., javax.naming., javax.net., > javax.print., javax.rmi., javax.script., -javax.security.auth.message., > javax.security.auth., javax.security.cert., javax.security.sasl., > javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., > -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., > org.xml.sax., org.apache.commons.logging., org.apache.log4j., > -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, > hdfs-default.xml, mapred-default.xml, yarn-default.xml] > 2019-11-08 03:59:49,257 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromHDFS > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:270) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:321) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936) > at > 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016) > {code} > *Repro:* > {code:java} > 1. Prepare a custom auxiliary service jar and place it on hdfs > [hdfs@yarndocker-1 yarn]$ cat TestShuffleHandler2.java > package org; > import org.apache.hadoop.yarn.server.api.AuxiliaryService; > import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext; > import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext; > import java.nio.ByteBuffer; > public class TestShuffleHandler2 extends AuxiliaryService { > public static final String MAPREDUCE_TEST_SHUFFLE_SERVICEID = > "test_shuffle2"; > public TestShuffleHandler2() { > super("testshuffle2"); > } > @Override > public void initializeApplication(ApplicationInitializationContext > context) { > } > @Override > public void stopApplication(ApplicationTerminationContext context) { > } > @Override > public synchronized ByteBuffer getMetaData() { > return ByteBuffer.allocate(0); > } > } > > [hdfs@yarndocker-1 yarn]$ javac -d . -cp `hadoop classpath` > TestShuffleHandler2.java > [hdfs@yarnd
[jira] [Commented] (YARN-11030) ClassNotFoundException when aux service class is loaded from customized classpath
[ https://issues.apache.org/jira/browse/YARN-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464426#comment-17464426 ] Akira Ajisaka commented on YARN-11030: -- I found YARN-11053 duplicates this issue, and there is a PR https://github.com/apache/hadoop/pull/3816 Hi [~hadachi], is the PR fix this problem? > ClassNotFoundException when aux service class is loaded from customized > classpath > - > > Key: YARN-11030 > URL: https://issues.apache.org/jira/browse/YARN-11030 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.3.1 >Reporter: Hiroyuki Adachi >Priority: Minor > > NodeManager failed to load the aux service with ClassNotFoundException while > loading the class from the customized classpath. > {noformat} > > > value="org.apache.spark.network.yarn.YarnShuffleService"/> > value="/tmp/spark-3.1.2-yarn-shuffle.jar"/> > > {noformat} > {noformat} > 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar] > 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: > system classes: [org.apache.spark.network.yarn.YarnShuffleService] > 2021-12-06 15:32:09,169 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in > state INITED > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.yarn.YarnShuffleService > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.ja > va:165) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452) > ... 10 more > 2021-12-06 15:32:09,172 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl > > failed in state INITED{noformat} > > YARN-9075 may cause this problem. The default system classes were changed by > this patch. > Before YARN-9075: isSystemClass() returns false since the system classes do > not contain the aux service class itself, and the class will be loaded from > the customized classpath. > [https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ApplicationClassLoader.java#L176] > {noformat} > 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [file:/tmp/spark-3
[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-8234: Release Note: When Timeline Service V1 or V1.5 is used, if "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" is set to true, ResourceManager sends timeline events in batch. The default value is false. If this functionality is enabled, the maximum number that events published in batch is configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". The default value is 1000. The interval of publishing events can be configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds". By default, it is set to 60 seconds. (was: If "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" is set to true, ResourceManager sends timeline events in batch. The default value is false. If this functionality is enabled, the maximum number that events published in batch is configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". The default value is 1000. The interval of publishing events can be configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds". By default, it is set to 60 seconds.) > Improve RM system metrics publisher's performance by pushing events to > timeline server in batch > --- > > Key: YARN-8234 > URL: https://issues.apache.org/jira/browse/YARN-8234 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, timelineserver >Affects Versions: 2.8.3 >Reporter: Hu Ziqian >Assignee: Ashutosh Gupta >Priority: Critical > Labels: pull-request-available > Attachments: YARN-8234-branch-2.8.3.001.patch, > YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, > YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, > YARN-8234.003.patch, YARN-8234.004.patch > > Time Spent: 7.5h > Remaining Estimate: 0h > > When system metrics publisher is enabled, RM will push events to timeline > server via restful api. If the cluster load is heavy, many events are sent to > timeline server and the timeline server's event handler thread locked. > YARN-7266 talked about the detail of this problem. Because of the lock, > timeline server can't receive event as fast as it generated in RM and lots of > timeline event stays in RM's memory. Finally, those events will consume all > RM's memory and RM will start a full gc (which cause an JVM stop-world and > cause a timeout from rm to zookeeper) or even get an OOM. > The main problem here is that timeline can't receive timeline server's event > as fast as it generated. Now, RM system metrics publisher put only one event > in a request, and most time costs on handling http header or some thing about > the net connection on timeline side. Only few time is spent on dealing with > the timeline event which is truly valuable. > In this issue, we add a buffer in system metrics publisher and let publisher > send events to timeline server in batch via one request. When sets the batch > size to 1000, in out experiment the speed of the timeline server receives > events has 100x improvement. We have implement this function int our product > environment which accepts 2 app's in one hour and it works fine. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
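Putting the three properties from the release note together, a minimal yarn-site.xml sketch for enabling batched event publishing to Timeline Service v1/v1.5 might look like the following; the values shown are the documented defaults, except enable-batch, which has to be switched on explicitly.

{code:xml}
<!-- Sketch based on the property names and defaults quoted in the release note. -->
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch</name>
  <value>true</value> <!-- default is false -->
</property>
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size</name>
  <value>1000</value> <!-- maximum number of events published per request -->
</property>
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds</name>
  <value>60</value> <!-- interval for flushing a partially filled batch -->
</property>
{code}

As the updated release note states, this applies only when Timeline Service V1 or V1.5 is used.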
[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-8234: Release Note: If "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" is set to true, ResourceManager sends timeline events in batch. The default value is false. If this functionality is enabled, the maximum number that events published in batch is configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". The default value is 1000. The interval of publishing events can be configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds". By default, it is set to 60 seconds. > Improve RM system metrics publisher's performance by pushing events to > timeline server in batch > --- > > Key: YARN-8234 > URL: https://issues.apache.org/jira/browse/YARN-8234 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, timelineserver >Affects Versions: 2.8.3 >Reporter: Hu Ziqian >Assignee: Ashutosh Gupta >Priority: Critical > Labels: pull-request-available > Attachments: YARN-8234-branch-2.8.3.001.patch, > YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, > YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, > YARN-8234.003.patch, YARN-8234.004.patch > > Time Spent: 7.5h > Remaining Estimate: 0h > > When system metrics publisher is enabled, RM will push events to timeline > server via restful api. If the cluster load is heavy, many events are sent to > timeline server and the timeline server's event handler thread locked. > YARN-7266 talked about the detail of this problem. Because of the lock, > timeline server can't receive event as fast as it generated in RM and lots of > timeline event stays in RM's memory. Finally, those events will consume all > RM's memory and RM will start a full gc (which cause an JVM stop-world and > cause a timeout from rm to zookeeper) or even get an OOM. > The main problem here is that timeline can't receive timeline server's event > as fast as it generated. Now, RM system metrics publisher put only one event > in a request, and most time costs on handling http header or some thing about > the net connection on timeline side. Only few time is spent on dealing with > the timeline event which is truly valuable. > In this issue, we add a buffer in system metrics publisher and let publisher > send events to timeline server in batch via one request. When sets the batch > size to 1000, in out experiment the speed of the timeline server receives > events has 100x improvement. We have implement this function int our product > environment which accepts 2 app's in one hour and it works fine. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated YARN-8234: Description: When system metrics publisher is enabled, RM will push events to timeline server via restful api. If the cluster load is heavy, many events are sent to timeline server and the timeline server's event handler thread locked. YARN-7266 talked about the detail of this problem. Because of the lock, timeline server can't receive event as fast as it generated in RM and lots of timeline event stays in RM's memory. Finally, those events will consume all RM's memory and RM will start a full gc (which cause an JVM stop-world and cause a timeout from rm to zookeeper) or even get an OOM. The main problem here is that timeline can't receive timeline server's event as fast as it generated. Now, RM system metrics publisher put only one event in a request, and most time costs on handling http header or some thing about the net connection on timeline side. Only few time is spent on dealing with the timeline event which is truly valuable. In this issue, we add a buffer in system metrics publisher and let publisher send events to timeline server in batch via one request. When sets the batch size to 1000, in out experiment the speed of the timeline server receives events has 100x improvement. We have implement this function int our product environment which accepts 2 app's in one hour and it works fine. was: When system metrics publisher is enabled, RM will push events to timeline server via restful api. If the cluster load is heavy, many events are sent to timeline server and the timeline server's event handler thread locked. YARN-7266 talked about the detail of this problem. Because of the lock, timeline server can't receive event as fast as it generated in RM and lots of timeline event stays in RM's memory. Finally, those events will consume all RM's memory and RM will start a full gc (which cause an JVM stop-world and cause a timeout from rm to zookeeper) or even get an OOM. The main problem here is that timeline can't receive timeline server's event as fast as it generated. Now, RM system metrics publisher put only one event in a request, and most time costs on handling http header or some thing about the net connection on timeline side. Only few time is spent on dealing with the timeline event which is truly valuable. In this issue, we add a buffer in system metrics publisher and let publisher send events to timeline server in batch via one request. When sets the batch size to 1000, in out experiment the speed of the timeline server receives events has 100x improvement. We have implement this function int our product environment which accepts 2 app's in one hour and it works fine. We add following configuration: * yarn.resourcemanager.system-metrics-publisher.batch-size: the size of system metrics publisher sending events in one request. Default value is 1000 * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the event buffer in system metrics publisher. * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When enable batch publishing, we must avoid that the publisher waits for a batch to be filled up and hold events in buffer for long time. So we add another thread which send event's in the buffer periodically. This config sets the interval of the cyclical sending thread. The default value is 60s. 
> Improve RM system metrics publisher's performance by pushing events to > timeline server in batch > --- > > Key: YARN-8234 > URL: https://issues.apache.org/jira/browse/YARN-8234 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, timelineserver >Affects Versions: 2.8.3 >Reporter: Hu Ziqian >Assignee: Ashutosh Gupta >Priority: Critical > Labels: pull-request-available > Attachments: YARN-8234-branch-2.8.3.001.patch, > YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, > YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, > YARN-8234.003.patch, YARN-8234.004.patch > > Time Spent: 7.5h > Remaining Estimate: 0h > > When system metrics publisher is enabled, RM will push events to timeline > server via restful api. If the cluster load is heavy, many events are sent to > timeline server and the timeline server's event handler thread locked. > YARN-7266 talked about the detail of this problem. Because of the lock, > timeline server can't receive event as fast as it generated in RM and lots of > timeline event stays in RM's memory. Finally, those events will consume all > RM's memory and RM will start a full gc (which cause an JVM st