[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824884#comment-17824884 ]
ASF GitHub Bot commented on YARN-11656:
---------------------------------------

hadoop-yetus commented on PR #6569:
URL: https://github.com/apache/hadoop/pull/6569#issuecomment-1986567474

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 17m 58s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 8 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 47s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 36m 36s | | trunk passed |
| +1 :green_heart: | compile | 8m 6s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 7m 21s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 58s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 1s | | trunk passed |
| +1 :green_heart: | javadoc | 2m 6s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 57s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 43m 52s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 31s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 33s | | the patch passed |
| +1 :green_heart: | compile | 9m 18s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 9m 18s | | the patch passed |
| +1 :green_heart: | compile | 9m 26s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 9m 26s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 55s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/9/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt) | hadoop-yarn-project/hadoop-yarn: The patch generated 14 new + 95 unchanged - 3 fixed = 109 total (was 98) |
| +1 :green_heart: | mvnsite | 1m 52s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 52s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 48s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| -1 :x: | spotbugs | 1m 56s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/9/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| +1 :green_heart: | shadedclient | 40m 7s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 5m 37s | | hadoop-yarn-common in the patch passed. |
| -1 :x: | unit | 104m 43s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/9/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 55s | | The patch does not generate ASF License warnings. |
| | | 328m 36s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common |
| | Bad attempt to compute absolute value of signed 32-bit hashcode in org.apache.hadoop.yarn.event.multidispatcher.MultiDispatcherExecutor.execute(Event, Runnable) At MultiDispatcherExecutor.java:[line 60] |
| | new org.apache.hadoop.yarn.event.multidispatcher.MultiDispatcherExecutor(Logger, MultiDispatcherConfig, String) invokes org.apache.hadoop.yarn.event.multidispatcher.MultiDispatcherExecutor$MultiDispatcherExecutorThread.start() At MultiDispatcherExecutor.java:[line 54] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/9/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6569 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 7993079112fd 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 06542118fd6c6cb9f43683720f86a263c86abcf0 |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/9/testReport/ |
| Max. process+thread count | 1834 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/9/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> RMStateStore event queue blocked
> --------------------------------
>
>                 Key: YARN-11656
>                 URL: https://issues.apache.org/jira/browse/YARN-11656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.1
>            Reporter: Bence Kosztolnik
>            Assignee: Bence Kosztolnik
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: issue.png, log.png
>
>
> h2. Problem statement
>
> I observed a YARN cluster that had both pending and available resources, yet
> its utilization usually stayed around 50%. The cluster was loaded with 200
> parallel PI example jobs (from hadoop-mapreduce-examples), each configured
> with 20 map and 20 reduce containers, on a 50-node cluster where each node
> had 8 cores and a lot of memory (CPU was the bottleneck resource).
> Eventually I realized that the RM had an IO bottleneck and needed 1~20
> seconds to persist an RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events in parallel threads
> - create metric data for the RMStateStore event queue, so the problem can be
> identified easily if it occurs on a cluster
> {panel:title=Issue visible on UI2}
> !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether too much time is
> needed to store the app info after the app reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
> !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The dispatcher creates a separate metric object called _Event metrics for
> "rm-state-store"_ where we can see
> - how many unhandled events are currently present in the event queue for the
> specific event type
> - how many events were handled for the specific event type
> - the average execution time for the specific event type
> The dispatcher has the following configs (the {} placeholder stands for the
> dispatcher name, for example rm-state-store)
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that execute events in parallel|4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the execution threads will scale up to this many|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue will be logged with this frequency (if not zero)|30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher will wait this many seconds to process the incoming events before terminating them|60|
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
> {
>   "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
>   "modelerType": "Event metrics for rm-state-store",
>   "tag.Context": "yarn",
>   "tag.Hostname": CENSORED,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_APP_Current": 124,
>   "RMStateStoreEventType#STORE_APP_NumOps": 46,
>   "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>   "RMStateStoreEventType#UPDATE_APP_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>   "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.6666666666665,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.6666666666665,
>   "RMStateStoreEventType#REMOVE_APP_Current": 12,
>   "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>   "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#FENCED_Current": 0,
>   "RMStateStoreEventType#FENCED_NumOps": 0,
>   "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_RESERVATION_Current": 0,
>   "RMStateStoreEventType#STORE_RESERVATION_NumOps": 0,
>   "RMStateStoreEventType#STORE_RESERVATION_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_Current": 0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_Current": 0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_NumOps": 0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_AvgTime": 0.0
> },
> ...
> {noformat}
> {panel}
> h2. Testing
> I deployed a YARN build with MultiDispatcher support to the cluster and ran
> the following performance test:
> {code:bash}
> #!/bin/bash
> # Start the load generator on every node
> for i in {1..50};
> do
>     ssh root@$i-node-url 'nohup ./perf.sh 4 1>/dev/null 2>/dev/null &' &
> done
> # Let the load run for 5 minutes
> sleep 300
> # Stop the load generator on every node
> for i in {1..50};
> do
>     ssh root@$i-node-url "pkill -9 -f perf" &
> done
> sleep 5
> echo "DONE"
> {code}
> Each node ran the following perf script:
> {code:bash}
> #!/bin/bash
> # $1 = process-count threshold for the hadoop user
> while true
> do
>     # Submit a new PI job only while the process count is at or below $1
>     if [ "$(ps -o pid= -u hadoop | wc -l)" -le "$1" ]
>     then
>         hadoop jar /opt/hadoop-mapreduce-examples.jar pi 20 20 1>/dev/null 2>&1 &
>     fi
>     sleep 1
> done
> {code}
> This way, in 5 minutes (plus waiting until all jobs finished), I could
> process 332 apps. When I ran the same test with the official build, it took
> 5 minutes just to finish the first app; after that, 221 apps finished.
> I also tested it with LeveldbRMStateStore and ZKRMStateStore and did not
> find any problems with the implementation.
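
The first SpotBugs finding in the Yetus report above (absolute value of a signed 32-bit hashcode, MultiDispatcherExecutor.java line 60) refers to a general Java pitfall: `Math.abs(Integer.MIN_VALUE)` is still negative, so an index derived from `Math.abs(hash) % n` can be negative. The sketch below only illustrates that pitfall and one common remedy, `Math.floorMod`. The class and method names are hypothetical and are not taken from the patch, and whether the patch should use this exact fix is an assumption.

```java
/** Hypothetical sketch, not code from PR #6569: mapping a hash code to a worker slot. */
public class SlotPicker {

    /** Pattern that SpotBugs flags: the result is negative when hash == Integer.MIN_VALUE. */
    static int slotWithAbs(int hash, int slots) {
        return Math.abs(hash) % slots;
    }

    /** Possible remedy: floorMod returns a value in [0, slots) for any int hash when slots > 0. */
    static int slotWithFloorMod(int hash, int slots) {
        return Math.floorMod(hash, slots);
    }

    public static void main(String[] args) {
        int slots = 3; // assumed slot count, chosen only for illustration
        int worstCase = Integer.MIN_VALUE;
        System.out.println("abs-based slot:      " + slotWithAbs(worstCase, slots));
        System.out.println("floorMod-based slot: " + slotWithFloorMod(worstCase, slots));
    }
}
```

With `floorMod`, the computed slot can always be used as an array or list index, at the cost of a slightly different mapping for negative hash codes than the `abs`-based variant.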