[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840637#comment-17840637 ]
ASF GitHub Bot commented on YARN-11656:
---------------------------------------

hadoop-yetus commented on PR #6569:
URL: https://github.com/apache/hadoop/pull/6569#issuecomment-2076287327

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 05s | | No case conflicting files found. |
| +0 :ok: | spotbugs | 0m 00s | | spotbugs executables are not available. |
| +0 :ok: | codespell | 0m 00s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 00s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 01s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 00s | | The patch appears to include 8 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 2m 26s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 89m 23s | | trunk passed |
| +1 :green_heart: | compile | 11m 21s | | trunk passed |
| +1 :green_heart: | checkstyle | 5m 05s | | trunk passed |
| +1 :green_heart: | mvnsite | 11m 06s | | trunk passed |
| +1 :green_heart: | javadoc | 10m 44s | | trunk passed |
| +1 :green_heart: | shadedclient | 159m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 2m 28s | | Maven dependency ordering for patch |
| -1 :x: | mvninstall | 3m 06s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6569/1/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch failed. |
| -1 :x: | compile | 6m 16s | [/patch-compile-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6569/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn.txt) | hadoop-yarn in the patch failed. |
| -1 :x: | javac | 6m 16s | [/patch-compile-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6569/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn.txt) | hadoop-yarn in the patch failed. |
| +1 :green_heart: | blanks | 0m 01s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 3m 03s | | the patch passed |
| -1 :x: | mvnsite | 3m 26s | [/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6569/1/artifact/out/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 :green_heart: | javadoc | 7m 30s | | the patch passed |
| -1 :x: | shadedclient | 85m 23s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | asflicense | 4m 25s | | The patch does not generate ASF License warnings. |
| | | | 375m 07s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| GITHUB PR | https://github.com/apache/hadoop/pull/6569 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | MINGW64_NT-10.0-17763 9bb75f535c07 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys |
| Build tool | maven |
| Personality | /c/hadoop/dev-support/bin/hadoop.sh |
| git revision | trunk / 465a0de4f481ee2dc7383e6fb40b93e32135ba3c |
| Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6569/1/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6569/1/console |
| versions | git=2.44.0.windows.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> RMStateStore event queue blocked
> --------------------------------
>
>                 Key: YARN-11656
>                 URL: https://issues.apache.org/jira/browse/YARN-11656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.1
>            Reporter: Bence Kosztolnik
>            Assignee: Bence Kosztolnik
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: issue.png, log.png
>
>
> h2. Problem statement
> I observed that the Yarn cluster had both pending and available resources, yet cluster utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where each node had 8 cores and plenty of memory (the CPU was the bottleneck).
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist an RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher where events can be persisted by parallel threads
> - create metric data for the RMStateStore event queue, to be able to easily identify the problem if it occurs on a cluster
> {panel:title=Issue visible on UI2}
> !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether too much time is required to store the info for an app after it reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
> !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface. The dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see:
> - how many unhandled events are currently present in the event queue for the specific event type
> - how many events were handled for the specific event type
> - the average execution time for the specific event type
> The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads should execute the events|4|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1000000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged with this frequency (if not zero)|0|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits this many seconds to process the incoming events before terminating them|60|
> |yarn.dispatcher.multi-thread.{}.*metrics-enabled*|Whether the dispatcher should publish metrics data to the metrics system|false|
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
> {
>   "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
>   "modelerType": "Event metrics for rm-state-store",
>   "tag.Context": "yarn",
>   "tag.Hostname": CENSORED,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_APP_Current": 124,
>   "RMStateStoreEventType#STORE_APP_NumOps": 46,
>   "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>   "RMStateStoreEventType#UPDATE_APP_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>   "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.6666666666665,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.6666666666665,
>   "RMStateStoreEventType#REMOVE_APP_Current": 12,
>   "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>   "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#FENCED_Current": 0,
>   "RMStateStoreEventType#FENCED_NumOps": 0,
>   "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_RESERVATION_Current": 0,
>   "RMStateStoreEventType#STORE_RESERVATION_NumOps": 0,
>   "RMStateStoreEventType#STORE_RESERVATION_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_Current": 0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_Current": 0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_NumOps": 0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_AvgTime": 0.0
> },
> ...
> {noformat}
> {panel}
> h2. Testing
> I deployed the MultiDispatcher-supported version of Yarn to the cluster and ran the following performance test:
> {code:bash}
> #!/bin/bash
> for i in {1..50}
> do
>     ssh root@$i-node-url 'nohup ./perf.sh 4 1>/dev/null 2>/dev/null &' &
> done
> sleep 300
> for i in {1..50}
> do
>     ssh root@$i-node-url "pkill -9 -f perf" &
> done
> sleep 5
> echo "DONE"
> {code}
> Each node ran the following perf script:
> {code:bash}
> #!/bin/bash
> while true
> do
>     if [ $(ps -o pid= -u hadoop | wc -l) -le $1 ]
>     then
>         hadoop jar /opt/hadoop-mapreduce-examples.jar pi 20 20 1>/dev/null 2>&1 &
>     fi
>     sleep 1
> done
> {code}
> This way, in 5 minutes (plus the wait until all jobs finished) I could process 332 apps. When I tested the same with the official build, I needed 5 minutes just to finish the first app; after that, 221 apps were finished.
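A minimal sketch of how the Event metrics above could be consumed from the shell. The JSON sample is copied from the JMX output shown above; the curl URL in the comment is an assumption (the RM JMX servlet is conventionally at port 8088, but adjust host and port to your cluster):

```shell
#!/bin/bash
# In a live cluster the JSON would come from the RM JMX servlet, e.g.:
#   curl -s 'http://<rm-host>:8088/jmx?qry=Hadoop:service=ResourceManager,name=Event metrics for rm-state-store'
# (host and port are assumptions; adjust to your cluster).
# Here we use a small sample copied from the metrics shown above.
cat > /tmp/rm-state-store-metrics.json <<'EOF'
"RMStateStoreEventType#STORE_APP_Current": 124,
"RMStateStoreEventType#STORE_APP_NumOps": 46,
"RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
"RMStateStoreEventType#FENCED_Current": 0,
"RMStateStoreEventType#FENCED_NumOps": 0,
EOF

# Print only the event types whose queue backlog ("_Current") is non-zero --
# a quick way to spot a backed-up RMStateStore event queue.
grep '_Current"' /tmp/rm-state-store-metrics.json \
  | awk -F'[:,]' '$2 + 0 > 0 { print $1 ": " $2 + 0 }'
```

A non-zero and growing `_Current` value combined with a high `_AvgTime` for the same event type is exactly the signature described in the problem statement.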
> I also tested it with LeveldbRMStateStore and ZKRMStateStore and did not find any problems with the implementation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org