[ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824960#comment-17824960
 ] 

ASF GitHub Bot commented on YARN-11656:
---------------------------------------

hadoop-yetus commented on PR #6569:
URL: https://github.com/apache/hadoop/pull/6569#issuecomment-1986880158

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 8 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m  8s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  36m 50s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m 56s |  |  trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   7m 41s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   2m  8s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m  0s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m  3s |  |  trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 53s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   4m 19s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  40m 11s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m  3s |  |  the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   8m  3s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 26s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   8m 26s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   2m  3s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/10/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt) |  hadoop-yarn-project/hadoop-yarn: The patch generated 20 new + 95 unchanged - 3 fixed = 115 total (was 98)  |
   | +1 :green_heart: |  mvnsite  |   2m  5s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 53s |  |  the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 59s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   4m 33s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  41m 52s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   5m 40s |  |  hadoop-yarn-common in the patch passed.  |
   | +1 :green_heart: |  unit  | 108m 43s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 57s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 313m  2s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/10/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6569 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 2c6393f94f3c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 465a0de4f481ee2dc7383e6fb40b93e32135ba3c |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/10/testReport/ |
   | Max. process+thread count | 1597 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/10/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> RMStateStore event queue blocked
> --------------------------------
>
>                 Key: YARN-11656
>                 URL: https://issues.apache.org/jira/browse/YARN-11656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.1
>            Reporter: Bence Kosztolnik
>            Assignee: Bence Kosztolnik
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: issue.png, log.png
>
>
> h2. Problem statement
>  
> I observed that a YARN cluster had both pending and available resources, 
> yet cluster utilization usually stayed around 50%. The cluster was loaded 
> with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each 
> configured with 20 map and 20 reduce containers, on a 50-node cluster where 
> each node had 8 cores and a lot of memory (CPU was the bottleneck).
> Finally, I realized the RM had an IO bottleneck and needed 1 to 20 seconds 
> to persist an RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher where events can be persisted in parallel threads
> - create metrics for the RMStateStore event queue so the problem is easy to 
> identify if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether storing the app info 
> takes too long after the app reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
>  !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
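To make the three per-event-type values concrete, here is a minimal Python sketch of how such counters could be maintained. It is an illustration only, not the MultiDispatcher code from the patch: the class name `EventMetrics` and its methods are hypothetical, and it assumes that an event stops counting as "unhandled" only once its handler has finished.

```python
from collections import defaultdict


class EventMetrics:
    """Toy per-event-type counters (hypothetical, not the patch's class)."""

    def __init__(self):
        self.current = defaultdict(int)     # events queued but not yet handled
        self.num_ops = defaultdict(int)     # events handled so far
        self.total_ms = defaultdict(float)  # summed handling time, for the average

    def enqueued(self, event_type):
        self.current[event_type] += 1

    def handled(self, event_type, elapsed_ms):
        # Assumption: the event leaves the "current" count when handling ends.
        self.current[event_type] -= 1
        self.num_ops[event_type] += 1
        self.total_ms[event_type] += elapsed_ms

    def avg_time(self, event_type):
        ops = self.num_ops[event_type]
        return self.total_ms[event_type] / ops if ops else 0.0


metrics = EventMetrics()
for _ in range(3):
    metrics.enqueued("STORE_APP")
metrics.handled("STORE_APP", 3000.0)
metrics.handled("STORE_APP", 1000.0)
print(metrics.current["STORE_APP"],
      metrics.num_ops["STORE_APP"],
      metrics.avg_time("STORE_APP"))
```

A growing "current" value together with a high average time for one event type is the pattern the problem statement describes.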
> The dispatcher has the following configs (the {} placeholder stands for the 
> dispatcher name, for example rm-state-store):
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that execute events in parallel|4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the execution threads scale up to this many|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads are destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged at this interval in seconds (if not zero)|30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits this many seconds to process the incoming events before terminating them|60|
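As an illustration of how the placeholder expands, the following Python sketch builds the full property names for one dispatcher. It assumes that the asterisks in the table are only Jira bold markup and that `{}` is replaced verbatim by the dispatcher name; the helper `dispatcher_config` is hypothetical and not part of Hadoop.

```python
# Suffixes and defaults are copied from the table above.
# The key template is an assumption about how the {} placeholder expands.
DEFAULTS = {
    "default-pool-size": 4,
    "max-pool-size": 8,
    "keep-alive-seconds": 10,
    "queue-size": 1_000_000,
    "monitor-seconds": 30,
    "graceful-stop-seconds": 60,
}


def dispatcher_config(name, overrides=None):
    """Return full property names mapped to values for one dispatcher."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    values = {**DEFAULTS, **overrides}
    return {
        f"yarn.dispatcher.multi-thread.{name}.{key}": value
        for key, value in values.items()
    }


conf = dispatcher_config("rm-state-store", {"max-pool-size": 16})
for key, value in sorted(conf.items()):
    print(key, "=", value)
```

Under these assumptions, the pool size for the state store dispatcher would be set through `yarn.dispatcher.multi-thread.rm-state-store.default-pool-size`.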
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
>     {
>       "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
>       "modelerType": "Event metrics for rm-state-store",
>       "tag.Context": "yarn",
>       "tag.Hostname": "CENSORED",
>       "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>       "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>       "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>       "RMStateStoreEventType#STORE_APP_Current": 124,
>       "RMStateStoreEventType#STORE_APP_NumOps": 46,
>       "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>       "RMStateStoreEventType#UPDATE_APP_Current": 31,
>       "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>       "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.6666666666665,
>       "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>       "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>       "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.6666666666665,
>       "RMStateStoreEventType#REMOVE_APP_Current": 12,
>       "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>       "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>       "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>       "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>       "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>       "RMStateStoreEventType#FENCED_Current": 0,
>       "RMStateStoreEventType#FENCED_NumOps": 0,
>       "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>       "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>       "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>       "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>       "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
>       "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
>       "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
>       "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
>       "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
>       "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
>       "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
>       "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
>       "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
>       "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
>       "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
>       "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
>       "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
>       "RMStateStoreEventType#UPDATE_AMRM_TOKEN_NumOps": 0,
>       "RMStateStoreEventType#UPDATE_AMRM_TOKEN_AvgTime": 0.0,
>       "RMStateStoreEventType#STORE_RESERVATION_Current": 0,
>       "RMStateStoreEventType#STORE_RESERVATION_NumOps": 0,
>       "RMStateStoreEventType#STORE_RESERVATION_AvgTime": 0.0,
>       "RMStateStoreEventType#REMOVE_RESERVATION_Current": 0,
>       "RMStateStoreEventType#REMOVE_RESERVATION_NumOps": 0,
>       "RMStateStoreEventType#REMOVE_RESERVATION_AvgTime": 0.0,
>       "RMStateStoreEventType#STORE_PROXY_CA_CERT_Current": 0,
>       "RMStateStoreEventType#STORE_PROXY_CA_CERT_NumOps": 0,
>       "RMStateStoreEventType#STORE_PROXY_CA_CERT_AvgTime": 0.0
>     },
> ...
> {noformat}
> {panel}
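The flat metric keys in the JMX output can be grouped by event type to spot which queue is backing up. The sketch below is one possible way to do that in Python, using a shortened copy of the bean shown above. The function names `summarize` and `backlog` are hypothetical, and the key layout (`RMStateStoreEventType#<EVENT>_<Metric>`) is inferred from this single example output.

```python
import json

PREFIX = "RMStateStoreEventType#"
SUFFIXES = ("_Current", "_NumOps", "_AvgTime")


def summarize(bean):
    """Group the flat metric keys of one bean by event type."""
    summary = {}
    for key, value in bean.items():
        if not key.startswith(PREFIX):
            continue  # skip name, modelerType, tags
        for suffix in SUFFIXES:
            if key.endswith(suffix):
                event = key[len(PREFIX):-len(suffix)]
                summary.setdefault(event, {})[suffix[1:]] = value
    return summary


def backlog(bean):
    """Event types that still have queued events, largest queue first."""
    rows = [(event, m["Current"])
            for event, m in summarize(bean).items()
            if m.get("Current", 0) > 0]
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = json.loads("""
{
  "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
  "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_APP_Current": 124,
  "RMStateStoreEventType#STORE_APP_NumOps": 46,
  "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
  "RMStateStoreEventType#FENCED_Current": 0
}
""")
print(backlog(sample))
```

In the example output above, STORE_APP has the largest backlog (124 queued) and the highest average time, which matches the slow state-store writes described in the problem statement.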
> h2. Testing
> I deployed a YARN build with MultiDispatcher support to the cluster and ran 
> the following performance test:
> {code:bash}
> #!/bin/bash
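> # Start perf.sh on each of the 50 nodes with a process threshold of 4.
> for i in {1..50}; 
> do
>       ssh "root@${i}-node-url" 'nohup ./perf.sh 4 1>/dev/null 2>/dev/null &' &
> done
> # Let the load run for 5 minutes.
> sleep 300
> # Stop the load generators on all nodes.
> for i in {1..50}; 
> do
>       ssh "root@${i}-node-url" "pkill -9 -f perf" &
> done
> sleep 5
> echo "DONE"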
> {code}
> Each node ran the following perf script:
> {code:bash}
> #!/bin/bash
> # Keep submitting PI jobs while the hadoop user has at most $1 processes.
> while true
> do
>     if [ "$(ps -o pid= -u hadoop | wc -l)" -le "$1" ]
>     then
>         hadoop jar /opt/hadoop-mapreduce-examples.jar pi 20 20 1>/dev/null 2>&1 &
>     fi
>     sleep 1
> done
> {code}
> This way, in 5 minutes (plus the wait until all jobs finished) I could 
> process 332 apps.
> When I ran the same test with the official build, it took 5 minutes just to 
> finish the first app; after that, 221 apps were finished.
> I also tested it with LeveldbRMStateStore and ZKRMStateStore and did not 
> find any problems with the implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
