Bence Kosztolnik created YARN-11656:
---------------------------------------
             Summary: RMStateStore event queue blocked
                 Key: YARN-11656
                 URL: https://issues.apache.org/jira/browse/YARN-11656
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
    Affects Versions: 3.4.1
            Reporter: Bence Kosztolnik
         Attachments: issue.png

I observed that a YARN cluster had both pending and available resources, yet cluster utilization usually stayed around 50%.
The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers. It had 50 nodes, and each node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized that the RM had an IO bottleneck and needed 1 to 20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads
- create metrics for the RMStateStore event queue, so the problem can be identified easily if it occurs on a cluster

{panel:title=Issue visible on UI2}
{panel}
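The two proposals above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual RMStateStore code or the eventual patch: the class `ParallelStoreDispatcher`, its methods, and the per-key routing are hypothetical names and design choices. It assumes that events belonging to the same key (for example, one application) must still be persisted in order, so each key is pinned to one single-threaded worker while different keys proceed in parallel. The pending counter stands in for the proposed queue metric.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not the real RMStateStore dispatcher. */
final class ParallelStoreDispatcher {
    // One single-threaded executor per worker keeps per-key ordering.
    private final ExecutorService[] workers;
    // Number of events queued or currently being persisted (the "queue metric").
    private final AtomicInteger pending = new AtomicInteger();

    ParallelStoreDispatcher(int threads) {
        workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** Queues a store operation; operations with the same key run in submission order. */
    void dispatch(String key, Runnable storeOp) {
        pending.incrementAndGet();
        int idx = Math.floorMod(key.hashCode(), workers.length);
        workers[idx].execute(() -> {
            try {
                storeOp.run();
            } finally {
                pending.decrementAndGet();
            }
        });
    }

    /** Value that a metrics system could expose as a gauge. */
    int getPendingEvents() {
        return pending.get();
    }

    /** Stops accepting events and waits for queued ones; returns true if all finished in time. */
    boolean shutdownAndWait(long timeoutMs) {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            for (ExecutorService w : workers) {
                long remaining = Math.max(deadline - System.nanoTime(), 0L);
                if (!w.awaitTermination(remaining, TimeUnit.NANOSECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, one slow write delays only the events routed to the same worker, and a steadily growing `getPendingEvents()` value would point to the state store as the bottleneck.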