[ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qi Zhu updated YARN-10739:
--------------------------
    Attachment: YARN-10739.006.patch

> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10739
>                 URL: https://issues.apache.org/jira/browse/YARN-10739
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 3.4.0, 3.3.1, 3.2.3
>            Reporter: Zhanqi Cai
>            Assignee: Qi Zhu
>            Priority: Critical
>         Attachments: YARN-10739-001.patch, YARN-10739-002.patch, YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch, YARN-10739.005.patch, YARN-10739.006.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to AsyncDispatcher. If the event queue is very large, printEventQueueDetails takes too much time and the RM takes a long time to process events.
> For example:
> Suppose the cluster has 4K nodes and 4K running apps. When we do a switch, the node managers register with the RM, and the RM calls NodesListManager to send an RMAppNodeUpdateEvent for each app, with code like below:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K * 4K = 16 million. During this window, GenericEventHandler.printEventQueueDetails is called frequently to print the event queue details. Once the event queue holds more than 1 million events, iterating over the queue in printEventQueueDetails becomes very slow, as in the code below:
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> As a result, RM recovery takes too much time.
> Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 12000000
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
> {code}
> Between the AsyncDispatcher.handle log line and the first printEventQueueDetails log line, more than 1s is spent iterating over the queue.
> I uploaded a patch to ensure that printEventQueueDetails is called at most once every 30s.
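For illustration only, the once-per-30s limit mentioned above could be sketched roughly as follows. This is not the code from the attached patches: ThrottledQueueReporter, tryAcquire, countByType and DemoEventType are hypothetical names, and the current time is passed in as a parameter only to keep the sketch easy to test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical event types standing in for the real YARN event enums. */
enum DemoEventType { KILL, NODE_UPDATE, NODE_REMOVED, APP_REMOVED }

/** Hypothetical helper; an assumption for illustration, not the patch itself. */
class ThrottledQueueReporter {
  private final long minIntervalMs;
  // Time of the last accepted report; Long.MIN_VALUE means "never reported".
  private final AtomicLong lastReportMs = new AtomicLong(Long.MIN_VALUE);

  ThrottledQueueReporter(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /**
   * Returns true if the caller may run the expensive queue scan now.
   * At most one caller succeeds per interval, even with concurrent callers,
   * because only one compareAndSet on the same old value can win.
   */
  boolean tryAcquire(long nowMs) {
    long last = lastReportMs.get();
    if (last != Long.MIN_VALUE && nowMs - last < minIntervalMs) {
      return false; // still inside the quiet period, skip the scan
    }
    return lastReportMs.compareAndSet(last, nowMs);
  }

  /** The O(n) per-type count over the queue that should be rate limited. */
  static <E extends Enum<E>> Map<E, Long> countByType(Iterable<E> queuedTypes) {
    Map<E, Long> counts = new HashMap<>();
    for (E type : queuedTypes) {
      counts.merge(type, 1L, Long::sum);
    }
    return counts;
  }
}
```

A caller would check tryAcquire with a monotonic clock value before scanning the queue, so a backlog of millions of events triggers at most one full iteration per interval instead of one per logging threshold hit.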