[ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332076#comment-17332076 ]
Peter Bacsko commented on YARN-10739:
-------------------------------------

Thanks for the patch [~zhuqi]. I have some comments:

1. {{PrintEventDetailsService #%d}} - I think it's better to call it {{PrintEventDetailsThread #%d}}.

2. Variable {{printEventDetailsService}} - same here, {{printEventDetailsExecutor}} sounds better.

3. {{printEventDetailsService.allowCoreThreadTimeOut(true);}} --> there is just one core thread. I think it's fine if we don't allow it to time out, so I suggest setting this to "false" (which is the default).

4. {{printEventDetailsService.shutdown();}} -- since we're shutting it down in {{serviceStop()}}, let's call {{shutdownNow()}}, which is safer. Don't wait for printing.

5. Tracing log:
{noformat}
// For test
if (LOG.isTraceEnabled()) {
  LOG.trace("Event type: " + entry.getKey() + " printed.");
}
{noformat}
I know that this is for testing, but still, this affects production code. Trace level already floods the logs with everything. I don't think we should print this, even on TRACE. It's not a huge issue if it is not tested.

> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10739
>                 URL: https://issues.apache.org/jira/browse/YARN-10739
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>   Affects Versions: 3.4.0, 3.3.1, 3.2.3
>            Reporter: Zhanqi Cai
>            Assignee: Qi Zhu
>            Priority: Critical
>         Attachments: YARN-10739-001.patch, YARN-10739-002.patch,
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to
> AsyncDispatcher. If the event queue is too large, printEventQueueDetails
> takes too much time and the RM takes a long time to process events.
> For example:
> If we have 4K nodes in the cluster and 4K apps running, and we do a switch,
> the node managers will register with the RM, and the RM will call
> NodesListManager to send RMAppNodeUpdateEvent, with code like below:
> {code:java}
> for (RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K * 4K = 16 million. During this window,
> GenericEventHandler.printEventQueueDetails is called frequently to print the
> event queue details. Once the event queue size reaches 1 million+, iterating
> over the queue in printEventQueueDetails becomes very slow, see below:
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> As a result, RM recovery takes too much time.
> Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 12000000
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
> {code}
> Between the AsyncDispatcher.handle log line and the printEventQueueDetails
> output, more than 1s is spent iterating over the queue.
> I uploaded a patch file to ensure that printEventQueueDetails is called only
> once per 30s.
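
The review points about the executor, together with the once-per-30s limit mentioned in the description, can be illustrated with a small sketch. This is not the code from the attached patches, which are not shown here. It only assumes a plain {{java.util.concurrent.ThreadPoolExecutor}} with a single core thread. The names {{printEventDetailsExecutor}} and {{PrintEventDetailsThread #%d}} come from the comment above. The class name, {{maybePrint}}, the explicit clock argument and the daemon flag are hypothetical and chosen for illustration only.

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the actual AsyncDispatcher code.
class PrintEventDetailsSketch {
  // Minimum gap between two prints (30s, as stated in the description).
  static final long PRINT_INTERVAL_MS = 30_000L;

  private final ThreadPoolExecutor printEventDetailsExecutor;
  private boolean printedOnce = false;
  private long lastPrintMs = 0L;

  PrintEventDetailsSketch() {
    final AtomicInteger counter = new AtomicInteger(0);
    // Name the worker thread as suggested in review point 1.
    ThreadFactory factory = runnable -> {
      Thread t = new Thread(runnable, String.format(
          "PrintEventDetailsThread #%d", counter.getAndIncrement()));
      t.setDaemon(true); // assumption: printing should not block JVM exit
      return t;
    };
    // One core thread; allowCoreThreadTimeOut is left at its default
    // (false), as suggested in review point 3.
    printEventDetailsExecutor = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>(), factory);
  }

  // Submits printTask to the background thread unless a print was already
  // submitted less than PRINT_INTERVAL_MS ago. Returns true if submitted.
  // The caller passes the current time so the logic is easy to test.
  synchronized boolean maybePrint(long nowMs, Runnable printTask) {
    if (printedOnce && nowMs - lastPrintMs < PRINT_INTERVAL_MS) {
      return false;
    }
    printEventDetailsExecutor.execute(printTask);
    printedOnce = true;
    lastPrintMs = nowMs;
    return true;
  }

  // Review point 4: shutdownNow() interrupts the running task and returns
  // the queued tasks that never started, instead of waiting for them.
  List<Runnable> stop() {
    return printEventDetailsExecutor.shutdownNow();
  }

  boolean allowsCoreThreadTimeOut() {
    return printEventDetailsExecutor.allowsCoreThreadTimeOut();
  }

  boolean isShutdown() {
    return printEventDetailsExecutor.isShutdown();
  }
}
```

With this shape, the thread that handles events only pays for a timestamp comparison and a task submission. The expensive iteration over the queue would run on the named background thread, and {{stop()}} would not wait for a print that is still in progress.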