[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526640#comment-16526640 ]
Botong Huang commented on YARN-8451:
------------------------------------

Hi [~jlowe], I am actually not changing this behavior (not blocking the dispatcher for resync); the existing code already creates a new thread for it. I think the reason is that resync involves draining the existing heartbeat thread and making a register call to the RM, which can take a long time (say the network is slow, or the RM is down during a master-slave switch). We don't want to block the entire NM for this. It may be much more involved if we want to change this behavior.

> Multiple NM heartbeat thread created when a slow NM resync with RM
> ------------------------------------------------------------------
>
>                 Key: YARN-8451
>                 URL: https://issues.apache.org/jira/browse/YARN-8451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Major
>         Attachments: YARN-8451.v1.patch
>
>
> During an NM resync with the RM (say the RM did a master-slave switch), if the NM is
> running slowly, more than one RESYNC event may be put into the NM dispatcher by
> the existing heartbeat thread before they are processed. As a result,
> multiple new heartbeat threads are later created and start to heartbeat to the RM
> concurrently, each with its own responseId. If at some point one thread
> falls more than one step behind the others, the RM will send back a resync signal
> in its heartbeat response, killing all containers on this NM.
> See comments below for details on how this can happen.
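
One way to keep duplicate RESYNC events from creating more than one heartbeat thread, while still keeping the resync work off the dispatcher thread, is a compare-and-set guard. The following is only a minimal sketch under assumptions, not the NodeManager code or the YARN-8451 patch: `ResyncGuard`, `onResyncEvent`, and `heartbeatThreadsStarted` are hypothetical names, and the `Runnable` argument stands in for draining the old heartbeat thread and re-registering with the RM.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: coalesces duplicate RESYNC events into one resync. */
class ResyncGuard {
    // True while a resync thread is running; duplicates are dropped meanwhile.
    private final AtomicBoolean resyncInProgress = new AtomicBoolean(false);
    // Counts how many times a new heartbeat loop would have been started.
    private final AtomicInteger heartbeatThreadsStarted = new AtomicInteger();

    /**
     * Called on the dispatcher thread for each RESYNC event. Returns
     * immediately so the dispatcher is never blocked. Returns the resync
     * thread that was started, or null if the event was dropped as a
     * duplicate of a resync that is still in progress.
     */
    Thread onResyncEvent(Runnable drainAndReregister) {
        if (!resyncInProgress.compareAndSet(false, true)) {
            return null; // another resync is already running
        }
        Thread resyncThread = new Thread(() -> {
            try {
                // Placeholder for the slow part: drain the old heartbeat
                // thread and register with the RM again.
                drainAndReregister.run();
                // Placeholder for starting the single new heartbeat thread.
                heartbeatThreadsStarted.incrementAndGet();
            } finally {
                // Allow a later, genuinely new RESYNC to be handled.
                resyncInProgress.set(false);
            }
        }, "resync-sketch");
        resyncThread.setDaemon(true);
        resyncThread.start();
        return resyncThread;
    }

    int heartbeatThreadsStarted() {
        return heartbeatThreadsStarted.get();
    }
}
```

With this shape, several RESYNC events queued back to back during a slow re-registration result in a single new heartbeat loop, so only one responseId sequence is sent to the RM. A RESYNC that arrives after the first resync has finished is still handled normally.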