[ https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171396#comment-16171396 ]
zhangshilong commented on YARN-7214: ------------------------------------ !screenshot-1.png! generally, 1、 NM complete one container(c) and send to RM 2、RM sent c to AM, tell AM c is completed. 3、RM sent c to NM, tell NM c can be removed from NM. If RM restart before step 3, c will be in in context of NM for ever. If RM restart again, c will be duplicated container completed to AM. > duplicated container completed To AM > ------------------------------------ > > Key: YARN-7214 > URL: https://issues.apache.org/jira/browse/YARN-7214 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.7.1, 3.0.0-alpha3 > Environment: hadoop 2.7.1 rm recovery and nm recovery enabled > Reporter: zhangshilong > Attachments: screenshot-1.png > > > env: hadoop 2.7.1 with rm recovery and nm recovery enabled > case: > spark app(app1) running least one container(named c1) in NM1. > 1、NM1 crashed,and RM found NM1 expired in 10 minutes. > 2、RM will remove all containers in NM1(RMNodeImpl). and app1 will receive > c1 completed message.But RM can not send c1(to be removed) to NM1 because NM1 > lost. > 3、NM1 restart and register with RM(c1 in register request),but RM found NM1 > is lost and will not handle containers from NM1. > 4、NM1 will not heartbeat with c1(c1 not in heartbeat request). So c1 will > not removed from context of NM1. > 5、 RM restart, NM1 re register with RM。And c1 will be handled and recovered. > RM will send c1 complted message to AM of app1. So, app1 received duplicated > c1. > once spark AM receive one container completed from RM, it will allocate one > new container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org