Github user klion26 commented on the issue:

    https://github.com/apache/spark/pull/19145
  
    My colleague created an 
[issue](https://issues.apache.org/jira/browse/YARN-7214); I am rewriting the 
description here.
    
    A Spark Streaming application (app1) is running on YARN, and one of app1's 
containers (c1) runs on NM1.
    1. NM1 crashes, and the RM marks NM1 as expired after 10 minutes.
    2. The RM removes all of NM1's containers (in RMNodeImpl), and app1 receives 
a completed message for c1. But the RM cannot tell NM1 that c1 is to be removed, 
because NM1 is lost.
    3. NM1 restarts and registers with the RM (c1 is in the register request), 
but the RM sees that NM1 was lost and does not handle the containers reported by 
NM1.
    4. NM1 does not include c1 in its heartbeats, so c1 is never removed from 
NM1's context.
    5. The RM restarts and NM1 re-registers with it. This time c1 is handled and 
recovered, and the RM sends a c1 completed message to app1's AM. So app1 
receives a duplicated completed message for c1.
    
    
    For the fix:
    1. I changed the code from `completedContainerIdSet.contains(containerId)` 
to `completedContainerIdSet.remove(containerId)` to reclaim the memory. (The 
same container will not be reported as completed more than twice.)
    2. The code I added ignores the duplicated completed messages; ignoring 
them avoids requesting unnecessary replacement containers.
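    The dedup idea above could be sketched roughly like this. This is only an 
illustrative sketch, not the actual patch: the set name 
`completedContainerIdSet` follows the PR, but the `processCompleted` method and 
its signature are hypothetical. It relies on the same assumption as the fix, 
namely that a container is reported completed at most twice, so `remove` can 
both detect the duplicate and reclaim the entry in one call:

    ```scala
    import scala.collection.mutable

    object DuplicateCompletionSketch {
      // IDs of containers whose completion has already been processed once.
      val completedContainerIdSet = mutable.Set[String]()

      // Returns true if this completed message is the first one for the
      // container and should be processed; false if it is a duplicate.
      def processCompleted(containerId: String): Boolean = {
        if (completedContainerIdSet.remove(containerId)) {
          // Duplicate report: `remove` returns true because the id was in the
          // set, and deleting it reclaims the memory (a third report is not
          // expected under the fix's assumption).
          false
        } else {
          // First report: remember the id so a later duplicate can be ignored.
          completedContainerIdSet += containerId
          true
        }
      }
    }
    ```

    Using `remove` instead of `contains` keeps the set from growing without 
bound over the lifetime of a long-running streaming application.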


---
