[ https://issues.apache.org/jira/browse/APEXCORE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044808#comment-16044808 ]
ASF GitHub Bot commented on APEXCORE-743:
-----------------------------------------

GitHub user sandeshh opened a pull request:

    https://github.com/apache/apex-core/pull/543

    APEXCORE-743 Added timeout for the Container kill request sent to NM.

    @PramodSSImmaneni @vrozov please review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sandeshh/apex-core APEXCORE-743

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #543

----
commit 501dfa47517f94aa35d60c4e22ec825e2c99fa27
Author: Sandesh Hegde <sandesh.he...@gmail.com>
Date:   2017-06-01T23:28:56Z

    APEXCORE-743 Added timeout for the Container kill request sent to NM.

----

> Killed container is shown as running
> ------------------------------------
>
>                 Key: APEXCORE-743
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-743
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Sandesh
>
> Here is the behavior
> 1. Container Heartbeat timeout happened
> 2. AppMaster sends the request to kill the container
> 3. Container is killed
> 4. AppMaster state is not updated and no new container was allocated
> After analyzing the code here is the possible reason
> 1. Send the kill request to NM
> 2. Container killed by NM, but NM callback doesn't happen. RecoverContainer
> is called in NM callback, which in this case is not called.
> 3. AppMaster state is not updated
> Possible fix.
> Have a timeout for NM callback, so that if NM doesn't respond that the
> container is killed in time, call the RecoverContainer.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
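
The fix proposed in the issue description (a timeout on the NM callback) could look roughly like the sketch below. It is not the code from pull request #543. The class name, method names, and string container IDs are all hypothetical, and the recovery action is passed in as a plain callback standing in for RecoverContainer. The idea is that recovery runs exactly once per kill request, triggered either by the NM callback or by the timer, whichever comes first.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch only: none of these names come from apex-core.
class KillRequestTimeoutSketch {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    // Containers for which a kill was requested and recovery has not run yet.
    private final Set<String> pendingKills = ConcurrentHashMap.newKeySet();
    private final long timeoutMillis;
    // Stand-in for the RecoverContainer step mentioned in the issue.
    private final Consumer<String> recoverContainer;

    KillRequestTimeoutSketch(long timeoutMillis, Consumer<String> recoverContainer) {
        this.timeoutMillis = timeoutMillis;
        this.recoverContainer = recoverContainer;
    }

    // Call right after the kill request has been sent to the NM.
    void killRequested(String containerId) {
        if (pendingKills.add(containerId)) {
            // Fallback: recover anyway if the NM callback never arrives.
            scheduler.schedule(() -> recoverOnce(containerId),
                timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    // Call from the NM callback that reports the container as stopped.
    void onContainerStopped(String containerId) {
        recoverOnce(containerId);
    }

    // The atomic remove() guarantees at most one recovery per kill request.
    private void recoverOnce(String containerId) {
        if (pendingKills.remove(containerId)) {
            recoverContainer.accept(containerId);
        }
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With this shape, a late NM callback that arrives after the timeout is ignored, because the container ID has already been removed from the pending set.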