[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

Jason Lowe (JIRA) Thu, 18 Sep 2014 10:57:21 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139256#comment-14139256
 ]


Jason Lowe commented on YARN-2561:
----------------------------------

We should not assume that the lack of any container status means the NM doesn't 
support restart.  There may have been no containers running on it at the time, 
but it could still have outstanding applications active (e.g., still shuffling 
for the mapreduce_shuffle service), and we definitely don't want AMs to be 
notified of the node removal in that case.

It would be better to check the list of applications on the node.  If there are 
no applications, then there cannot be any containers either.  If the node 
registers with no applications, then it's probably safe to treat the reconnect 
as a removal and re-add, whether or not that node supports NM restart.
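
To make the suggested check concrete, here is a rough sketch of the decision, 
not the actual patch.  The class, enum, and method names are hypothetical, and 
the list of application IDs stands in for whatever the NM reports when it 
registers.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the YARN code base.
final class ReconnectSketch {
    enum Action { REMOVE_AND_READD, KEEP_EXISTING_NODE }

    // runningApps: application IDs the NM reported at registration (assumed input).
    static Action onReconnect(List<String> runningApps) {
        // No applications implies no containers, so treating the reconnect
        // as a removal followed by a re-add cannot disturb any running work.
        if (runningApps.isEmpty()) {
            return Action.REMOVE_AND_READD;
        }
        // Applications may still be active (for example, still shuffling)
        // even when no container statuses were reported, so keep the node
        // and do not notify AMs of a removal.
        return Action.KEEP_EXISTING_NODE;
    }
}
```

The point of keying on applications rather than container statuses is that an 
empty container list alone says nothing about whether the node still matters 
to a running job.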

When a node reconnects with no containers but has the same port, we aren't 
updating its potentially new totalCapability as we did before.
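
As an illustration of the behavior I'd expect, here is a minimal sketch in 
which the capability is refreshed on every same-port reconnect, regardless of 
whether containers were reported.  All types and names are hypothetical, and 
the capability is reduced to memory and vcores for brevity.

```java
// Hypothetical sketch; these types do not exist in the YARN code base.
final class CapabilitySketch {
    static final class Node {
        final int port;
        int memoryMb;
        int vcores;

        Node(int port, int memoryMb, int vcores) {
            this.port = port;
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    // Returns true if the tracked node's capability was changed.
    static boolean refreshCapability(Node tracked, Node reconnecting) {
        // A different port is a different node identity; assumed to be
        // handled elsewhere as a removal and re-add.
        if (tracked.port != reconnecting.port) {
            return false;
        }
        boolean changed = tracked.memoryMb != reconnecting.memoryMb
                || tracked.vcores != reconnecting.vcores;
        // Always adopt the newly reported capability, even when the node
        // reconnected with no containers.
        tracked.memoryMb = reconnecting.memoryMb;
        tracked.vcores = reconnecting.vcores;
        return changed;
    }
}
```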


> MR job client cannot reconnect to AM after NM restart.
> ------------------------------------------------------
>
>                 Key: YARN-2561
>                 URL: https://issues.apache.org/jira/browse/YARN-2561
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Junping Du
>            Priority: Blocker
>         Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, 
> YARN-2561-v4.patch, YARN-2561.patch
>
>
> Work-preserving NM restart is disabled.
> Submit a job, then restart the only NM. The job hangs with connect 
> retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
