[ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953076#comment-14953076
 ] 

Jason Lowe commented on YARN-4254:
----------------------------------

Thanks for the report and patch, Bibin!

The patch seems to be trying to fix a very specific failure mode, but in 
practice it will lead to a lot of AM attempt failures, which isn't ideal.  Would 
it make more sense if the RM simply refused to accept unresolvable nodemanagers 
into the cluster?  Also, the fact that we retry forever seems broken to me.  We 
should be giving up at some point and failing the attempt, whether that be due 
to unknown host exceptions or other persistent errors.  Checking specifically 
for unknown host exceptions makes me think we'll just hit this type of problem 
again, but for some other persistent error.
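
To illustrate the "give up at some point" idea, here is a rough sketch of a 
bounded retry that fails after a fixed number of attempts no matter which 
exception keeps occurring.  This is not the RM's launcher code: the class 
BoundedRetry, the method callWithRetries, the exception 
RetriesExhaustedException, and the host name "new-nm-host" are all hypothetical 
names made up for the example.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop: retries an action a bounded
// number of times instead of looping forever.
public class BoundedRetry {

    /** Thrown once the retry budget is spent; wraps the last failure. */
    public static class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(int attempts, Exception cause) {
            super("giving up after " + attempts + " attempts", cause);
        }
    }

    /**
     * Runs the action up to maxAttempts times, sleeping between failures.
     * Any exception type counts against the budget, so a persistent error
     * other than UnknownHostException is bounded in the same way.
     */
    public static <T> T callWithRetries(Callable<T> action, int maxAttempts,
                                        long sleepMillis)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw new RetriesExhaustedException(maxAttempts, last);
    }

    public static void main(String[] args) throws Exception {
        try {
            // Simulates a launch that always fails because the host
            // cannot be resolved.
            callWithRetries(() -> {
                throw new UnknownHostException("new-nm-host");
            }, 3, 10L);
        } catch (RetriesExhaustedException e) {
            // At this point the caller would fail the attempt rather
            // than keep retrying.
            System.out.println(e.getMessage() + ": " + e.getCause());
        }
    }
}
```

With something along these lines the caller decides what to do once the budget 
is exhausted (for example, fail the attempt so that the next one can land on a 
different node), and no special case for unknown host exceptions is needed.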



> ApplicationAttempt stuck for ever due to UnknowHostexception
> ------------------------------------------------------------
>
>                 Key: YARN-4254
>                 URL: https://issues.apache.org/jira/browse/YARN-4254
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: 0001-YARN-4254.patch
>
>
> Scenario
> =======
> 1. RM HA is configured, and 5 NMs are available in the cluster and working fine.
> 2. Add one more NM to the same cluster, but do not update /etc/hosts on the RM.
> 3. Submit an application to the same cluster.
> If the AM gets allocated to the newly added NM, the *application attempt will 
> get stuck forever*. The user will not know why this happened.
> Impact
> 1. RM logs get flooded with exceptions.
> 2. The application gets stuck forever.
> Handling suggestion: YARN-261 allows failing an application attempt.
> If we fail the attempt, the next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
