[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243637#comment-16243637
 ] 

Nicolas Fraison edited comment on MAPREDUCE-6659 at 11/8/17 9:58 AM:
---------------------------------------------------------------------

[~jlowe], MAPREDUCE-5465 is already applied to the Hadoop release I use 
(cdh5.5.0).
I've tested the behaviour when a NodeManager is lost on both cdh5.5 and trunk, 
and it is the same.
The RM sends a LostNM event to the AM, which tries to clean up the containers 
running on that node (on cdh5.5 and on trunk). The attempt is marked as failed 
only after the connection to the lost NM times out.
The main difference between cdh5.5 and trunk is that the timeout is much 
shorter in trunk (3 min instead of at least 30 min).
This is thanks to patches YARN-4414 and YARN-3554.
Backporting those patches could be considered sufficient. What do you think?


> Mapreduce App master waits long to kill containers on lost nodes.
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-6659
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6659
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>            Assignee: Nicolas Fraison
>
> The MR Application Master waits a very long time to clean up and relaunch 
> the tasks on lost nodes. The wait time is actually 2.5 hours 
> (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts 
> * ipc.client.connect.timeout = 10 * 45 * 20 = 9000 seconds = 2.5 hours).
> A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
> As was done in YARN-3809, we may need to introduce new configurations to 
> control this RPC retry behavior.
> Also, I feel this total retry time should honor, and be capped at, the global 
> task timeout (mapreduce.task.timeout = 600000 by default).
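
The arithmetic in the description can be checked with a short sketch. It only 
reproduces the reporter's formula (the product of the three values quoted 
above) and the proposed cap at mapreduce.task.timeout; it does not model the 
actual Hadoop IPC retry code. The function names are hypothetical and exist 
only for illustration, and the values are the ones quoted in the issue.

```python
# Values as quoted in the issue description.
IPC_CLIENT_CONNECT_MAX_RETRIES = 10
IPC_CLIENT_CONNECT_MAX_RETRIES_ON_TIMEOUTS = 45
IPC_CLIENT_CONNECT_TIMEOUT_S = 20       # seconds, as used in the description
MAPREDUCE_TASK_TIMEOUT_MS = 600000      # milliseconds


def worst_case_wait_seconds(max_retries, max_retries_on_timeouts, connect_timeout_s):
    """Reporter's estimate: every retry waits for a full connect timeout."""
    return max_retries * max_retries_on_timeouts * connect_timeout_s


def capped_wait_seconds(total_wait_s, task_timeout_ms):
    """Proposed behaviour (hypothetical): never wait longer than the task timeout."""
    return min(total_wait_s, task_timeout_ms // 1000)


total = worst_case_wait_seconds(
    IPC_CLIENT_CONNECT_MAX_RETRIES,
    IPC_CLIENT_CONNECT_MAX_RETRIES_ON_TIMEOUTS,
    IPC_CLIENT_CONNECT_TIMEOUT_S,
)
capped = capped_wait_seconds(total, MAPREDUCE_TASK_TIMEOUT_MS)

print(f"uncapped wait: {total} s ({total / 3600} h)")
print(f"capped at task timeout: {capped} s")
```

With the quoted values, the uncapped estimate is 9000 seconds, while a cap at 
the task timeout would bound the wait to 600 seconds.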



