[ https://issues.apache.org/jira/browse/MAPREDUCE-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243990#comment-16243990 ]

Jason Lowe edited comment on MAPREDUCE-6659 at 11/8/17 2:08 PM:
----------------------------------------------------------------

Ah, looks like this would have "just worked" if the AM had not tried to process 
the NM lost event.  The RM probably also sent container-completed events for 
these containers due to the lost NM, but the AM tries to kill the containers 
anyway.  Wondering if MAPREDUCE-6119 would help here (assuming the AM is 
configured to ignore node events).

bq. The main difference between cdh5.5 and the trunk is the timeout being 
really slower in trunk (3 min instead of 30 min at least). This is thanks to 
patches YARN-4414 and YARN-3554. Backporting those patches can be considered 
sufficient, what do you think about this?

Did you mean to say "slower than trunk" rather than "slower in trunk"?  Sure, 
if the AM spends a lot less time trying to kill containers it will never be 
able to kill, then that also mitigates the issue.  It is not as efficient as 
not trying in the first place, but yes, I can see lowering the NM client 
retries/timeouts as a viable approach.
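
For illustration only, a minimal sketch of what "lowering the NM client 
retries/timeouts" could look like, assuming the stock 
yarn.client.nodemanager-connect.* client settings; the values below are made-up 
examples, not recommendations:

{code:java}
// Illustrative only: tighten how long the AM's NM client keeps retrying a
// NodeManager that is already gone.  Property names are the standard YARN
// client connect settings; the values are arbitrary examples.
import org.apache.hadoop.conf.Configuration;

public class NmClientRetryTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Upper bound on the total time spent retrying a connection to an NM (ms).
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 60000L);
    // Delay between successive connection attempts (ms).
    conf.setLong("yarn.client.nodemanager-connect.retry-interval-ms", 10000L);
    System.out.println("max wait = "
        + conf.getLong("yarn.client.nodemanager-connect.max-wait-ms", -1) + " ms");
    System.out.println("interval = "
        + conf.getLong("yarn.client.nodemanager-connect.retry-interval-ms", -1) + " ms");
  }
}
{code}

In practice these would be set in yarn-site.xml rather than in code; the point 
is only that a smaller max-wait bounds how long a kill attempt against a lost 
NM can stall.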



> Mapreduce App master waits long to kill containers on lost nodes.
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-6659
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6659
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>            Assignee: Nicolas Fraison
>
> MR Application master waits for a very long time to clean up and relaunch the 
> tasks on lost nodes. The wait time is actually 2.5 hours 
> (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts 
> * ipc.client.connect.timeout = 10 * 45 * 20 = 9000 seconds = 2.5 hours).
> A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
> As was done in YARN-3809, we may need to introduce new configurations to 
> control this RPC retry behavior.
> Also, I feel this total retry time should honor, and be capped at, the global 
> task timeout (mapreduce.task.timeout = 600000 ms by default).
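
As a purely illustrative check of the arithmetic quoted above (the defaults are 
taken from the issue text; capping at mapreduce.task.timeout is the issue's 
proposal, not existing behavior):

{code:java}
// Back-of-the-envelope check of the worst-case wait described in the issue.
public class LostNodeWaitMath {
  public static void main(String[] args) {
    int maxRetries = 10;          // ipc.client.connect.max.retries
    int retriesOnTimeouts = 45;   // ipc.client.connect.max.retries.on.timeouts
    int connectTimeoutSec = 20;   // ipc.client.connect.timeout (20000 ms)
    int totalSec = maxRetries * retriesOnTimeouts * connectTimeoutSec;
    System.out.printf("worst-case wait: %d s = %.1f hours%n",
        totalSec, totalSec / 3600.0);                      // 9000 s = 2.5 hours
    int taskTimeoutSec = 600;     // mapreduce.task.timeout = 600000 ms
    // The issue proposes capping the retry window at the task timeout:
    System.out.println("proposed cap: "
        + Math.min(totalSec, taskTimeoutSec) + " s");
  }
}
{code}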


