[ https://issues.apache.org/jira/browse/MAPREDUCE-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243990#comment-16243990 ]
Jason Lowe edited comment on MAPREDUCE-6659 at 11/8/17 2:08 PM:
----------------------------------------------------------------

Ah, it looks like this would have "just worked" if the AM had not tried to process the NM lost event. The RM probably also sent container completed events for these containers due to the lost NM, but the AM tries to kill the containers anyway. I wonder whether MAPREDUCE-6119 would help here (assuming the AM is configured to ignore node events).

bq. The main difference between cdh5.5 and the trunk is the timeout being really slower in trunk (3 min instead of 30 min at least). This is thanks to patches YARN-4414 and YARN-3554. Backporting those patches can be considered sufficient, what do you think about this?

Did you mean to say "slower than trunk" rather than "slower in trunk"? Sure, if the AM spends a lot less time trying to kill the containers it will never be able to kill, then that also mitigates the issue. Not as efficient as not trying in the first place, but yes, I can see lowering the NM client retries/timeouts as a viable approach.

> Mapreduce App master waits long to kill containers on lost nodes.
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-6659
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6659
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>            Assignee: Nicolas Fraison
>
> The MR Application Master waits a very long time to clean up and relaunch
> tasks on lost nodes. The wait time is actually 2.5 hours:
> (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts
> * ipc.client.connect.timeout = 10 * 45 * 20 = 9000 seconds = 2.5 hours)
> A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
> As in YARN-3809, we may need to introduce new configurations to control
> this RPC retry behavior.
> Also, I feel this total retry time should honor and be capped at the global
> task timeout (mapreduce.task.timeout = 600000 by default).
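For reference, the worst-case wait cited in the description can be checked with a quick sketch. This is only the arithmetic from the issue text (property names and default values are those quoted above), not an implementation of the IPC client's actual retry loop:

```python
def worst_case_connect_wait(max_retries, retries_on_timeouts, connect_timeout_s):
    """Total seconds the IPC client can spend failing to connect to a dead NM,
    per the formula in the issue description: each of the max_retries attempts
    goes through retries_on_timeouts timeouts of connect_timeout_s seconds."""
    return max_retries * retries_on_timeouts * connect_timeout_s

# Defaults cited in the description:
#   ipc.client.connect.max.retries             = 10
#   ipc.client.connect.max.retries.on.timeouts = 45
#   ipc.client.connect.timeout                 = 20 (seconds)
total_s = worst_case_connect_wait(10, 45, 20)
print(total_s, "seconds =", total_s / 3600.0, "hours")  # 9000 seconds = 2.5 hours
```

This also illustrates why the mitigation discussed above works: shrinking any of the three factors (e.g. via AM-specific retry settings, as YARN-3809 did for the RM-AM protocol) cuts the worst-case kill time multiplicatively.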