[ https://issues.apache.org/jira/browse/MAPREDUCE-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gaoyu updated MAPREDUCE-7349:
-----------------------------
    Comment: was deleted

(was: i)

> An unexpected node crash and delayed messages would fail the job
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7349
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.2.2
>            Reporter: gaoyu
>            Priority: Major
>
> Related cluster configuration:
>  * MAX_FETCH_FAILURES_NOTIFICATIONS is 3
>  * NodeManager recovery is disabled
> Bug scenario:
>  # submit a wordcount job that contains 2 simple map tasks ({{map_0}} and
> {{map_1}}) and 1 simple reduce task ({{reduce_0}});
>  # all map tasks finish successfully and the AppMaster is notified;
>  # the NodeManager that ran the map task {{map_1}} crashes;
>  # the AppMaster schedules a reduce attempt;
>  # the reduce attempt sends a {{statusUpdate}} message to the AppMaster to
> report a fetch failure;
>  # the reduce attempt fails with {{Shuffle$ShuffleError}}, caused by
> {{java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out}};
>  # the reduce attempt sends a {{fatalError}} message to the AppMaster;
>  # the AppMaster successively reschedules another three reduce attempts, but
> all of them fail with {{Shuffle$ShuffleError}};
>  # the AppMaster fails the wordcount job because of the failed reduce task;
>  # the AppMaster then receives three delayed {{statusUpdate}} messages
> reporting a fetch failure, like the message in step 5, but it has already
> failed the job and does not rerun the task {{map_1}}.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
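
The ordering problem in the bug scenario above can be replayed with a small toy model. This is only a sketch under stated assumptions, not Hadoop's actual AppMaster code: the class FetchFailureRace, its ToyAppMaster, and the event names are hypothetical. The limit of four reduce attempts is inferred from the report (one initial attempt plus three rescheduled ones). The rule "rerun the map after MAX_FETCH_FAILURES_NOTIFICATIONS reports" is a simplification of the real fetch-failure handling.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the message ordering in the report; not Hadoop's real classes.
class FetchFailureRace {
    // Value taken from the report's cluster configuration.
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
    // Assumption: four reduce attempts in total (one initial attempt plus
    // the three rescheduled attempts described in the report).
    static final int MAX_REDUCE_ATTEMPTS = 4;

    enum Event { STATUS_UPDATE_FETCH_FAILURE, FATAL_ERROR }
    enum JobState { RUNNING, FAILED }

    static class ToyAppMaster {
        JobState state = JobState.RUNNING;
        int fetchFailureReports = 0;
        int failedReduceAttempts = 0;
        boolean mapRerunScheduled = false;

        void handle(Event e) {
            if (state == JobState.FAILED) {
                return; // a job that has already failed ignores late messages
            }
            switch (e) {
                case STATUS_UPDATE_FETCH_FAILURE:
                    fetchFailureReports++;
                    // Simplified rule: enough reports trigger a map rerun.
                    if (fetchFailureReports >= MAX_FETCH_FAILURES_NOTIFICATIONS) {
                        mapRerunScheduled = true;
                    }
                    break;
                case FATAL_ERROR:
                    failedReduceAttempts++;
                    if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                        state = JobState.FAILED;
                    }
                    break;
            }
        }
    }

    // Feeds the events to a fresh ToyAppMaster in the given arrival order.
    static ToyAppMaster replay(List<Event> events) {
        ToyAppMaster am = new ToyAppMaster();
        for (Event e : events) {
            am.handle(e);
        }
        return am;
    }

    // Arrival order from the report: one fetch-failure report, four fatal
    // errors, then the three delayed fetch-failure reports.
    static List<Event> reportedOrder() {
        Event s = Event.STATUS_UPDATE_FETCH_FAILURE;
        Event f = Event.FATAL_ERROR;
        return Arrays.asList(s, f, f, f, f, s, s, s);
    }

    // Hypothetical order in which each report arrives before the attempt's
    // fatal error, up to the third report.
    static List<Event> promptOrder() {
        Event s = Event.STATUS_UPDATE_FETCH_FAILURE;
        Event f = Event.FATAL_ERROR;
        return Arrays.asList(s, f, s, f, s);
    }

    public static void main(String[] args) {
        ToyAppMaster delayed = replay(reportedOrder());
        System.out.println("delayed: state=" + delayed.state
                + ", mapRerunScheduled=" + delayed.mapRerunScheduled);
        ToyAppMaster prompt = replay(promptOrder());
        System.out.println("prompt:  state=" + prompt.state
                + ", mapRerunScheduled=" + prompt.mapRerunScheduled);
    }
}
```

With the reported arrival order, the toy AppMaster has counted only one fetch-failure report when the fourth fatalError arrives, so the job fails and {{map_1}} is never rerun. With the prompt order, the third report arrives while the job is still running and the rerun is scheduled.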