[ https://issues.apache.org/jira/browse/MAPREDUCE-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982981#comment-16982981 ]
Wilfred Spiegelenburg commented on MAPREDUCE-7240:
--------------------------------------------------

I checked the PRs that are linked to this jira. Jason gave a +1 on the trunk version in [PR #1674|https://github.com/apache/hadoop/pull/1674]. If your patch follows that change we should be good to go.

+1 (non-binding)

For the concern raised in this [comment|https://issues.apache.org/jira/browse/MAPREDUCE-7240?focusedCommentId=16982254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16982254]: if the container ignores the newly raised event, the AM needs to handle that as it normally would. The main issue in the current code is that it does not handle the fetch failure event, so an {{InvalidStateTransitionException}} is raised, which causes the job to fail. After the change the event is handled, and the job should continue and finish processing. The job can still fail for the usual reasons, but a single too-many-fetch-failures event no longer causes the job to fail immediately.

> Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7240
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7240
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.8.2
>            Reporter: luhuachao
>            Assignee: luhuachao
>            Priority: Critical
>              Labels: kerberos
>         Attachments: MAPREDUCE-7240-001.patch, application_1566552310686_260041.log
>
>
> *log in appmaster*
> {noformat}
> 2019-09-03 17:18:43,090 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000052_0 ...
> raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000049_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000051_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000050_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000053_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,092 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1566552310686_260041_m_000052_0 transitioned from state SUCCEEDED to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,092 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1566552310686_260041_m_000049_0
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1206)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:146)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1458)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1450)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> 2019-09-03 17:18:43,093 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1566552310686_260041_m_000051_0
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1206)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:146)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1458)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1450)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> 2019-09-03 17:18:43,093 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1566552310686_260041_m_000050_0 transitioned from state SUCCEEDED to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,093 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1566552310686_260041_m_000053_0 transitioned from state SUCCEEDED to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,094 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1566552310686_260041_m_000052 Task Transitioned from SUCCEEDED to SCHEDULED
> 2019-09-03 17:18:43,096 FATAL [IPC Server handler 27 on 35972] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1566552310686_260041_r_000005_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#22
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1961)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:367)
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:289)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:355)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2019-09-03 17:18:43,096 INFO [IPC Server handler 27 on 35972] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1566552310686_260041_r_000005_0: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#22
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1961)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:367)
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:289)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:355)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2019-09-03 17:18:43,097 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1566552310686_260041Job Transitioned from RUNNING to ERROR
> 2019-09-03 17:18:43,099 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job
> {noformat}
>
> The nodemanager's log is similar to the log in MAPREDUCE-6869.
> The code in TaskAttemptImpl indicates that the invalid event TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER causes the job state to turn into ERROR. What confuses me is:
> # What causes the appmaster to handle the TA_TOO_MANY_FETCH_FAILURE event in SUCCESS_FINISHING_CONTAINER, where it is an illegal event for that state? Some other attempts successfully transitioned from state SUCCEEDED to FAILED on the TA_TOO_MANY_FETCH_FAILURE event.
> # Restarting the nodemanager resolves the error in the NM, and the shuffle error is fixed too. What causes this?
> Correct me if I am wrong.
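
The failure mode discussed in this issue can be illustrated with a toy table-driven state machine. This is only a sketch and not the Hadoop code: {{FetchFailureSketch}}, {{ToyStateMachine}}, {{InvalidTransition}}, {{unpatched}} and {{patched}} are hypothetical names invented for the example, and the choice of FAILED as the target state is an assumption that mirrors the SUCCEEDED to FAILED transitions in the log above. The actual change is the one in the linked PR and may differ.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: every type here is hypothetical and is NOT a Hadoop class.
class FetchFailureSketch {
    enum State { SUCCEEDED, SUCCESS_FINISHING_CONTAINER, FAILED }
    enum Event { TA_TOO_MANY_FETCH_FAILURE }

    /** Stand-in for an invalid-transition error (hypothetical name). */
    static class InvalidTransition extends RuntimeException {
        InvalidTransition(String message) { super(message); }
    }

    /** Minimal table-driven state machine: (state, event) -> next state. */
    static class ToyStateMachine {
        private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
        private State current;

        ToyStateMachine(State initial) { this.current = initial; }

        ToyStateMachine addTransition(State from, Event on, State to) {
            table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
            return this;
        }

        /** Applies the event, or throws if no transition is registered for it. */
        State handle(Event event) {
            Map<Event, State> row = table.get(current);
            State next = (row == null) ? null : row.get(event);
            if (next == null) {
                throw new InvalidTransition("Invalid event: " + event + " at " + current);
            }
            current = next;
            return current;
        }
    }

    /** Table that only knows the event in SUCCEEDED, as in the reported behaviour. */
    static ToyStateMachine unpatched(State initial) {
        return new ToyStateMachine(initial)
            .addTransition(State.SUCCEEDED, Event.TA_TOO_MANY_FETCH_FAILURE, State.FAILED);
    }

    /** Same table plus a transition for SUCCESS_FINISHING_CONTAINER (assumed target: FAILED). */
    static ToyStateMachine patched(State initial) {
        return unpatched(initial)
            .addTransition(State.SUCCESS_FINISHING_CONTAINER,
                           Event.TA_TOO_MANY_FETCH_FAILURE, State.FAILED);
    }

    public static void main(String[] args) {
        try {
            unpatched(State.SUCCESS_FINISHING_CONTAINER).handle(Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (InvalidTransition e) {
            // prints "unpatched: Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER"
            System.out.println("unpatched: " + e.getMessage());
        }
        // prints "patched: FAILED"
        System.out.println("patched: "
            + patched(State.SUCCESS_FINISHING_CONTAINER).handle(Event.TA_TOO_MANY_FETCH_FAILURE));
    }
}
```

In the toy model the unpatched table throws as soon as the event arrives in a state with no registered transition, while the patched table absorbs the event and moves the attempt to a state from which it can be handled as usual.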