[jira] [Commented] (MAPREDUCE-7240) Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error

Wilfred Spiegelenburg (Jira) Tue, 26 Nov 2019 14:47:12 -0800


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982981#comment-16982981
 ]


Wilfred Spiegelenburg commented on MAPREDUCE-7240:
--------------------------------------------------

I checked the PRs that are linked to this jira. Jason gave a +1 on the trunk 
version in [PR #1674|https://github.com/apache/hadoop/pull/1674]. If your patch 
follows that change we should be good to go.

+1 (non binding)

For the concern raised in this 
[comment|https://issues.apache.org/jira/browse/MAPREDUCE-7240?focusedCommentId=16982254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16982254]:
 if the container ignores the newly raised event then the AM needs to handle 
that as per normal.
The main issue in the current code is that because it does not handle the fetch 
failure event a {{InvalidStateTransitionException}} is raised which causes the 
job to fail. After the change the event is handled and the job should continue 
and finish processing. The job can still fail as per normal but the single too 
many fetch failures event does not cause the job to fail immediately.

> Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at 
> SUCCESS_FINISHING_CONTAINER' cause job error
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7240
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7240
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.8.2
>            Reporter: luhuachao
>            Assignee: luhuachao
>            Priority: Critical
>              Labels: kerberos
>         Attachments: MAPREDUCE-7240-001.patch, 
> application_1566552310686_260041.log
>
>
> *log in appmaster*
> {noformat}
> 2019-09-03 17:18:43,090 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures 
> for output of task attempt: attempt_1566552310686_260041_m_000052_0 ... 
> raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures 
> for output of task attempt: attempt_1566552310686_260041_m_000049_0 ... 
> raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures 
> for output of task attempt: attempt_1566552310686_260041_m_000051_0 ... 
> raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures 
> for output of task attempt: attempt_1566552310686_260041_m_000050_0 ... 
> raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures 
> for output of task attempt: attempt_1566552310686_260041_m_000053_0 ... 
> raising fetch failure to map
> 2019-09-03 17:18:43,092 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1566552310686_260041_m_000052_0 transitioned from state SUCCEEDED to 
> FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,092 ERROR [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle 
> this event at current state for attempt_1566552310686_260041_m_000049_0
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1206)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:146)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1458)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1450)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>       at java.lang.Thread.run(Thread.java:745)
> 2019-09-03 17:18:43,093 ERROR [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle 
> this event at current state for attempt_1566552310686_260041_m_000051_0
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1206)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:146)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1458)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1450)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>       at java.lang.Thread.run(Thread.java:745)
> 2019-09-03 17:18:43,093 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1566552310686_260041_m_000050_0 transitioned from state SUCCEEDED to 
> FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,093 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1566552310686_260041_m_000053_0 transitioned from state SUCCEEDED to 
> FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,094 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: 
> task_1566552310686_260041_m_000052 Task Transitioned from SUCCEEDED to 
> SCHEDULED
> 2019-09-03 17:18:43,096 FATAL [IPC Server handler 27 on 35972] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
> attempt_1566552310686_260041_r_000005_0 - exited : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#22
>       at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1961)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; 
> bailing-out.
>       at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:367)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:289)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:355)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2019-09-03 17:18:43,096 INFO [IPC Server handler 27 on 35972] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from 
> attempt_1566552310686_260041_r_000005_0: Error: 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#22
>       at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1961)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; 
> bailing-out.
>       at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:367)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:289)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:355)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2019-09-03 17:18:43,097 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1566552310686_260041Job Transitioned from RUNNING to ERROR
> 2019-09-03 17:18:43,099 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job
> {noformat}
>   
> nodemanager's  log is like same with log in  MAPREDUCE-6869.
> the code in TaskAttemptImpl indicate the Invalid event: 
> TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER cause the job state 
> turn into error; what i confused is 
>  # what cause the appmater  handle the TA_TOO_MANY_FETCH_FAILURE  event on 
> SUCCESS_FINISHING_CONTAINER，illegal event on this state.  but some other can 
> successfully transitioned from state SUCCEEDED to FAILED on 
> TA_TOO_MANY_FETCH_FAILURE  event.
>  # restart the nodemanager would solve the error in nm; the shuffle error 
> would fix too. what cause this phenomenon.
> Correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org

[jira] [Commented] (MAPREDUCE-7240) Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error

Reply via email to