[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030268#comment-14030268
 ] 

Zhijie Shen commented on MAPREDUCE-5924:
----------------------------------------

Some investigation based our logs:

TA_COMMIT_PENDING at COMMIT_PENDING happens because there're two 
TA_COMMIT_PENDING that have been sent to TaskAttemptListenerImpl. The first one 
causes the state transition from RUNNING to COMMIT_PENDING, and the second one 
causes the invalid transition.

Here's the code that notifies AM of committing the pending task.
{code}
// say the task tracker that task is commit pending
      while (true) {
        try {
          umbilical.commitPending(taskId, taskStatus);
          break;
        } catch (InterruptedException ie) {
          // ignore
        } catch (IOException ie) {
          LOG.warn("Failure sending commit pending: " + 
                    StringUtils.stringifyException(ie));
          if (--retries == 0) {
            System.exit(67);
          }
        }
      }
{code}

The loop has been iterated twice, because the following exception happens:

{code}
2014-06-09 09:15:38,539 WARN [main] org.apache.hadoop.mapred.Task: Failure 
sending commit pending: java.io.IOException: Failed on local exception: 
java.io.IOException: An existing connection was forcibly closed by the remote 
host; Host Details : local host is: "sdevin-clster53/10.215.16.72"; destination 
host is: "sdevin-clster54":63415; 
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
        at org.apache.hadoop.ipc.Client.call(Client.java:1414)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        at 
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
        at com.sun.proxy.$Proxy9.commitPending(Unknown Source)
        at org.apache.hadoop.mapred.Task.done(Task.java:1006)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:397)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: An existing connection was forcibly closed by 
the remote host
        at sun.nio.ch.SocketDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225)
        at sun.nio.ch.IOUtil.read(IOUtil.java:198)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
        at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:510)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949)
{code}

Throughout the log, I can see a lot of "IOException: An existing connection was 
forcibly closed by the remote host" (This issue seems to have been raised and 
solved in HADOOP-317, but it still occurs on Windows), and this particular 
exception instance results in re-sending committing message to AM.

In addition, it seem to have a logic problem about re-sending the committing 
message. Given the service side has done everything, but IOException happens on 
the response. The second should not to be sent by the client, or should be 
ignored by the service. This is something we need to fix at the MR side. 
There're two options to fix the invalid transition:

1. At the state machine, we ignore TA_COMMIT_PENDING when a task attempt is at 
COMMIT_PENDING. This is just a one-line fix.

2. We can add RetryCache for TaskAttemptListenerImpl, when commitPending is 
called, we check whether we have cache for a particular attempt Id. If we have, 
it means the transition has been executed before, and we can skip the 
transition.





> Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at 
> COMMIT_PENDING'
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5924
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5924
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>
> Post the issue on behalf of [~yeshavora]:
> The Sort job over 1GB data failed with below error
> {code}
> 2014-06-09 09:15:38,746 INFO [Socket Reader #1 for port 63415] 
> SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
> job_1402304714683_0002 (auth:SIMPLE)
> 2014-06-09 09:15:38,750 INFO [IPC Server handler 13 on 63415] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state update 
> from attempt_1402304714683_0002_r_000015_1000
> 2014-06-09 09:15:38,751 ERROR [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle 
> this event at current state for attempt_1402304714683_0002_r_000015_1000
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> TA_COMMIT_PENDING at COMMIT_PENDING
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1058)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:145)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1271)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1263)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:722)
> 2014-06-09 09:15:38,753 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1402304714683_0002Job Transitioned from RUNNING to ERROR
> {code}
> The JobHistory Url prints job state = ERROR



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to