[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187485#comment-13187485 ]
Hudson commented on MAPREDUCE-2450: ----------------------------------- Integrated in Hadoop-Common-trunk-Commit #1551 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1551/]) MAPREDUCE-2450. Fixed a corner case with interrupted communication threads leading to a long timeout in Task. Contributed by Rajesh Balamohan. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1232314 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java > Calls from running tasks to TaskTracker methods sometimes fail and incur a > 60s timeout > -------------------------------------------------------------------------------------- > > Key: MAPREDUCE-2450 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 > Project: Hadoop Map/Reduce > Issue Type: Bug > Affects Versions: 0.23.0 > Reporter: Matei Zaharia > Assignee: Rajesh Balamohan > Fix For: 1.0.0 > > Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, > HADOOP_5380-Y.0.20.20x.patch, MAPREDUCE-2450.patch, mapreduce-2450.patch > > > I'm seeing some map tasks in my jobs take 1 minute to commit after they > finish the map computation. On the map side, the output looks like this: > <code> > 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot > initialize JVM Metrics with processName=MAP, sessionId= - already initialized > 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: > numReduceTasks: 800 > 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = > 300 > 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = > 239075328/298844160 > 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer > = 786432/983040 > 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush > of map output > 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill > 0 > 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: > Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of > commiting > 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: > Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed > on local exception: java.nio.channels.ClosedChannelException > at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) > at org.apache.hadoop.ipc.Client.call(Client.java:733) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) > at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) > at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) > at java.lang.Thread.run(Thread.java:619) > Caused by: java.nio.channels.ClosedChannelException > at > java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) > at > java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) > at > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > at java.io.FilterInputStream.read(FilterInputStream.java:116) > at > org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > at java.io.BufferedInputStream.read(BufferedInputStream.java:237) > at java.io.DataInputStream.readInt(DataInputStream.java:370) > at > org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) > 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task > 'attempt_200903022127_0001_m_003163_0' done. > </code> > In the TaskTracker log, it looks like this: > <code> > 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server > Responder, call ping(attempt_200903022127_0001_m_003163_0) from > 127.0.0.1:56884: output error > 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 10 on 50311 caught: java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at > org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) > at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) > at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) > at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) > </code> > Note that the task actually seemed to commit - it didn't get speculatively > executed or anything. However, the job wasn't able to continue until this one > task was done. Both parties seem to think the channel was closed. How does > the channel get closed externally? If closing it from outside is unavoidable, > maybe the right thing to do is to set a much lower timeout, because 1 minute > delay can be pretty significant for a small job. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira