[ https://issues.apache.org/jira/browse/MAPREDUCE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Piotr Kołaczkowski updated MAPREDUCE-4506:
------------------------------------------

    Status: Patch Available  (was: Open)

I attach a patch disabling the 'break connection' feature.

> EofException / 'connection reset by peer' while copying map output
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4506
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4506
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.0.3
>         Environment: Ubuntu Linux 12.04 LTS, 64-bit, Java 6 update 33
>            Reporter: Piotr Kołaczkowski
>            Priority: Minor
>         Attachments: RamManager.patch, ReduceTask.patch
>
>
> When running complex MapReduce jobs with many mappers and reducers (e.g. 8
> mappers and 8 reducers on an 8-core machine), the following exception
> sometimes appears in the logs during the shuffle phase:
> {noformat}
> WARN [570516323@qtp-2060060479-164] 2012-07-19 02:50:21,229 TaskTracker.java (line 3894) getMapOutput(attempt_201207161621_0217_m_000071_0,0) failed :
> org.mortbay.jetty.EofException
>         at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
>         at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:568)
>         at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1005)
>         at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:648)
>         at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:579)
>         at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3872)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
>         at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:326)
>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>         at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.write0(Native Method)
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:43)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>         at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
>         at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
>         at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
> {noformat}
> At first this looks like a network problem. However, it turns out that
> Hadoop's shuffleInMemory sometimes deliberately closes map-output-copy
> connections, only to reopen them a few milliseconds later, because free
> memory is temporarily unavailable. Because the sending side does not
> expect this, an exception is thrown. This also wastes resources on the
> sender side, which does more work than necessary by serving the
> additional requests.
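
The 'break connection' behaviour described in the report can be illustrated with a small, self-contained Java sketch. This is not Hadoop's code: the class `ShuffleReserveSketch` and its methods `reserveOrBreakConnection`, `release` and `used` are hypothetical names invented for illustration, and the model is deliberately non-blocking (it reports failure instead of waiting for memory). It only shows the pattern the report objects to: when the in-memory budget is exhausted, the receiver closes an already-open copy stream and the caller has to reconnect later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical, simplified model of a reducer-side in-memory shuffle budget. */
public class ShuffleReserveSketch {
    private final long maxBytes;
    private long usedBytes;

    public ShuffleReserveSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Books {@code requested} bytes if they fit in the budget and returns true.
     * Otherwise closes the open copy stream and returns false, so the caller
     * must open a new connection and ask the sender for the same output again.
     */
    public synchronized boolean reserveOrBreakConnection(long requested, InputStream in) {
        if (usedBytes + requested <= maxBytes) {
            usedBytes += requested;
            return true;
        }
        try {
            // Per the report, the sender is not expecting this close.
            in.close();
        } catch (IOException ignored) {
            // Nothing useful to do in this sketch if close itself fails.
        }
        return false;
    }

    /** Returns previously reserved bytes to the budget. */
    public synchronized void release(long bytes) {
        usedBytes -= bytes;
    }

    public synchronized long used() {
        return usedBytes;
    }

    public static void main(String[] args) {
        ShuffleReserveSketch ram = new ShuffleReserveSketch(100);
        InputStream first = new ByteArrayInputStream(new byte[60]);
        InputStream second = new ByteArrayInputStream(new byte[60]);

        System.out.println("first copy reserved: " + ram.reserveOrBreakConnection(60, first));
        System.out.println("second copy reserved: " + ram.reserveOrBreakConnection(60, second));
        ram.release(60); // the first map output has been merged or spilled
        InputStream retry = new ByteArrayInputStream(new byte[60]);
        System.out.println("retry reserved: " + ram.reserveOrBreakConnection(60, retry));
    }
}
```

In this model the second copy costs the sender two requests instead of one. An alternative that matches the intent of disabling the feature would be to keep the stream open and simply wait until memory is released; whether that is exactly what the attached patches do is not stated in this message.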