Hi Robert,

Thanks for the reply. The Hadoop version is hadoop-0.20.203.0. It is weird that this is only a problem when the amount of data goes up.
My setup might be to blame; this is all a learning process for me, so I have 5 VMs running: 1 VM is the JobTracker/NameNode, and the other 4 are data/task nodes. They can all ping and ssh to each other OK.

Cheers

Russell

On 4 Nov 2011, at 15:39, Robert Evans wrote:

> I am not sure what is causing this, but yes, they are related. In Hadoop the
> map output is served to the reducers through Jetty, which is an embedded web
> server. If the reducers are not able to fetch the map outputs, then they
> assume that the mapper is bad and a new mapper is relaunched to compute the
> map output. From the errors it looks like the map output is being
> deleted/not showing up for some of the mappers. I am not really sure why
> that would be happening. What version of Hadoop are you using?
>
> --Bobby Evans
>
> On 11/4/11 10:28 AM, "Russell Brown" <misterr...@gmail.com> wrote:
>
> Hi,
> I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/NameNode. I can
> run small jobs on this cluster fine (up to a few thousand keys), but beyond
> that I start seeing errors like this:
>
> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : attempt_201111040342_0006_m_000005_0, Status : FAILED
> Too many fetch-failures
> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused
> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused
> 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1%
> 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1%
> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : attempt_201111040342_0006_m_000010_0, Status : FAILED
> Too many fetch-failures
> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused
> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused
> 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1%
> 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1%
> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : attempt_201111040342_0006_m_000011_0, Status : FAILED
> Too many fetch-failures
>
> I have no idea what this means. All my nodes can ssh to each other, passwordlessly, all the time.
>
> On the individual data/task nodes the logs have errors like this:
>
> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201111040342_0006_m_000015_0,2) failed :
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/vagrant/jobcache/job_201111040342_0006/attempt_201111040342_0006_m_000015_0/output/file.out.index in any of the configured local directories
>         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
>         at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>         at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:326)
>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>         at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>
> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: attempt_201111040342_0006_m_000015_0. Ignored.
>
> Are they related? What do any of them mean?
>
> If I use a much smaller amount of data I don't see any of these errors and everything works fine, so I guess they are to do with some resource (though what, I don't know). Looking at MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE I see that the datanodes have ample disk space, so that isn't it…
>
> Any help at all really appreciated. Searching for the errors on Google got me nothing; reading the Hadoop definitive guide got me nothing.
>
> Many thanks in advance
>
> Russell
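P.P.S. The DiskErrorException ("Could not find ... file.out.index in any of the configured local directories") makes me wonder about mapred.local.dir: as I understand it, intermediate map output lives under those directories on the local filesystem, not in HDFS, and by default they fall under hadoop.tmp.dir (often a small /tmp on a VM), so the DFS free space shown on the live-nodes page would not reflect that partition filling up. A sketch of the mapred-site.xml entry I could set — the path is only an example, not from my actual setup:

```xml
<!-- mapred-site.xml: put intermediate map output on a partition with
     enough free space; /data/mapred/local is a placeholder path. -->
<property>
  <name>mapred.local.dir</name>
  <value>/data/mapred/local</value>
</property>
```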
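P.S. Since the "Error reading task outputConnection refused" lines suggest the reducers cannot reach the TaskTrackers' embedded web server, here is a small sketch I can run from each VM to check that every other node's shuffle port actually accepts connections. The hostnames are placeholders for my nodes, and 50060 is my understanding of the default TaskTracker HTTP port in 0.20.x — both are assumptions to adjust:

```python
import socket

# Placeholder hostnames and the assumed 0.20.x default TaskTracker HTTP
# port (50060); substitute your own cluster's names and port.
TASKTRACKERS = ["datanode1", "datanode2", "datanode3", "datanode4"]
SHUFFLE_PORT = 50060

def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds,
    i.e. something is listening and no firewall refuses us."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeouts, DNS failures
        return False

if __name__ == "__main__":
    for host in TASKTRACKERS:
        ok = port_reachable(host, SHUFFLE_PORT)
        print(f"{host}:{SHUFFLE_PORT} {'reachable' if ok else 'NOT reachable'}")
```

If a node reports NOT reachable even though ssh works, that would point at the VM's /etc/hosts or firewall rather than at Hadoop itself.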