Re: Combiner timing out
I'm using CDH3u0 and streaming, so this is hadoop-0.20.2 at patch level 923.21 (cf https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases).

I modified the streaming code to confirm that it is calling progress when I ask it to, and to see which Reporter class is actually being used. It's the Task.TaskReporter class for map and reduce, but the Reporter.NULL class for combine (both map-side and reduce-side combines). It appears to be the mapred layer (as opposed to streaming) that sets the reporter, so this should affect non-streaming jobs as well.

Chris

On Nov 4, 2011, at 9:11 AM, Robert Evans wrote:

> There was a change that went into 0.20.205
> (https://issues.apache.org/jira/browse/MAPREDUCE-2187) where, after so many
> inputs to the combiner, progress is automatically reported. I looked through
> the code for 0.20.205 and from what I can see the CombineOutputCollector
> should be getting an instance of TaskReporter. What version of Hadoop are
> you running? Are you using the old APIs in the mapred package or the newer
> APIs in the mapreduce java package?
>
> --Bobby Evans
>
> On 11/4/11 1:20 AM, "Christopher Egner" wrote:
>
> Hi all,
>
> Let me preface this with my understanding of how tasks work.
>
> If a task takes a long time (default 10 min) and demonstrates no progress,
> the task tracker will decide the process is hung, kill it, and start a new
> attempt. Normally, one uses a Reporter instance's progress method to provide
> progress updates and avoid this. For a streaming mapper, the Reporter class
> is org.apache.hadoop.mapred.Task$TaskReporter, and this works well. Streaming
> is even set up to take progress, status, and counter updates from stderr,
> which is really cool.
>
> However, for combiner tasks, the class is
> org.apache.hadoop.mapred.Reporter$1. The first anonymous class in this
> particular java file is the Reporter.NULL class, which ignores all updates.
> So even if a combiner task is updating its reporter in accordance with the
> docs (see postscript), its updates are ignored and it dies at 10 minutes. Or
> one sets mapred.task.timeout very high, allowing truly hung tasks to go
> unrecognised for much longer.
>
> At least this is what I've been able to put together from reading code and
> searching the web for docs (except the Hadoop JIRA, which has been down for
> a while - my bad luck).
>
> So am I understanding this correctly? Are there plans to change this? Or are
> there reasons that combiners can't have normal reporters associated with
> them?
>
> Thanks for any help,
> Chris
>
> http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Reporter
> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/ (cf tip 7)
> http://hadoop.apache.org/common/docs/r0.18.3/streaming.html#How+do+I+update+counters+in+streaming+applications%3F
> http://hadoop.apache.org/common/docs/r0.20.0/mapred-default.html (cf mapred.task.timeout)
Re: How do I diagnose a really slow copy
The task has been running several hours, and the map phase is essentially a null mapper - rewrite the key and value stored by an earlier reducer. There is no firewall - the entire job is running on an internal cluster - admittedly launched from my local box on the company network. It is running WAY slower than jobs previously run on the same hardware, and I suspect something is wrong but lack the tools to even start diagnosing the issue.

On Fri, Nov 4, 2011 at 9:07 AM, Harsh J wrote:

> Steve,
>
> The copy phase may start early, and the slow copy could also just be due
> to unavailability of completed map outputs at this stage. Does your
> question eliminate that case here?
>
> I'd also check the network speeds you get between two slave nodes, and
> whether your TaskTracker logs are indicating issues transferring map output
> requests via HTTP.
>
> Also, do you run any form of network filtering, firewalls, etc. that
> may be working at the packet level? I've seen that cause slowdowns before,
> but am not too sure if that's the case here.
>
> On 04-Nov-2011, at 8:50 PM, Steve Lewis wrote:
>
> I have been finding that my cluster is running abnormally slowly.
> A typical reduce task reports
> reduce > copy (113 of 431 at 0.07 MB/s)
> 70 KB/second is a truly dreadful rate, and tasks are running much slower
> under Hadoop than the same code running the same operations on a single box.
> Where do I look to find why IO operations might be so slow?
>
> --
> Steven M. Lewis PhD

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
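[One cheap way to start answering "where do I look", independent of Hadoop: measure raw sequential write speed on a suspect node and compare it with the observed shuffle rate. This is a generic probe, not something from the thread; the 64 MB transfer size is an arbitrary choice.]

```python
# Rough sequential-write throughput probe, to help rule local disk I/O in
# or out when copy rates look absurdly low (0.07 MB/s in this thread).
import os
import tempfile
import time

def write_throughput_mb_s(size_mb=64, block=1024 * 1024):
    fd, path = tempfile.mkstemp(prefix="ioprobe-")
    try:
        start = time.time()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                f.write(b"\0" * block)
            f.flush()
            os.fsync(f.fileno())      # include the time to reach the disk
        return size_mb / (time.time() - start)
    finally:
        os.remove(path)

print("%.1f MB/s" % write_throughput_mb_s())
```

[A healthy local disk should report tens to hundreds of MB/s; a result anywhere near the reduce-copy rate would point at the node itself rather than the network.]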
Re: Combiner timing out
There was a change that went into 0.20.205
(https://issues.apache.org/jira/browse/MAPREDUCE-2187) where, after so many
inputs to the combiner, progress is automatically reported. I looked through
the code for 0.20.205 and from what I can see the CombineOutputCollector
should be getting an instance of TaskReporter. What version of Hadoop are you
running? Are you using the old APIs in the mapred package or the newer APIs
in the mapreduce java package?

--Bobby Evans

On 11/4/11 1:20 AM, "Christopher Egner" wrote:

Hi all,

Let me preface this with my understanding of how tasks work.

If a task takes a long time (default 10min) and demonstrates no progress, the
task tracker will decide the process is hung, kill it, and start a new
attempt. Normally, one uses a Reporter instance's progress method to provide
progress updates and avoid this. For a streaming mapper, the Reporter class
is org.apache.hadoop.mapred.Task$TaskReporter and this works well. Streaming
is even set up to take progress, status, and counter updates from stderr,
which is really cool.

However, for combiner tasks, the class is
org.apache.hadoop.mapred.Reporter$1. The first subclass in this particular
java file is the Reporter.NULL class, which ignores all updates. So even if
a combiner task is updating its reporter in accordance with docs (see
postscript), its updates are ignored and it dies at 10 minutes. Or one sets
mapred.task.timeout very high, allowing truly hung tasks to go unrecognised
for much longer.

At least this is what I've been able to put together from reading code and
searching the web for docs (except hadoop jira which has been down for a
while - my bad luck).

So am I understanding this correctly? Are there plans to change this? Or
reasons that combiners can't have normal reporters associated to them?

Thanks for any help,
Chris

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Reporter
http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/ (cf tip 7)
http://hadoop.apache.org/common/docs/r0.18.3/streaming.html#How+do+I+update+counters+in+streaming+applications%3F
http://hadoop.apache.org/common/docs/r0.20.0/mapred-default.html (cf mapred.task.timeout)
Re: Never ending reduce jobs, error Error reading task outputConnection refused
Done so, working. Awesome, and many many thanks!

Cheers

Russell

On 4 Nov 2011, at 16:06, Uma Maheswara Rao G 72686 wrote:

> - Original Message -
> From: Russell Brown
> Date: Friday, November 4, 2011 9:18 pm
> Subject: Re: Never ending reduce jobs, error Error reading task
> outputConnection refused
> To: mapreduce-user@hadoop.apache.org
>
>> On 4 Nov 2011, at 15:44, Uma Maheswara Rao G 72686 wrote:
>>
>>> - Original Message -
>>> From: Russell Brown
>>> Date: Friday, November 4, 2011 9:11 pm
>>> Subject: Re: Never ending reduce jobs, error Error reading task
>>> outputConnection refused
>>> To: mapreduce-user@hadoop.apache.org
>>>
>>>> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
>>>>
>>>>> This problem may come if you dont configure the hostmappings
>>>>> properly. Can you check whether your tasktrackers are pingable
>>>>> from each other with the configured hosts names?
>>>>
>>>> Hi,
>>>> Thanks for replying so fast!
>>>>
>>>> Hostnames? I use IP addresses in the slaves config file, and via
>>>> IP addresses everyone can ping everyone else, do I need to set
>>>> up hostnames too?
>>> Yes, can you configure hostname mappings and check..
>>
>> Like full blown DNS? I mean there is no reference to any machine
>> by hostname in any of my config anywhere, so I'm not sure where to
>> start. These machines are just on my local network.
>
> you need to configure them in /etc/hosts file.
> ex: xx.xx.xx.xx1 TT_HOSTNAME1
>     xx.xx.xx.xx2 TT_HOSTNAME2
>     xx.xx.xx.xx3 TT_HOSTNAME3
>     xx.xx.xx.xx4 TT_HOSTNAME4
> configure them in all the machines and check.
>
>> Cheers
>>
>> Russell
>>>
>>> Regards,
>>> Uma
>>> - Original Message -
>>> From: Russell Brown
>>> Date: Friday, November 4, 2011 9:00 pm
>>> Subject: Never ending reduce jobs, error Error reading task
>>> outputConnection refused
>>> To: mapreduce-user@hadoop.apache.org
>>>
>> Hi,
>> I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode.
>> I can run small jobs on this cluster fine (like up to a few thousand
>> keys) but more than that and I start seeing errors like this:
>>
>> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id :
>> attempt_20040342_0006_m_05_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1%
>> 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1%
>> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id :
>> attempt_20040342_0006_m_10_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1%
>> 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1%
>> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id :
>> attempt_20040342_0006_m_11_0, Status : FAILED
>> Too many fetch-failures
>>
>> I have no IDEA what this means. All my nodes can ssh to each other,
>> passwordlessly, all the time.
>> >> On the individual data/task nodes the logs have errors like this: >> >> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: >> getMapOutput(attempt_20040342_0006_m_15_0,2) failed : >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not >> find >> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index >> in any of the configured local directories >> at >> >> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) >> at >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) >> at >> >> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) >>at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) >> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) >> at >> >> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) >> at >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) >> at >> >> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) >>at >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) >> at >> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.ja
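[Uma's /etc/hosts suggestion, which resolved the problem above, can be sanity-checked from each node. A minimal sketch; the point is only that every configured tasktracker name must resolve on every machine. "localhost" below is a stand-in for your real hostnames.]

```python
# Check that the hostnames configured for the tasktrackers resolve on
# this machine (e.g. after adding them to /etc/hosts on every node).
# The hostname list is a placeholder; substitute your own cluster's names.
import socket

def check_resolution(hostnames):
    results = {}
    for name in hostnames:
        try:
            results[name] = socket.gethostbyname(name)
        except socket.gaierror:
            results[name] = None      # unresolved: fix /etc/hosts on this node
    return results

for host, addr in check_resolution(["localhost"]).items():
    print(host, "->", addr or "UNRESOLVED")
```

[Run the same check on all machines; fetch failures like those above can appear when node A resolves a name that node B cannot.]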
Re: How do I diagnose a really slow copy
Steve,

The copy phase may start early, and the slow copy could also just be due to unavailability of completed map outputs at this stage. Does your question eliminate that case here?

I'd also check the network speeds you get between two slave nodes, and whether your TaskTracker logs are indicating issues transferring map output requests via HTTP.

Also, do you run any form of network filtering, firewalls, etc. that may be working at the packet level? I've seen that cause slowdowns before, but am not too sure if that's the case here.

On 04-Nov-2011, at 8:50 PM, Steve Lewis wrote:

> I have been finding that my cluster is running abnormally slowly.
> A typical reduce task reports
> reduce > copy (113 of 431 at 0.07 MB/s)
> 70 KB/second is a truly dreadful rate, and tasks are running much slower
> under Hadoop than the same code running the same operations on a single box.
> Where do I look to find why IO operations might be so slow?
>
> --
> Steven M. Lewis PhD
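[Harsh's suggestion to measure the network speed between two slave nodes can be done with nothing but Python on both ends: run receiver() on one slave and sender("that-slave") on another. This is a generic probe sketch, not from the thread; the port and transfer sizes are arbitrary, and the loopback demo only shows the mechanics.]

```python
# Crude point-to-point bandwidth probe between two nodes.
import socket
import threading
import time

CHUNK = b"\0" * 65536        # 64 KiB per send

def receiver(bind="0.0.0.0", port=9911):
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind, port))
    srv.listen(1)
    conn, _ = srv.accept()
    total, start = 0, time.time()
    while True:
        data = conn.recv(65536)
        if not data:
            break
        total += len(data)
    conn.close()
    srv.close()
    return (total / 1048576.0) / (time.time() - start)   # MB/s

def sender(host, port=9911, mb=32):
    s = socket.create_connection((host, port))
    for _ in range(mb * 16):         # 16 chunks of 64 KiB per MiB
        s.sendall(CHUNK)
    s.close()

def loopback_demo(mb=8):
    # Demonstrates the mechanics on one machine; real use is two nodes.
    result = {}
    t = threading.Thread(target=lambda: result.update(rate=receiver("127.0.0.1")))
    t.start()
    time.sleep(0.2)                  # give the receiver time to bind
    sender("127.0.0.1", mb=mb)
    t.join()
    return result["rate"]
```

[On a healthy gigabit link between nodes you would expect on the order of 100 MB/s; a number anywhere near the 0.07 MB/s seen in the reduce copy points at the network or name resolution rather than Hadoop itself.]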
Re: Never ending reduce jobs, error Error reading task outputConnection refused
- Original Message -
From: Russell Brown
Date: Friday, November 4, 2011 9:18 pm
Subject: Re: Never ending reduce jobs, error Error reading task outputConnection refused
To: mapreduce-user@hadoop.apache.org

> On 4 Nov 2011, at 15:44, Uma Maheswara Rao G 72686 wrote:
>
>> - Original Message -
>> From: Russell Brown
>> Date: Friday, November 4, 2011 9:11 pm
>> Subject: Re: Never ending reduce jobs, error Error reading task
>> outputConnection refused
>> To: mapreduce-user@hadoop.apache.org
>>
>>> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
>>>
>>>> This problem may come if you dont configure the hostmappings
>>>> properly. Can you check whether your tasktrackers are pingable
>>>> from each other with the configured hosts names?
>>>
>>> Hi,
>>> Thanks for replying so fast!
>>>
>>> Hostnames? I use IP addresses in the slaves config file, and via
>>> IP addresses everyone can ping everyone else, do I need to set up
>>> hostnames too?
>> Yes, can you configure hostname mappings and check..
>
> Like full blown DNS? I mean there is no reference to any machine
> by hostname in any of my config anywhere, so I'm not sure where to
> start. These machines are just on my local network.

you need to configure them in /etc/hosts file.
ex: xx.xx.xx.xx1 TT_HOSTNAME1
    xx.xx.xx.xx2 TT_HOSTNAME2
    xx.xx.xx.xx3 TT_HOSTNAME3
    xx.xx.xx.xx4 TT_HOSTNAME4
configure them in all the machines and check.

> Cheers
>
> Russell
>
>> Regards,
>> Uma
>> - Original Message -
>> From: Russell Brown
>> Date: Friday, November 4, 2011 9:00 pm
>> Subject: Never ending reduce jobs, error Error reading task
>> outputConnection refused
>> To: mapreduce-user@hadoop.apache.org
>>
> Hi,
> I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode.
I can run small jobs on this cluster > fine > (like up to a few thousand keys) but more than that and I > start > seeing errors like this: > > > 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_05_0, Status : FAILED > Too many fetch-failures > 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% > 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% > 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_10_0, Status : FAILED > Too many fetch-failures > 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% > 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% > 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_11_0, Status : FAILED > Too many fetch-failures > > > > I have no IDEA what this means. All my nodes can ssh to each > other, pass wordlessly, all the time. 
> > On the individual data/task nodes the logs have errors like this: > > 2011-11-04 08:24:42,514 WARN > >> org.apache.hadoop.mapred.TaskTracker: > getMapOutput(attempt_20040342_0006_m_15_0,2) failed : > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could > >> not > find > >> > taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index > in any of the configured local directories > at > > >> > org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) > at > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) > at > > >> > org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > > >> > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > > >> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > > >> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > > >> > org.mortbay.jetty.servlet.SessionHandler.handl
Re: Never ending reduce jobs, error Error reading task outputConnection refused
Hi Robert,

Thanks for the reply. The version of Hadoop is hadoop-0.20.203.0. It is weird how this is only a problem when the amount of data goes up. My setup might be to blame; this is all a learning process for me, so I have 5 VMs running. 1 VM is the JobTracker/Namenode, the other 4 are data/task nodes. They can all ping each other and ssh to each other ok.

Cheers

Russell

On 4 Nov 2011, at 15:39, Robert Evans wrote:

> I am not sure what is causing this, but yes they are related. In hadoop the
> map output is served to the reducers through jetty, which is an embedded web
> server. If the reducers are not able to fetch the map outputs, then they
> assume that the mapper is bad and a new mapper is relaunched to compute the
> map output. From the errors it looks like the map output is being
> deleted/not showing up for some of the mappers. I am not really sure why
> that would be happening. What version of hadoop are you using?
>
> --Bobby Evans
>
> On 11/4/11 10:28 AM, "Russell Brown" wrote:
>
> Hi,
> I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode.
I can > run small jobs on this cluster fine (like up to a few thousand keys) but more > than that and I start seeing errors like this: > > > 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_05_0, Status : FAILED > Too many fetch-failures > 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection > refused > 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection > refused > 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% > 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% > 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_10_0, Status : FAILED > Too many fetch-failures > 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection > refused > 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection > refused > 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% > 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% > 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_11_0, Status : FAILED > Too many fetch-failures > > > > I have no IDEA what this means. All my nodes can ssh to each other, pass > wordlessly, all the time. 
> > On the individual data/task nodes the logs have errors like this: > > 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: > getMapOutput(attempt_20040342_0006_m_15_0,2) failed : > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find > taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index > in any of the configured local directories > at > org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) > at > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) > at > org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) > at > 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) > at > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) > at > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) > >
Re: Never ending reduce jobs, error Error reading task outputConnection refused
On 4 Nov 2011, at 15:44, Uma Maheswara Rao G 72686 wrote:

> - Original Message -
> From: Russell Brown
> Date: Friday, November 4, 2011 9:11 pm
> Subject: Re: Never ending reduce jobs, error Error reading task
> outputConnection refused
> To: mapreduce-user@hadoop.apache.org
>
>> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
>>
>>> This problem may come if you dont configure the hostmappings
>>> properly. Can you check whether your tasktrackers are pingable
>>> from each other with the configured hosts names?
>>
>> Hi,
>> Thanks for replying so fast!
>>
>> Hostnames? I use IP addresses in the slaves config file, and via
>> IP addresses everyone can ping everyone else, do I need to set up
>> hostnames too?
> Yes, can you configure hostname mappings and check..

Like full blown DNS? I mean there is no reference to any machine by hostname in any of my config anywhere, so I'm not sure where to start. These machines are just on my local network.

>> Cheers
>>
>> Russell
>>>
>>> Regards,
>>> Uma
>>> - Original Message -
>>> From: Russell Brown
>>> Date: Friday, November 4, 2011 9:00 pm
>>> Subject: Never ending reduce jobs, error Error reading task
>>> outputConnection refused
>>> To: mapreduce-user@hadoop.apache.org
>>>
Hi,
I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode.
I can run small jobs on this cluster fine (like up to a few thousand keys) but more than that and I start seeing errors like this: 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_05_0, Status : FAILED Too many fetch-failures 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_10_0, Status : FAILED Too many fetch-failures 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_11_0, Status : FAILED Too many fetch-failures I have no IDEA what this means. All my nodes can ssh to each other, pass wordlessly, all the time. 
On the individual data/task nodes the logs have errors like this: 2011-11-04 08:24:42,514 WARN >> org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_20040342_0006_m_15_0,2) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could >> not find >> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index >> in any of the configured local directories at >> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) >> at >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) at >> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) >>at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at >> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) >> at >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at >> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) >>at >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) >> at >> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at >> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) >> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at >> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) >> at >> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at >> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) >> at org.mortbay.jetty.Server.handle(Server.java:326) at >> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) >> at >> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnectio
Re: Never ending reduce jobs, error Error reading task outputConnection refused
- Original Message -
From: Russell Brown
Date: Friday, November 4, 2011 9:11 pm
Subject: Re: Never ending reduce jobs, error Error reading task outputConnection refused
To: mapreduce-user@hadoop.apache.org

> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
>
>> This problem may come if you dont configure the hostmappings
>> properly. Can you check whether your tasktrackers are pingable
>> from each other with the configured hosts names?
>
> Hi,
> Thanks for replying so fast!
>
> Hostnames? I use IP addresses in the slaves config file, and via
> IP addresses everyone can ping everyone else, do I need to set up
> hostnames too?

Yes, can you configure hostname mappings and check..

> Cheers
>
> Russell
>
>> Regards,
>> Uma
>> - Original Message -
>> From: Russell Brown
>> Date: Friday, November 4, 2011 9:00 pm
>> Subject: Never ending reduce jobs, error Error reading task
>> outputConnection refused
>> To: mapreduce-user@hadoop.apache.org
>>
>> Hi,
>> I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode.
I can run small jobs on this cluster fine > >> (like up to a few thousand keys) but more than that and I start > >> seeing errors like this: > >> > >> > >> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : > >> attempt_20040342_0006_m_05_0, Status : FAILED > >> Too many fetch-failures > >> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task > >> outputConnection refused > >> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task > >> outputConnection refused > >> 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% > >> 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% > >> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : > >> attempt_20040342_0006_m_10_0, Status : FAILED > >> Too many fetch-failures > >> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task > >> outputConnection refused > >> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task > >> outputConnection refused > >> 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% > >> 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% > >> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : > >> attempt_20040342_0006_m_11_0, Status : FAILED > >> Too many fetch-failures > >> > >> > >> > >> I have no IDEA what this means. All my nodes can ssh to each > >> other, pass wordlessly, all the time. 
> >> > >> On the individual data/task nodes the logs have errors like this: > >> > >> 2011-11-04 08:24:42,514 WARN > org.apache.hadoop.mapred.TaskTracker: > >> getMapOutput(attempt_20040342_0006_m_15_0,2) failed : > >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could > not > >> find > taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index > in any of the configured local directories > >>at > >> > org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) > at > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) > >>at > >> > org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > >>at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > >>at > >> > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > >>at > >> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > >>at > >> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > >>at > >> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > >>at > >> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > >>at > >> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > >>at > >> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) 
> at > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) > >>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) > >>at > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)>> > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) > >>at > >> > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectCha
Re: Never ending reduce jobs, error Error reading task outputConnection refused
On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:

> This problem may come if you dont configure the hostmappings properly.
> Can you check whether your tasktrackers are pingable from each other with
> the configured hosts names?

Hi,
Thanks for replying so fast!

Hostnames? I use IP addresses in the slaves config file, and via IP addresses everyone can ping everyone else, do I need to set up hostnames too?

Cheers

Russell

> Regards,
> Uma
> - Original Message -
> From: Russell Brown
> Date: Friday, November 4, 2011 9:00 pm
> Subject: Never ending reduce jobs, error Error reading task outputConnection
> refused
> To: mapreduce-user@hadoop.apache.org
>
>> Hi,
>> I have a cluster of 4 tasktracker/datanodes and 1
>> JobTracker/Namenode. I can run small jobs on this cluster fine
>> (like up to a few thousand keys) but more than that and I start
>> seeing errors like this:
>>
>> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id :
>> attempt_20040342_0006_m_05_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1%
>> 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1%
>> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id :
>> attempt_20040342_0006_m_10_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task
>> outputConnection refused
>> 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1%
>> 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1%
>> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id :
>> attempt_20040342_0006_m_11_0, Status : FAILED
>> Too many fetch-failures
>>
>> I have no IDEA what this means.
All my nodes can ssh to each >> other, passwordlessly, all the time. >> >> On the individual data/task nodes the logs have errors like this: >> >> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: >> getMapOutput(attempt_20040342_0006_m_15_0,2) failed : >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not >> find >> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index >> in any of the configured local directories >> at >> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) >> at >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) >> at >> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) >> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) >> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) >> at >> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) >> at >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) >> at >> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) >> at >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) >> at >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) >> at >> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) >> at >> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) >> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) >> at >> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) >> at >> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) >> at >> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) >> at org.mortbay.jetty.Server.handle(Server.java:326) >> at >>
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) >> at >> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) >> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) >> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) >> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) >> at >> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) >> at >> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) >> >> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: >> Unknown child with bad map output: >> attempt_20040342_0006_m_15_0. Ignored. >> >> >> Are they related? What do any of them mean? >> >> If I use a much smaller amount of data I don't see any of these >> errors and everything works fine, so I guess they are to do with some resource (though what I don't know?)
Re: Never ending reduce jobs, error Error reading task outputConnection refused
I am not sure what is causing this, but yes they are related. In Hadoop the map output is served to the reducers through Jetty, which is an embedded web server. If the reducers are not able to fetch the map outputs, then they assume that the mapper is bad and a new mapper is relaunched to compute the map output. From the errors it looks like the map output is being deleted/not showing up for some of the mappers. I am not really sure why that would be happening. What version of Hadoop are you using? --Bobby Evans On 11/4/11 10:28 AM, "Russell Brown" wrote: Hi, I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode. I can run small jobs on this cluster fine (like up to a few thousand keys) but more than that and I start seeing errors like this: 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_05_0, Status : FAILED Too many fetch-failures 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_10_0, Status : FAILED Too many fetch-failures 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_11_0, Status : FAILED Too many fetch-failures I have no IDEA what this means. All my nodes can ssh to each other, passwordlessly, all the time.
On the individual data/task nodes the logs have errors like this: 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_20040342_0006_m_15_0,2) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index in any of the configured local directories at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: attempt_20040342_0006_m_15_0. Ignored. Are they related? What do any of them mean? If I use a much smaller amount of data I don't see any of these errors and everything works fine, so I guess they are to do with some resource (though what I don't know?) Looking at MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE I see that datanodes have ample disk space, so that isn't it... Any help at all really appreciated. Searching for the errors on Google has got me nothing, reading the Hadoop Definitive Guide got me nothing either.
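Bobby's point — that reducers fetch map output over HTTP from each TaskTracker's embedded Jetty server — suggests a quick manual check: probe the same servlet the reducers hit (the `TaskTracker$MapOutputServlet` visible in the stack trace above). A minimal JDK-only sketch follows; the hostname and task IDs are placeholders, and the exact query-parameter names are my reading of the 0.20 fetch URL, so verify them against your version. Run with a host argument to actually connect; "Connection refused" here reproduces what the reducers see.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Hedged sketch: manually probe a TaskTracker's map-output servlet to see
// whether reducers could fetch from it. Host and IDs below are placeholders.
public class MapOutputProbe {
    // Build the fetch URL a reducer would use (assumed 0.20 format;
    // the TaskTracker HTTP port defaults to 50060).
    static String buildUrl(String host, int port, String jobId, String mapAttemptId, int reduceId) {
        return "http://" + host + ":" + port + "/mapOutput?job=" + jobId
                + "&map=" + mapAttemptId + "&reduce=" + reduceId;
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : null;
        String url = buildUrl(host != null ? host : "datanode1", 50060,
                "job_20040342_0006", "attempt_20040342_0006_m_000005_0", 2);
        System.out.println("Probe URL: " + url);
        if (host != null) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);
            // An IOException here ("Connection refused") is the same failure
            // the reducers report as a fetch failure.
            System.out.println("HTTP response: " + conn.getResponseCode());
        }
    }
}
```

Running this from each slave against every other slave separates network/firewall problems from the missing-file problem in the DiskErrorException.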
Re: Never ending reduce jobs, error Error reading task outputConnection refused
This problem may come if you don't configure the host mappings properly. Can you check whether your tasktrackers are pingable from each other with the configured hostnames? Regards, Uma - Original Message - From: Russell Brown Date: Friday, November 4, 2011 9:00 pm Subject: Never ending reduce jobs, error Error reading task outputConnection refused To: mapreduce-user@hadoop.apache.org > Hi, > I have a cluster of 4 tasktracker/datanodes and 1 > JobTracker/Namenode. I can run small jobs on this cluster fine > (like up to a few thousand keys) but more than that and I start > seeing errors like this: > > > 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_05_0, Status : FAILED > Too many fetch-failures > 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% > 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% > 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_10_0, Status : FAILED > Too many fetch-failures > 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task > outputConnection refused > 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% > 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% > 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : > attempt_20040342_0006_m_11_0, Status : FAILED > Too many fetch-failures > > > > I have no IDEA what this means. All my nodes can ssh to each > other, passwordlessly, all the time.
> > On the individual data/task nodes the logs have errors like this: > > 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: > getMapOutput(attempt_20040342_0006_m_15_0,2) failed : > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not > find > taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index > in any of the configured local directories > at > org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) > at > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) > at > org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) > at > 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) > at > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) > at > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) > > 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: > Unknown child with bad map output: > attempt_20040342_0006_m_15_0. Ignored. > > > Are they related? What do any of them mean? > > If I use a much smaller amount of data I don't see any of these > errors and everything works fine, so I guess they are to do with > some resource (though what I don't know?) Looking at > MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE > I see that datanodes have ample disk space, so that isn't it… > > Any help at all really appreciated. Searching for the errors on > Google has got me nothing, reading the Hadoop Definitive Guide got me > nothing. > Many thanks in advance >
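Uma's host-mapping check — can every node resolve and reach every other node by its configured name — can be scripted instead of done by hand. A minimal JDK-only sketch, with placeholder hostnames (substitute the names from your masters/slaves files and /etc/hosts); note that even when slaves are listed by IP, TaskTrackers still report a hostname to the JobTracker, so a name that resolves inconsistently across nodes can produce exactly these fetch failures:

```java
import java.net.InetAddress;

// Hedged sketch: resolve and ping each cluster node by its configured
// hostname from the node this runs on. Hostnames are placeholders.
public class HostCheck {
    // DNS / hosts-file lookup; throws UnknownHostException on a bad mapping.
    static InetAddress resolve(String host) throws Exception {
        return InetAddress.getByName(host);
    }

    public static void main(String[] args) throws Exception {
        String[] hosts = args.length > 0 ? args
                : new String[] {"localhost"};  // placeholder node list
        for (String h : hosts) {
            InetAddress addr = resolve(h);
            boolean up = addr.isReachable(3000);  // ICMP or TCP echo probe
            System.out.println(h + " -> " + addr.getHostAddress() + " reachable=" + up);
        }
    }
}
```

Run it on every node with the full node list as arguments; each node should print the same address for each name.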
Never ending reduce jobs, error Error reading task outputConnection refused
Hi, I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode. I can run small jobs on this cluster fine (like up to a few thousand keys) but more than that and I start seeing errors like this: 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_05_0, Status : FAILED Too many fetch-failures 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:16:13 INFO mapred.JobClient: map 97% reduce 1% 11/11/04 08:16:25 INFO mapred.JobClient: map 100% reduce 1% 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_10_0, Status : FAILED Too many fetch-failures 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused 11/11/04 08:17:24 INFO mapred.JobClient: map 97% reduce 1% 11/11/04 08:17:36 INFO mapred.JobClient: map 100% reduce 1% 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : attempt_20040342_0006_m_11_0, Status : FAILED Too many fetch-failures I have no IDEA what this means. All my nodes can ssh to each other, passwordlessly, all the time.
On the individual data/task nodes the logs have errors like this: 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_20040342_0006_m_15_0,2) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index in any of the configured local directories at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: attempt_20040342_0006_m_15_0. Ignored. Are they related? What do any of them mean? If I use a much smaller amount of data I don't see any of these errors and everything works fine, so I guess they are to do with some resource (though what I don't know?) Looking at MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE I see that datanodes have ample disk space, so that isn't it… Any help at all really appreciated. Searching for the errors on Google has got me nothing, reading the Hadoop Definitive Guide got me nothing. Many thanks in advance Russell
How do I diagnose a really slow copy
I have been finding that my cluster is running abnormally slowly. A typical reduce task reports: reduce > copy (113 of 431 at 0.07 MB/s) 70 KB/second is a truly dreadful rate, and tasks are running much slower under Hadoop than the same code performing the same operations on a single box. Where do I look to find why I/O operations might be so slow? -- Steven M. Lewis PhD
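One way to split the question is to measure raw sequential write speed on the disk that holds the task-local directories: if a plain file write is also slow, the bottleneck is the disk (or a shared/virtualised volume), not the shuffle. A minimal JDK-only probe, with a placeholder path and size:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.Arrays;

// Hedged sketch: time a sequential write to the volume that backs
// mapred.local.dir. Path and size below are placeholders.
public class DiskWriteProbe {
    // Write `totalMb` megabytes to `path` and return the rate in MB/s.
    static double writeRateMbPerSec(File path, int totalMb) throws Exception {
        byte[] chunk = new byte[1024 * 1024];
        Arrays.fill(chunk, (byte) 1);
        long start = System.nanoTime();
        try (FileOutputStream out = new FileOutputStream(path)) {
            for (int i = 0; i < totalMb; i++) out.write(chunk);
            out.getFD().sync();  // flush to the device so we don't just time the page cache
        }
        double secs = (System.nanoTime() - start) / 1e9;
        path.delete();
        return totalMb / secs;
    }

    public static void main(String[] args) throws Exception {
        // Point this at the mapred.local.dir volume on the slow node.
        File f = new File(args.length > 0 ? args[0] : "/tmp/disk-probe.bin");
        System.out.printf("sequential write: %.1f MB/s%n", writeRateMbPerSec(f, 64));
    }
}
```

Comparing the result on a cluster node against the single box mentioned above should show quickly whether the hardware, rather than Hadoop, is the problem.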
Re: HDFS error : Could not Complete file
Hi, I think the below issue was due only to the HDFS architecture and not to map-reduce. But just to make sure that's the case, I am cross-posting to this group as well. I have also attached the program used to get this error. My input file contains about 2 million+ records. Any help is greatly appreciated. Thanks Sudhan S On Fri, Nov 4, 2011 at 2:47 PM, Sudharsan Sampath wrote: > Hi, > > I have a simple map-reduce program [map only :)] that reads the input and > emits the same to n outputs on a single node cluster with max map tasks set > to 10 on a 16 core processor machine. > > After a while the tasks begin to fail with the following exception log. > > 2011-01-01 03:17:52,149 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=temp,temp > ip=/x.x.x.x cmd=delete > > src=/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2 > dst=null perm=null > 2011-01-01 03:17:52,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_7046642930904717718_23143 on x.x.x.x: size 66148 But it does not > belong to any file.
> 2011-01-01 03:17:52,156 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: failed to complete > /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 > because dir.getFileBlocks() is null and pendingFile is null > 2011-01-01 03:17:52,156 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 12 on 9000, call > complete(/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0, > DFSClient_attempt_201101010256_0006_m_00_2) from x.x.x.x: error: > java.io.IOException: Could not complete write to file > /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 > by DFSClient_attempt_201101010256_0006_m_00_2 > java.io.IOException: Could not complete write to file > /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 > by DFSClient_attempt_201101010256_0006_m_00_2 > at > org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497) > at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962) > > > Looks like there's a delete command issued by FsNameSystem.audit before > it errors out stating it could not complete the write to a file inside > that directory. > > Any clue on what could have gone wrong? > > Thanks > Sudharsan S > TestMultipleOutputs.java Description: Binary data
Re: under cygwin JUST tasktracker run by cyg_server user, Permission denied .....
In 205 the code is different from your trace. Which version are you using? I just verified the code in older versions, http://mail-archives.apache.org/mod_mbox/hadoop-common-commits/201109.mbox/%3c20110902221116.d0b192388...@eris.apache.org%3E below is the code snippet.

+boolean rv = true;
+
+// read perms
+rv = f.setReadable(group.implies(FsAction.READ), false);
+checkReturnValue(rv, f, permission);

If rv is false then it throws the below error. Can you please create a simple program with the below path and try calling setReadable with the user that the task tracker starts as? Then we can get to know what error it is giving. Look at the javadoc http://download.oracle.com/javase/6/docs/api/java/io/File.html#setReadable(boolean,%20boolean)

setReadable public boolean setReadable(boolean readable, boolean ownerOnly) Sets the owner's or everybody's read permission for this abstract pathname. Parameters: readable - If true, sets the access permission to allow read operations; if false to disallow read operations. ownerOnly - If true, the read permission applies only to the owner's read permission; otherwise, it applies to everybody. If the underlying file system can not distinguish the owner's read permission from that of others, then the permission will apply to everybody, regardless of this value. Returns: true if and only if the operation succeeded. The operation will fail if the user does not have permission to change the access permissions of this abstract pathname. If readable is false and the underlying file system does not implement a read permission, then the operation will fail.

I am not sure how to provide the authentication in Cygwin. Please make sure you have the rights to change the permissions with that user. If I get some more info, I will update you. I sent it to mapreduce-user and CC'ed common. Regards, Uma - Original Message - From: Masoud Date: Friday, November 4, 2011 7:01 am Subject: Re: under cygwin JUST tasktracker run by cyg_server user, Permission denied .
To: common-u...@hadoop.apache.org > Dear Uma, > as you know when we use the start-all.sh command, all the output is > saved in > log files; > when I check the tasktracker log file, I see the below error > message and > it shuts down. > I'm really confused; it's been more than 4 days I'm working on this issue > and > have tried different ways but no result.^^ > > BS. > Masoud > > On 11/03/2011 08:34 PM, Uma Maheswara Rao G 72686 wrote: > > It won't display anything on the console. > > Only if you get an error while executing the command will it > display on the console. In your case it might have executed successfully. > > Are you still facing the same problem with TT startup? > > > > Regards, > > Uma > > - Original Message - > > From: Masoud > > Date: Thursday, November 3, 2011 7:02 am > > Subject: Re: under cygwin JUST tasktracker run by cyg_server > user, Permission denied . > > To: common-u...@hadoop.apache.org > > > >> Hi, > >> thanks for the info, I checked that report; it seems the same as mine but > >> no > >> specific solution is mentioned. > >> Yes, I changed this folder's permissions via Cygwin, NO RESULT. > >> I'm really confused. ... > >> > >> any idea please ...? > >> > >> Thanks, > >> B.S > >> > >> > >> On 11/01/2011 05:38 PM, Uma Maheswara Rao G 72686 wrote: > >>> Looks like that is a permissions-related issue on local dirs. > >>> There is an issue filed in mapred related to this problem: > >> https://issues.apache.org/jira/browse/MAPREDUCE-2921 > >>> Can you please provide the permissions explicitly and try? > >>> > >>> Regards, > >>> Uma > >>> - Original Message - > >>> From: Masoud > >>> Date: Tuesday, November 1, 2011 1:19 pm > >>> Subject: Re: under cygwin JUST tasktracker run by cyg_server > >> user, Permission denied . > >>> To: common-u...@hadoop.apache.org > >>> > Sure, ^^ > > when I run {namenode -format} it makes the dfs in c:/tmp/ administrator_hadoop/ > after that by running "start-all.sh" everything is OK, all daemons run > except the tasktracker.
> My current user is administrator, but the tasktracker runs as the > cyg_server > user that was made by Cygwin at installation time; this is a part > of the log > file: > 2011-11-01 14:26:54,463 INFO > org.apache.hadoop.mapred.TaskTracker: Starting tasktracker > with owner as cyg_server > 2011-11-01 14:26:54,463 INFO > org.apache.hadoop.mapred.TaskTracker: Good > mapred local directories are: /tmp/hadoop-cyg_server/mapred/local > 2011-11-01 14:26:54,479 ERROR > org.apache.hadoop.mapred.TaskTracker: Can > not start task tracker because java.io.IOException: Failed to set > permissions of path: \tmp\hadoop- > cyg_server\mapred\local\ttprivate to 0700 > at > org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:680) > at > org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:653)
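The simple program Uma asks for — create the ttprivate path and make the same permission calls, checking every return value the way checkReturnValue does — can look like this. The exact call sequence FileUtil uses for 0700 is an assumption based on the snippet quoted above (revoke each permission for everybody, then grant it back to the owner); on Cygwin/NTFS, setReadable(false, ...) commonly returns false, which would match the failure:

```java
import java.io.File;

// Hedged sketch of the probe Uma suggests: try to chmod a directory to 0700
// with plain java.io.File calls and report whether any call returns false.
public class SetReadableProbe {
    // Assumed 0700 sequence: revoke for everybody, re-grant for the owner.
    static boolean chmod700(File dir) {
        boolean ok = dir.setReadable(false, false) && dir.setReadable(true, true);
        ok = ok && dir.setWritable(false, false) && dir.setWritable(true, true);
        ok = ok && dir.setExecutable(false, false) && dir.setExecutable(true, true);
        return ok;
    }

    public static void main(String[] args) {
        // The path from the log; adjust for your setup.
        File dir = new File(args.length > 0 ? args[0]
                : "/tmp/hadoop-cyg_server/mapred/local/ttprivate");
        dir.mkdirs();
        // A 'false' result is the same failed call that makes the TaskTracker
        // throw "Failed to set permissions of path ... to 0700".
        System.out.println("chmod 0700 on " + dir + " succeeded = " + chmod700(dir));
    }
}
```

Running it as the cyg_server user (the user the TaskTracker starts as) should show directly whether the JDK can change permissions on that path at all.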