Re: Combiner timing out

2011-11-04 Thread Christopher Egner
I'm using CDH3u0 and streaming, so this is hadoop-0.20.2 at patch level 923.21 
(cf https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases).  

I modified the streaming code to confirm that it calls progress when I ask it 
to, and to see which Reporter class is actually being used.  It's the 
Task.TaskReporter class for map and reduce, but the Reporter.NULL class for 
combine (both map-side and reduce-side combines).  It appears to be the mapred 
layer (as opposed to streaming) that sets the reporter, so this should affect 
non-streaming jobs as well.
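
For reference, here is a minimal sketch (not my actual job; the class name is 
illustrative) of a combiner written against the old mapred API that calls 
Reporter.progress() - exactly the kind of update that is silently dropped when 
the combiner's reporter is Reporter.NULL:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SummingCombiner extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      // Effective when the reporter is a TaskReporter (map/reduce),
      // but a no-op when the reporter is Reporter.NULL (combine).
      reporter.progress();
    }
    output.collect(key, new LongWritable(sum));
  }
}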

Chris

On Nov 4, 2011, at 9:11 AM, Robert Evans wrote:

> There was a change that went into 0.20.205 
> https://issues.apache.org/jira/browse/MAPREDUCE-2187 where after so many 
> inputs to the combiner progress is automatically reported.  I looked through 
> the code for 0.20.205 and from what I can see the CombineOutputCollector 
> should be getting an instance of TaskReporter.  What version of Hadoop are 
> you running?  Are you using the old APIs in the mapred package or the newer 
> APIs in the mapreduce java package?
> 
> --Bobby Evans
> 
> On 11/4/11 1:20 AM, "Christopher Egner"  wrote:
> 
> Hi all,
> 
> Let me preface this with my understanding of how tasks work.
> 
> If a task takes a long time (default 10min) and demonstrates no progress, the 
> task tracker will decide the process is hung, kill it, and start a new 
> attempt.  Normally, one uses a Reporter instance's progress method to provide 
> progress updates and avoid this. For a streaming mapper, the Reporter class 
> is org.apache.hadoop.mapred.Task$TaskReporter and this works well.  Streaming 
> is even set up to take progress, status, and counter updates from stderr, 
> which is really cool.
> 
> However, for combiner tasks, the class is 
> org.apache.hadoop.mapred.Reporter$1.  The first subclass in this particular 
> java file is the Reporter.NULL class, which ignores all updates.  So even if 
> a combiner task is updating its reporter in accordance with docs (see 
> postscript), its updates are ignored and it dies at 10 minutes.  Or one sets 
> mapred.task.timeout very high, allowing truly hung tasks to go unrecognised 
> for much longer.
> 
> At least this is what I've been able to put together from reading code and 
> searching the web for docs (except hadoop jira which has been down for a 
> while - my bad luck).
> 
> So am I understanding this correctly?  Are there plans to change this?  Or 
> reasons that combiners can't have normal reporters associated to them?
> 
> Thanks for any help,
> Chris
> 
> http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Reporter
> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/ (cf tip 7)
> http://hadoop.apache.org/common/docs/r0.18.3/streaming.html#How+do+I+update+counters+in+streaming+applications%3F
> http://hadoop.apache.org/common/docs/r0.20.0/mapred-default.html  (cf 
> mapred.task.timeout)
> 



Re: How do I diagnose a really slow copy

2011-11-04 Thread Steve Lewis
The task has been running for several hours, and the map phase is essentially a
null mapper - it rewrites the key and value stored by an earlier reducer. There
is no firewall - the entire job is running on an internal cluster, admittedly
launched from my local box on the company network. It is running WAY slower
than jobs previously run on the same hardware, and I suspect something is wrong
but lack the tools to even start diagnosing the issue.

On Fri, Nov 4, 2011 at 9:07 AM, Harsh J  wrote:

> Steve,
>
> The copy phase may start early, and the slow copy could also just be due
> to unavailability of completed map outputs at this stage. Does your
> question eliminate that case here?
>
> I'd also check the network speeds you get between two slave nodes, and if
> your TaskTracker logs are indicating issues transferring map output
> requests via HTTP.
>
> Also, do you run any form of network filtering stuff, firewalls, etc. that
> may be working at the packet levels? I've seen it cause slowdowns before,
> but am not too sure if that's the case here.
>
> On 04-Nov-2011, at 8:50 PM, Steve Lewis wrote:
>
> I have been finding a that my cluster is running abnormally slowly
> A typical reduce task reports
> reduce > copy (113 of 431 at 0.07 MB/s)
> 70 kb / second is a truely dreadful rate and tasks are running much slower
> under hadoop than the
> same code on a the same operations on a single box -
> Where do I look to find why IO operations might  be so slow??
>
> --
> Steven M. Lewis PhD
>
>
>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Combiner timing out

2011-11-04 Thread Robert Evans
There was a change that went into 0.20.205 
(https://issues.apache.org/jira/browse/MAPREDUCE-2187) whereby progress is 
automatically reported after a certain number of inputs to the combiner.  I 
looked through the code for 0.20.205 and from what I can see the 
CombineOutputCollector should be getting an instance of TaskReporter.  What 
version of Hadoop are you running?  Are you using the old APIs in the mapred 
package or the newer APIs in the mapreduce Java package?

--Bobby Evans

On 11/4/11 1:20 AM, "Christopher Egner"  wrote:

Hi all,

Let me preface this with my understanding of how tasks work.

If a task takes a long time (default 10min) and demonstrates no progress, the 
task tracker will decide the process is hung, kill it, and start a new attempt. 
 Normally, one uses a Reporter instance's progress method to provide progress 
updates and avoid this. For a streaming mapper, the Reporter class is 
org.apache.hadoop.mapred.Task$TaskReporter and this works well.  Streaming is 
even set up to take progress, status, and counter updates from stderr, which is 
really cool.
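
For reference, the stderr lines that streaming recognises look like the 
following (the group, counter name, and status text are whatever you choose):

reporter:counter:MyGroup,MyCounter,1
reporter:status:processed 100000 records so far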

However, for combiner tasks, the class is org.apache.hadoop.mapred.Reporter$1.  
The first subclass in this particular java file is the Reporter.NULL class, 
which ignores all updates.  So even if a combiner task is updating its reporter 
in accordance with docs (see postscript), its updates are ignored and it dies 
at 10 minutes.  Or one sets mapred.task.timeout very high, allowing truly hung 
tasks to go unrecognised for much longer.
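
Raising the timeout looks something like the following (the value is in 
milliseconds, so this is 30 minutes) - either per job on the streaming command 
line or in the JobConf:

-D mapred.task.timeout=1800000

jobConf.setLong("mapred.task.timeout", 1800000L);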

At least this is what I've been able to put together from reading code and 
searching the web for docs (except hadoop jira which has been down for a while 
- my bad luck).

So am I understanding this correctly?  Are there plans to change this?  Or are 
there reasons that combiners can't have normal reporters associated with them?

Thanks for any help,
Chris

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Reporter
http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/ (cf tip 7)
http://hadoop.apache.org/common/docs/r0.18.3/streaming.html#How+do+I+update+counters+in+streaming+applications%3F
http://hadoop.apache.org/common/docs/r0.20.0/mapred-default.html  (cf 
mapred.task.timeout)



Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Russell Brown
Done so, working, Awesome and many many thanks!

Cheers

Russell
On 4 Nov 2011, at 16:06, Uma Maheswara Rao G 72686 wrote:

> - Original Message -
> From: Russell Brown 
> Date: Friday, November 4, 2011 9:18 pm
> Subject: Re: Never ending reduce jobs, error Error reading task 
> outputConnection refused
> To: mapreduce-user@hadoop.apache.org
> 
>> 
>> On 4 Nov 2011, at 15:44, Uma Maheswara Rao G 72686 wrote:
>> 
>>> - Original Message -
>>> From: Russell Brown 
>>> Date: Friday, November 4, 2011 9:11 pm
>>> Subject: Re: Never ending reduce jobs, error Error reading task 
>> outputConnection refused
>>> To: mapreduce-user@hadoop.apache.org
>>> 
 
 On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
 
> This problem may come if you dont configure the hostmappings 
 properly.> Can you check whether your tasktrackers are pingable 
 from each other with the configured hosts names?
 
 
 Hi,
 Thanks for replying so fast!
 
 Hostnames? I use IP addresses in the slaves config file, and 
>> via 
 IP addresses everyone can ping everyone else, do I need to set 
>> up 
 hostnames too?
>>> Yes, can you configure hostname mappings and check..
>> 
>> Like full blown DNS? I mean there is no reference to any machine 
>> by hostname in any of my config anywhere, so I'm not sure where to 
>> start. These machines are just on my local network.
> you need to configure them in /etc/hosts file.
> ex: xx.xx.xx.xx1 TT_HOSTNAME1
>xx.xx.xx.xx2 TT_HOSTNAME2
>xx.xx.xx.xx3 TT_HOSTNAME3
>xx.xx.xx.xx4 TT_HOSTNAME4
> configure them in all the machines and check.
>> 
 
 Cheers
 
 Russell
> 
> Regards,
> Uma
> - Original Message -
> From: Russell Brown 
> Date: Friday, November 4, 2011 9:00 pm
> Subject: Never ending reduce jobs, error Error reading task 
 outputConnection refused
> To: mapreduce-user@hadoop.apache.org
> 
>> Hi,
>> I have a cluster of 4 tasktracker/datanodes and 1 
>> JobTracker/Namenode. I can run small jobs on this cluster 
>> fine 
>> (like up to a few thousand keys) but more than that and I 
>> start 
>> seeing errors like this:
>> 
>> 
>> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
>> attempt_20040342_0006_m_05_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
>> 11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
>> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
>> attempt_20040342_0006_m_10_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
>> 11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
>> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
>> attempt_20040342_0006_m_11_0, Status : FAILED
>> Too many fetch-failures
>> 
>> 
>> 
>> I have no IDEA what this means. All my nodes can ssh to each 
>> other, pass wordlessly, all the time.
>> 
>> On the individual data/task nodes the logs have errors like this:
>> 
>> 2011-11-04 08:24:42,514 WARN 
 org.apache.hadoop.mapred.TaskTracker: 
>> getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could 
 not 
>> find 
 
>> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>>  in any of the configured local directories
>>  at 
>> 
 
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>> at 
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
>>  at 
>> 
 
>> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
>>at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>  at 
>> 
 
>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)   
>> at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>  at 
>> 
 
>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
>>at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>  at 
>> 
 
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.ja

Re: How do I diagnose a really slow copy

2011-11-04 Thread Harsh J
Steve,

The copy phase may start early, and the slow copy could also just be due to 
unavailability of completed map outputs at this stage. Have you already ruled 
that case out here?

I'd also check the network speeds you get between two slave nodes, and if your 
TaskTracker logs are indicating issues transferring map output requests via 
HTTP.

Also, do you run any form of network filtering, firewalls, etc. that may be 
working at the packet level? I've seen that cause slowdowns before, but am not 
too sure if that's the case here.
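
A rough way to sanity-check the raw path between two slaves (the hostname and 
sizes below are illustrative) is something like:

dd if=/dev/zero bs=1M count=256 | ssh slave2 'cat > /dev/null'

dd prints an effective throughput when it finishes; if that number is also 
tiny, the problem is below Hadoop rather than in the shuffle itself.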

On 04-Nov-2011, at 8:50 PM, Steve Lewis wrote:

> I have been finding a that my cluster is running abnormally slowly
> A typical reduce task reports 
> reduce > copy (113 of 431 at 0.07 MB/s) 
> 70 kb / second is a truely dreadful rate and tasks are running much slower 
> under hadoop than the 
> same code on a the same operations on a single box -
> Where do I look to find why IO operations might  be so slow??
> 
> -- 
> Steven M. Lewis PhD
>  
> 



Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Uma Maheswara Rao G 72686
- Original Message -
From: Russell Brown 
Date: Friday, November 4, 2011 9:18 pm
Subject: Re: Never ending reduce jobs, error Error reading task 
outputConnection refused
To: mapreduce-user@hadoop.apache.org

> 
> On 4 Nov 2011, at 15:44, Uma Maheswara Rao G 72686 wrote:
> 
> > - Original Message -
> > From: Russell Brown 
> > Date: Friday, November 4, 2011 9:11 pm
> > Subject: Re: Never ending reduce jobs, error Error reading task 
> outputConnection refused
> > To: mapreduce-user@hadoop.apache.org
> > 
> >> 
> >> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
> >> 
> >>> This problem may come if you dont configure the hostmappings 
> >> properly.> Can you check whether your tasktrackers are pingable 
> >> from each other with the configured hosts names?
> >> 
> >> 
> >> Hi,
> >> Thanks for replying so fast!
> >> 
> >> Hostnames? I use IP addresses in the slaves config file, and 
> via 
> >> IP addresses everyone can ping everyone else, do I need to set 
> up 
> >> hostnames too?
> > Yes, can you configure hostname mappings and check..
> 
> Like full blown DNS? I mean there is no reference to any machine 
> by hostname in any of my config anywhere, so I'm not sure where to 
> start. These machines are just on my local network.
You need to configure them in the /etc/hosts file, for example:
xx.xx.xx.xx1 TT_HOSTNAME1
xx.xx.xx.xx2 TT_HOSTNAME2
xx.xx.xx.xx3 TT_HOSTNAME3
xx.xx.xx.xx4 TT_HOSTNAME4
Configure them on all the machines and check.
> 
> >> 
> >> Cheers
> >> 
> >> Russell
> >>> 
> >>> Regards,
> >>> Uma
> >>> - Original Message -
> >>> From: Russell Brown 
> >>> Date: Friday, November 4, 2011 9:00 pm
> >>> Subject: Never ending reduce jobs, error Error reading task 
> >> outputConnection refused
> >>> To: mapreduce-user@hadoop.apache.org
> >>> 
>  Hi,
>  I have a cluster of 4 tasktracker/datanodes and 1 
>  JobTracker/Namenode. I can run small jobs on this cluster 
> fine 
>  (like up to a few thousand keys) but more than that and I 
> start 
>  seeing errors like this:
>  
>  
>  11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
>  attempt_20040342_0006_m_05_0, Status : FAILED
>  Too many fetch-failures
>  11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
>  outputConnection refused
>  11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
>  outputConnection refused
>  11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
>  11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
>  11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
>  attempt_20040342_0006_m_10_0, Status : FAILED
>  Too many fetch-failures
>  11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
>  outputConnection refused
>  11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
>  outputConnection refused
>  11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
>  11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
>  11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
>  attempt_20040342_0006_m_11_0, Status : FAILED
>  Too many fetch-failures
>  
>  
>  
>  I have no IDEA what this means. All my nodes can ssh to each 
>  other, pass wordlessly, all the time.
>  
>  On the individual data/task nodes the logs have errors like this:
>  
>  2011-11-04 08:24:42,514 WARN 
> >> org.apache.hadoop.mapred.TaskTracker: 
>  getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
>  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could 
> >> not 
>  find 
> >> 
> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>  in any of the configured local directories
>   at 
>  
> >> 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>  at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
>   at 
>  
> >> 
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
>  
> >> 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
>  
> >> 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
>  
> >> 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)  
> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
>  
> >> 
> org.mortbay.jetty.servlet.SessionHandler.handl

Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Russell Brown
Hi Robert,
Thanks for the reply. Version of hadoop is hadoop-0.20.203.0.

It is weird how this is only a problem when the amount of data goes up.

My setup might be to blame; this is all a learning process for me, so I have 5 
VMs running: 1 VM is the JobTracker/Namenode, the other 4 are data/task nodes. 
They can all ping each other and ssh to each other OK.

Cheers

Russell
On 4 Nov 2011, at 15:39, Robert Evans wrote:

> I am not sure what is causing this, but yes they are related.  In hadoop the 
> map output is served to the reducers through jetty, which is an imbedded web 
> server.  If the reducers are not able to fetch the map outputs, then they 
> assume that the mapper is bad and a new mapper is relaunched to compute the 
> map output.  From the errors it looks like the map output is being 
> deleted/not showing up for some of the mappers.  I am not really sure why 
> that would be happening.  What version of hadoop are you using.
> 
> --Bobby Evans
> 
> On 11/4/11 10:28 AM, "Russell Brown"  wrote:
> 
> Hi,
> I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode. I can 
> run small jobs on this cluster fine (like up to a few thousand keys) but more 
> than that and I start seeing errors like this:
> 
> 
> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
> attempt_20040342_0006_m_05_0, Status : FAILED
> Too many fetch-failures
> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection 
> refused
> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection 
> refused
> 11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
> 11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
> attempt_20040342_0006_m_10_0, Status : FAILED
> Too many fetch-failures
> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection 
> refused
> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection 
> refused
> 11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
> 11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
> attempt_20040342_0006_m_11_0, Status : FAILED
> Too many fetch-failures
> 
> 
> 
> I have no IDEA what this means. All my nodes can ssh to each other, pass 
> wordlessly, all the time.
> 
> On the individual data/task nodes the logs have errors like this:
> 
> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
> getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>  in any of the configured local directories
> at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
> at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
> at 
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> at 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> 
>

Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Russell Brown

On 4 Nov 2011, at 15:44, Uma Maheswara Rao G 72686 wrote:

> - Original Message -
> From: Russell Brown 
> Date: Friday, November 4, 2011 9:11 pm
> Subject: Re: Never ending reduce jobs, error Error reading task 
> outputConnection refused
> To: mapreduce-user@hadoop.apache.org
> 
>> 
>> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
>> 
>>> This problem may come if you dont configure the hostmappings 
>> properly.> Can you check whether your tasktrackers are pingable 
>> from each other with the configured hosts names?
>> 
>> 
>> Hi,
>> Thanks for replying so fast!
>> 
>> Hostnames? I use IP addresses in the slaves config file, and via 
>> IP addresses everyone can ping everyone else, do I need to set up 
>> hostnames too?
> Yes, can you configure hostname mappings and check..

Like full blown DNS? I mean there is no reference to any machine by hostname in 
any of my config anywhere, so I'm not sure where to start. These machines are 
just on my local network.

>> 
>> Cheers
>> 
>> Russell
>>> 
>>> Regards,
>>> Uma
>>> - Original Message -
>>> From: Russell Brown 
>>> Date: Friday, November 4, 2011 9:00 pm
>>> Subject: Never ending reduce jobs, error Error reading task 
>> outputConnection refused
>>> To: mapreduce-user@hadoop.apache.org
>>> 
 Hi,
 I have a cluster of 4 tasktracker/datanodes and 1 
 JobTracker/Namenode. I can run small jobs on this cluster fine 
 (like up to a few thousand keys) but more than that and I start 
 seeing errors like this:
 
 
 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
 attempt_20040342_0006_m_05_0, Status : FAILED
 Too many fetch-failures
 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
 outputConnection refused
 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
 outputConnection refused
 11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
 11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
 attempt_20040342_0006_m_10_0, Status : FAILED
 Too many fetch-failures
 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
 outputConnection refused
 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
 outputConnection refused
 11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
 11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
 attempt_20040342_0006_m_11_0, Status : FAILED
 Too many fetch-failures
 
 
 
 I have no IDEA what this means. All my nodes can ssh to each 
 other, pass wordlessly, all the time.
 
 On the individual data/task nodes the logs have errors like this:
 
 2011-11-04 08:24:42,514 WARN 
>> org.apache.hadoop.mapred.TaskTracker: 
 getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could 
>> not 
 find 
>> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>>  in any of the configured local directories
at 
 
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>> at 
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
at 
 
>> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
>>at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
 
>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)   
>> at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
 
>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
>>at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
 
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) 
>> at 
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
 
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) 
>> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at 
 
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> at 
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
 
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) 
>> at org.mortbay.jetty.Server.handle(Server.java:326)
at 
 
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  
>> at 
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnectio

Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Uma Maheswara Rao G 72686
- Original Message -
From: Russell Brown 
Date: Friday, November 4, 2011 9:11 pm
Subject: Re: Never ending reduce jobs, error Error reading task 
outputConnection refused
To: mapreduce-user@hadoop.apache.org

> 
> On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:
> 
> > This problem may come if you dont configure the hostmappings 
> properly.> Can you check whether your tasktrackers are pingable 
> from each other with the configured hosts names?
> 
> 
> Hi,
> Thanks for replying so fast!
> 
> Hostnames? I use IP addresses in the slaves config file, and via 
> IP addresses everyone can ping everyone else, do I need to set up 
> hostnames too?
Yes, can you configure hostname mappings and check..
> 
> Cheers
> 
> Russell
> > 
> > Regards,
> > Uma
> > - Original Message -
> > From: Russell Brown 
> > Date: Friday, November 4, 2011 9:00 pm
> > Subject: Never ending reduce jobs, error Error reading task 
> outputConnection refused
> > To: mapreduce-user@hadoop.apache.org
> > 
> >> Hi,
> >> I have a cluster of 4 tasktracker/datanodes and 1 
> >> JobTracker/Namenode. I can run small jobs on this cluster fine 
> >> (like up to a few thousand keys) but more than that and I start 
> >> seeing errors like this:
> >> 
> >> 
> >> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
> >> attempt_20040342_0006_m_05_0, Status : FAILED
> >> Too many fetch-failures
> >> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
> >> outputConnection refused
> >> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
> >> outputConnection refused
> >> 11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
> >> 11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
> >> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
> >> attempt_20040342_0006_m_10_0, Status : FAILED
> >> Too many fetch-failures
> >> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
> >> outputConnection refused
> >> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
> >> outputConnection refused
> >> 11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
> >> 11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
> >> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
> >> attempt_20040342_0006_m_11_0, Status : FAILED
> >> Too many fetch-failures
> >> 
> >> 
> >> 
> >> I have no IDEA what this means. All my nodes can ssh to each 
> >> other, pass wordlessly, all the time.
> >> 
> >> On the individual data/task nodes the logs have errors like this:
> >> 
> >> 2011-11-04 08:24:42,514 WARN 
> org.apache.hadoop.mapred.TaskTracker: 
> >> getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
> >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could 
> not 
> >> find 
> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>  in any of the configured local directories
> >>at 
> >> 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>  at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
> >>at 
> >> 
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> >>at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> >>at 
> >> 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> >>at 
> >> 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>at 
> >> 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)  
> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >>at 
> >> 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)  
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >>at 
> >> 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >>at 
> >> 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)  
> at org.mortbay.jetty.Server.handle(Server.java:326)
> >>at 
> >> 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)   
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> >>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> >>at 
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)>> 
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >>at 
> >> 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectCha

Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Russell Brown

On 4 Nov 2011, at 15:35, Uma Maheswara Rao G 72686 wrote:

> This problem may come if you dont configure the hostmappings properly.
> Can you check whether your tasktrackers are pingable from each other with the 
> configured hosts names?


Hi,
Thanks for replying so fast!

Hostnames? I use IP addresses in the slaves config file, and via IP addresses 
everyone can ping everyone else. Do I need to set up hostnames too?

Cheers

Russell
> 
> Regards,
> Uma
> - Original Message -
> From: Russell Brown 
> Date: Friday, November 4, 2011 9:00 pm
> Subject: Never ending reduce jobs, error Error reading task outputConnection 
> refused
> To: mapreduce-user@hadoop.apache.org
> 
>> Hi,
>> I have a cluster of 4 tasktracker/datanodes and 1 
>> JobTracker/Namenode. I can run small jobs on this cluster fine 
>> (like up to a few thousand keys) but more than that and I start 
>> seeing errors like this:
>> 
>> 
>> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
>> attempt_20040342_0006_m_05_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
>> 11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
>> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
>> attempt_20040342_0006_m_10_0, Status : FAILED
>> Too many fetch-failures
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
>> outputConnection refused
>> 11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
>> 11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
>> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
>> attempt_20040342_0006_m_11_0, Status : FAILED
>> Too many fetch-failures
>> 
>> 
>> 
>> I have no IDEA what this means. All my nodes can ssh to each 
>> other, pass wordlessly, all the time.
>> 
>> On the individual data/task nodes the logs have errors like this:
>> 
>> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
>> getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
>> find 
>> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>>  in any of the configured local directories
>>  at 
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>> at 
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
>>  at 
>> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
>>at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>  at 
>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)   
>> at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>  at 
>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
>>at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>  at 
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) 
>> at 
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>  at 
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) 
>> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>>  at 
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> at 
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>  at 
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) 
>> at org.mortbay.jetty.Server.handle(Server.java:326)
>>  at 
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  
>> at 
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>>  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>>  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>>  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>  at 
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) 
>> at 
>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>> 
>> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
>> Unknown child with bad map output: 
>> attempt_20040342_0006_m_15_0. Ignored.
>> 
>> 
>> Are they related? What d any of the mean?
>> 
>> If I use a much smaller amount of data I don't see any of these 
>> errors and everything works fine, so I 

Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Robert Evans
I am not sure what is causing this, but yes they are related.  In hadoop the 
map output is served to the reducers through jetty, which is an embedded web 
server.  If the reducers are not able to fetch the map outputs, then they 
assume that the mapper is bad and a new mapper is relaunched to compute the map 
output.  From the errors it looks like the map output is being deleted/not 
showing up for some of the mappers.  I am not really sure why that would be 
happening.  What version of hadoop are you using?
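
For reference, the "configured local directories" in that DiskChecker error 
are the ones listed in mapred.local.dir (mapred-site.xml); a minimal entry 
looks like the sketch below (the path is just an example). It is worth 
checking that those directories exist on every tasktracker and are writable 
by the user the tasktracker runs as.

<property>
  <name>mapred.local.dir</name>
  <value>/var/lib/hadoop/mapred/local</value>
</property>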

--Bobby Evans

On 11/4/11 10:28 AM, "Russell Brown"  wrote:

Hi,
I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode. I can 
run small jobs on this cluster fine (like up to a few thousand keys) but more 
than that and I start seeing errors like this:


11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
attempt_20040342_0006_m_05_0, Status : FAILED
Too many fetch-failures
11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
attempt_20040342_0006_m_10_0, Status : FAILED
Too many fetch-failures
11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
attempt_20040342_0006_m_11_0, Status : FAILED
Too many fetch-failures



I have no IDEA what this means. All my nodes can ssh to each other, 
passwordlessly, all the time.

On the individual data/task nodes the logs have errors like this:

2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
 in any of the configured local directories
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
at 
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: Unknown 
child with bad map output: attempt_20040342_0006_m_15_0. Ignored.


Are they related? What do any of them mean?

If I use a much smaller amount of data I don't see any of these errors and 
everything works fine, so I guess they are to do with some resource (though 
what I don't know?) Looking at MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE

I see that datanodes have ample disk space, that isn't it...

Any help at all really appreciated. Searching for the errors on Google has me 
nothing, reading the Hadoop definitive guide as

Re: Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Uma Maheswara Rao G 72686
This problem may come if you don't configure the host mappings properly.
Can you check whether your tasktrackers are pingable from each other with the 
configured hostnames?

Regards,
Uma
- Original Message -
From: Russell Brown 
Date: Friday, November 4, 2011 9:00 pm
Subject: Never ending reduce jobs, error Error reading task outputConnection 
refused
To: mapreduce-user@hadoop.apache.org

> Hi,
> I have a cluster of 4 tasktracker/datanodes and 1 
> JobTracker/Namenode. I can run small jobs on this cluster fine 
> (like up to a few thousand keys) but more than that and I start 
> seeing errors like this:
> 
> 
> 11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
> attempt_20040342_0006_m_05_0, Status : FAILED
> Too many fetch-failures
> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
> outputConnection refused
> 11/11/04 08:16:08 WARN mapred.JobClient: Error reading task 
> outputConnection refused
> 11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
> 11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
> 11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
> attempt_20040342_0006_m_10_0, Status : FAILED
> Too many fetch-failures
> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
> outputConnection refused
> 11/11/04 08:17:20 WARN mapred.JobClient: Error reading task 
> outputConnection refused
> 11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
> 11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
> 11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
> attempt_20040342_0006_m_11_0, Status : FAILED
> Too many fetch-failures
> 
> 
> 
> I have no IDEA what this means. All my nodes can ssh to each 
> other, pass wordlessly, all the time.
> 
> On the individual data/task nodes the logs have errors like this:
> 
> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
> getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
> find 
> taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
>  in any of the configured local directories
>   at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
>  at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
>   at 
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)  
> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)  
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>   at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)  
> at org.mortbay.jetty.Server.handle(Server.java:326)
>   at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)   
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>   at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)  
> at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> 
> 2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
> Unknown child with bad map output: 
> attempt_20040342_0006_m_15_0. Ignored.
> 
> 
> Are they related? What d any of the mean?
> 
> If I use a much smaller amount of data I don't see any of these 
> errors and everything works fine, so I guess they are to do with 
> some resource (though what I don't know?) Looking at 
> MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE
> I see that datanodes have ample disk space, that isn't it…
> 
> Any help at all really appreciated. Searching for the errors on 
> Google has me nothing, reading the Hadoop definitive guide as me 
> nothing.
> Many thanks in advance
> 

Never ending reduce jobs, error Error reading task outputConnection refused

2011-11-04 Thread Russell Brown
Hi,
I have a cluster of 4 tasktracker/datanodes and 1 JobTracker/Namenode. I can 
run small jobs on this cluster fine (like up to a few thousand keys) but more 
than that and I start seeing errors like this:


11/11/04 08:16:08 INFO mapred.JobClient: Task Id : 
attempt_20040342_0006_m_05_0, Status : FAILED
Too many fetch-failures
11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
11/11/04 08:17:20 INFO mapred.JobClient: Task Id : 
attempt_20040342_0006_m_10_0, Status : FAILED
Too many fetch-failures
11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection 
refused
11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
11/11/04 08:19:20 INFO mapred.JobClient: Task Id : 
attempt_20040342_0006_m_11_0, Status : FAILED
Too many fetch-failures



I have no IDEA what this means. All my nodes can ssh to each other, 
passwordlessly, all the time.

On the individual data/task nodes the logs have errors like this:

2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: 
getMapOutput(attempt_20040342_0006_m_15_0,2) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/vagrant/jobcache/job_20040342_0006/attempt_20040342_0006_m_15_0/output/file.out.index
 in any of the configured local directories
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
at 
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: Unknown 
child with bad map output: attempt_20040342_0006_m_15_0. Ignored.


Are they related? What do any of them mean?

If I use a much smaller amount of data I don't see any of these errors and 
everything works fine, so I guess they are to do with some resource (though 
what I don't know?) Looking at MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE

I see that datanodes have ample disk space, that isn't it…

Any help at all is really appreciated. Searching for the errors on Google got 
me nothing; reading the Hadoop Definitive Guide got me nothing.

Many thanks in advance

Russell

How do I diagnose a really slow copy

2011-11-04 Thread Steve Lewis
I have been finding that my cluster is running abnormally slowly.
A typical reduce task reports
reduce > copy (113 of 431 at 0.07 MB/s)
70 KB/second is a truly dreadful rate, and tasks are running much slower
under hadoop than the same code doing the same operations on a single box.
Where do I look to find why IO operations might be so slow?

-- 
Steven M. Lewis PhD


Re: HDFS error : Could not Complete file

2011-11-04 Thread Sudharsan Sampath
Hi,

I think the below issue was only due to the HDFS architecture and not
map-reduce, but just to make sure that's the case, I am cross-posting to
this group as well.
I have also attached the program used to get this error. My input file
contains about 2 million+ records.
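
For context, the job is essentially the shape sketched below: an identity map 
that writes each record to several named outputs through the old mapred 
MultipleOutputs API (the types and the number of outputs here are 
illustrative; the real program is attached).

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class FanOutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  @SuppressWarnings("unchecked")
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // emit the same record to each named output (Output0, Output1, ...),
    // registered in the driver with MultipleOutputs.addNamedOutput(...)
    // on a job configured with setNumReduceTasks(0)
    for (int i = 0; i < 3; i++) {
      mos.getCollector("Output" + i, reporter).collect(value, value);
    }
  }

  public void close() throws IOException {
    mos.close();
  }
}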

Any help is greatly appreciated.

Thanks
Sudhan S

On Fri, Nov 4, 2011 at 2:47 PM, Sudharsan Sampath wrote:

> Hi,
>
> I have a simple map-reduce program [map only :) ]that reads the input and
> emits the same to n outputs on a single node cluster with max map tasks set
> to 10 on a 16 core processor machine.
>
> After a while the tasks begin to fail with the following exception log.
>
> 2011-01-01 03:17:52,149 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=temp,temp
> ip=/x.x.x.x cmd=delete
>  
> src=/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2
>   dst=nullperm=null
> 2011-01-01 03:17:52,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_7046642930904717718_23143 on x.x.x.x: size 66148 But it does not
> belong to any file.
> 2011-01-01 03:17:52,156 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: failed to complete
> /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0
> because dir.getFileBlocks() is null  and pendingFile is null
> 2011-01-01 03:17:52,156 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 12 on 9000, call
> complete(/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0,
> DFSClient_attempt_201101010256_0006_m_00_2) from x.x.x.x: error:
> java.io.IOException: Could not complete write to file
> /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0
> by DFSClient_attempt_201101010256_0006_m_00_2
> java.io.IOException: Could not complete write to file
> /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0
> by DFSClient_attempt_201101010256_0006_m_00_2
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497)
> at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>
>
> Looks like there's a delete command issued by FsNameSystem.audit before
> the it errors out stating it could not complete write to the file inside
> that..
>
> Any clue on what could have gone wrong?
>
> Thanks
> Sudharsan S
>


TestMultipleOutputs.java
Description: Binary data


Re: under cygwin JUST tasktracker run by cyg_server user, Permission denied .....

2011-11-04 Thread Uma Maheswara Rao G 72686
In 205, the code is different from the trace. Which version are you using?

I just verified the code in older versions 
(http://mail-archives.apache.org/mod_mbox/hadoop-common-commits/201109.mbox/%3c20110902221116.d0b192388...@eris.apache.org%3E); 
below is the code snippet.
+boolean rv = true;
+
+// read perms
+rv = f.setReadable(group.implies(FsAction.READ), false);
+checkReturnValue(rv, f, permission);

If rv is false, then it throws the below error.
Can you please create a simple program with the below path and try calling 
setReadable as the user the tasktracker starts with? Then we can get to know 
what error it is giving.
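
A minimal sketch of such a check (the path is taken from your log; adjust it 
and run the program as the same user the tasktracker starts as):

import java.io.File;

public class SetReadableCheck {
  public static void main(String[] args) {
    // same kind of calls whose return value FileUtil.checkReturnValue inspects
    File f = new File("/tmp/hadoop-cyg_server/mapred/local/ttprivate");
    boolean r = f.setReadable(true, false);
    boolean w = f.setWritable(true, true);
    boolean x = f.setExecutable(true, true);
    System.out.println("setReadable=" + r
        + " setWritable=" + w + " setExecutable=" + x);
  }
}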
 
 look at the javadoc 
http://download.oracle.com/javase/6/docs/api/java/io/File.html#setReadable(boolean,%20boolean)

setReadable

public boolean setReadable(boolean readable, boolean ownerOnly)

Sets the owner's or everybody's read permission for this abstract pathname.

Parameters:
readable - If true, sets the access permission to allow read operations; if 
false to disallow read operations
ownerOnly - If true, the read permission applies only to the owner's read 
permission; otherwise, it applies to everybody. If the underlying file system 
can not distinguish the owner's read permission from that of others, then the 
permission will apply to everybody, regardless of this value. 
Returns:
true if and only if the operation succeeded. The operation will fail if the 
user does not have permission to change the access permissions of this abstract 
pathname. If readable is false and the underlying file system does not 
implement a read permission, then the operation will fail. 

I am not sure how to provide the authentication in Cygwin. Please make sure 
you have rights to change the permissions as that user. If I get some more 
info, I will update you.

I sent it to mapreduce-user and CCed common.

Regards,
Uma


- Original Message -
From: Masoud 
Date: Friday, November 4, 2011 7:01 am
Subject: Re: under cygwin JUST tasktracker run by cyg_server user, Permission 
denied .
To: common-u...@hadoop.apache.org

> Dear Uma,
> as you know when we use start-all.sh command, all the outputs 
> saved in 
> log files,
> when i check the tasktracker log file, i see the below error 
> message and 
> its shutdown.
> im really confused, its more than 4 days im working in this issue 
> and 
> tried different ways but no result.^^
> 
> BS.
> Masoud
> 
> On 11/03/2011 08:34 PM, Uma Maheswara Rao G 72686 wrote:
> > it wont disply any thing on console.
> > If you get any error while exceuting the command, then only it 
> will disply on console. In your case it might executed successfully.
> > Still you are facing same problem with TT startup?
> >
> > Regards,
> > Uma
> > - Original Message -
> > From: Masoud
> > Date: Thursday, November 3, 2011 7:02 am
> > Subject: Re: under cygwin JUST tasktracker run by cyg_server 
> user, Permission denied .
> > To: common-u...@hadoop.apache.org
> >
> >> Hi,
> >> thanks for info, i checked that report, seems same with mine but
> >> no
> >> specific solution mentioned.
> >> Yes, i changed this folder permission via cygwin,NO RESULT.
> >> Im really confused. ...
> >>
> >> any idea please ...?
> >>
> >> Thanks,
> >> B.S
> >>
> >>
> >> On 11/01/2011 05:38 PM, Uma Maheswara Rao G 72686 wrote:
> >>> Looks, that is permissions related issue on local dirs
> >>> There is an issue filed in mapred, related to this problem
> >> https://issues.apache.org/jira/browse/MAPREDUCE-2921
> >>> Can you please provide permissions explicitely and try?
> >>>
> >>> Regards,
> >>> Uma
> >>> - Original Message -
> >>> From: Masoud
> >>> Date: Tuesday, November 1, 2011 1:19 pm
> >>> Subject: Re: under cygwin JUST tasktracker run by cyg_server
> >> user, Permission denied .
> >>> To: common-u...@hadoop.apache.org
> >>>
>  Sure, ^^
> 
>  when I run {namenode -fromat} it makes dfs in c:/tmp/
>  administrator_hadoop/
>  after that by running "start-all.sh" every thing is OK, all 
> daemons run
>  except tasktracker.
>  My current user in administrator, but tacktracker runs by
>  cyg_server
>  user that made by cygwin in installation time;This is a part 
> of log
>  file:
>  2011-11-01 14:26:54,463 INFO 
> org.apache.hadoop.mapred.TaskTracker: Starting tasktracker 
> with owner as cyg_server
>  2011-11-01 14:26:54,463 INFO 
> org.apache.hadoop.mapred.TaskTracker: Good
>  mapred local directories are: /tmp/hadoop-cyg_server/mapred/local
>  2011-11-01 14:26:54,479 ERROR 
> org.apache.hadoop.mapred.TaskTracker: Can
>  not start task tracker because java.io.IOException: Failed to set
>  permissions of path: \tmp\hadoop-
> cyg_server\mapred\local\ttprivate to 0700
>    at
>  org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:680)
> at 
> org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:653)