Re: TaskTrackers disengaging from JobTracker

2008-10-31 Thread Aaron Kimball
To complete the picture: not only was our network swamped, I realized
tonight that the NameNode/JobTracker was running on a 99% full disk (it hit
100% full about thirty minutes ago). That poor JobTracker was fighting
against a lot of odds. As soon as we upgrade to a bigger disk and switch it
back on, I'll apply the supplied patch to the cluster.

Thank you for looking into this!
- Aaron

On Thu, Oct 30, 2008 at 3:42 PM, Raghu Angadi [EMAIL PROTECTED] wrote:

 Raghu Angadi wrote:

 Devaraj fwded the stacks that Aaron sent. As he suspected there is a
 deadlock in RPC server. I will file a blocker for 0.18 and above. This
 deadlock is more likely on a busy network.


 Aaron,

 Could you try the patch attached to
 https://issues.apache.org/jira/browse/HADOOP-4552 ?

 Thanks,
 Raghu.



Re: TaskTrackers disengaging from JobTracker

2008-10-30 Thread Billy Sehmel
Aaron,

Does this have to do with the cloud we got at SoftLayer or is it something
at the university like you said?

I think I provisioned 6 - 10 with 1G connections; let me log in and check
really quick.. no, they're all correct.

Speed: 1000Mb/s




On Wed, Oct 29, 2008 at 9:49 PM, Aaron Kimball [EMAIL PROTECTED] wrote:

 Just as I wrote that, Murphy's law struck :) This did not fix the issue
 after all.

 I think the problem is occurring because a huge amount of network bandwidth
 is being consumed by the jobs. What settings (timeouts, thread counts, etc),
 if any, ought I dial up to correct for this?

 Thanks,
 - Aaron


 Aaron Kimball wrote:

 It's a cluster being used for a university course; there are 30 students
 all running code which (to be polite) probably tests the limits of Hadoop's
 failure recovery logic. :)

 The current assignment is PageRank over Wikipedia; a 20 GB input corpus.
 Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50
 reduce tasks.

 I wrote a patch to address the NPE in JobTracker.killJob() and compiled it
 against TRUNK. I've put this on the cluster and it's now been holding steady
 for the last hour or so.. so that plus whatever other differences there are
 between 18.1 and TRUNK may have fixed things. (I'll submit the patch to the
 JIRA as soon as it finishes cranking against the JUnit tests)

 - Aaron


 Devaraj Das wrote:


 On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:

  The system load and memory consumption on the JT are both very close to
 idle states -- it's not overworked, I don't think

 I may have an idea of the problem, though. Digging back up a ways into
 the
 JT logs, I see this:

 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 4 on 9001, call killJob(job_200810290855_0025) from
 10.1.143.245:48253: error: java.io.IOException:
 java.lang.NullPointerException
 java.io.IOException: java.lang.NullPointerException
 at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:599)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



 This exception is then repeated for all the IPC server handlers. So I
 think
 the problem is that all the handler threads are dying one by one due to
 this
 NPE.


 This should not happen. IPC handler catches Throwable and handles that.
 Could you give more details like the kind of jobs (long/short) you are
 running, how many tasks they have, etc.

  This something I can fix myself, or is a patch available?

 - Aaron

 On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED]
 wrote:

  It's possible that the JobTracker is under duress and unable to respond
 to
 the TaskTrackers... what do the JobTracker logs say?

 Arun


 On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

  Hi all,

 I'm working with a 30 node Hadoop cluster that has just started
 demonstrating some weird behavior. It's run without incident for a few
 weeks.. and now:

 The cluster will run smoothly for 90--120 minutes or so, handling jobs
 continually during this time. Then suddenly it will be the case that
 all
 29
 TaskTrackers will get disconnected from the JobTracker. All the
 tracker
 daemon processes are still running on each machine; but the JobTracker
 will
 say 0 nodes available on the web status screen. Restarting MapReduce
 fixes
 this for another 90--120 minutes.

 This looks similar to
 https://issues.apache.org/jira/browse/HADOOP-1763,
 but
 we're running on 0.18.1.

 I found this in a TaskTracker log:

 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:
 Caught
 exception: java.io.IOException: Call failed on local exception
  at java.lang.Throwable.init(Throwable.java:67)
  at org.apache.hadoop.ipc.Client.call(Client.java:718)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)

  at
 org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)

  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
 Caused by: java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
  at sun.nio.ch.IOUtil.read(IOUtil.java:207)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
  at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
  at

 

Re: TaskTrackers disengaging from JobTracker

2008-10-30 Thread Raghu Angadi

Devaraj Das wrote:

I wrote a patch to address the NPE in JobTracker.killJob() and compiled
it against TRUNK. I've put this on the cluster and it's now been holding
steady for the last hour or so.. so that plus whatever other differences
there are between 18.1 and TRUNK may have fixed things. (I'll submit the
patch to the JIRA as soon as it finishes cranking against the JUnit tests)



Aaron, I don't think this is a solution to the problem you are seeing. The
IPC handlers are tolerant to exceptions. In particular, they must not die in
the event of an exception during RPC processing. Could you please get a
stack trace of the JobTracker threads (without your patch) when the TTs are
unable to talk to it. Access the url http://jt-host:jt-info-port/stacks
That will tell us what the handlers are up to.


Devaraj fwded the stacks that Aaron sent. As he suspected there is a 
deadlock in RPC server. I will file a blocker for 0.18 and above. This 
deadlock is more likely on a busy network.
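
For anyone who wants to double-check a suspected monitor deadlock directly in the
JobTracker JVM, independent of the /stacks servlet, the standard JMX thread API can
report it. A minimal sketch, assuming it runs inside the same JVM as the JobTracker
(e.g. from a debug hook); it is not part of Hadoop itself:

  import java.lang.management.ManagementFactory;
  import java.lang.management.ThreadInfo;
  import java.lang.management.ThreadMXBean;

  // Hypothetical helper: list monitor-deadlocked threads in the current JVM.
  public class DeadlockCheck {
    public static void main(String[] args) {
      ThreadMXBean mx = ManagementFactory.getThreadMXBean();
      long[] ids = mx.findMonitorDeadlockedThreads();   // null when no deadlock
      if (ids == null) {
        System.out.println("No monitor deadlock detected");
        return;
      }
      for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
        System.out.println(info.getThreadName() + " is blocked on "
            + info.getLockName() + " held by " + info.getLockOwnerName());
      }
    }
  }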


Raghu.



Re: TaskTrackers disengaging from JobTracker

2008-10-30 Thread Raghu Angadi

Raghu Angadi wrote:
Devaraj fwded the stacks that Aaron sent. As he suspected there is a 
deadlock in RPC server. I will file a blocker for 0.18 and above. This 
deadlock is more likely on a busy network.




Aaron,

Could you try the patch attached to 
https://issues.apache.org/jira/browse/HADOOP-4552 ?


Thanks,
Raghu.


TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case that all 29
TaskTrackers will get disconnected from the JobTracker. All the tracker
daemon processes are still running on each machine; but the JobTracker will
say 0 nodes available on the web status screen. Restarting MapReduce fixes
this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call failed on local exception
at java.lang.Throwable.init(Throwable.java:67)
at org.apache.hadoop.ipc.Client.call(Client.java:718)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
at sun.nio.ch.IOUtil.read(IOUtil.java:207)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
at java.io.FilterInputStream.read(FilterInputStream.java:127)
at
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
at java.io.DataInputStream.readInt(DataInputStream.java:381)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS
((40-40+0)1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
THREADS: [EMAIL PROTECTED]:50060



The NameNode and DataNodes are completely fine. Can't be a DNS issue,
because all DNS is served through /etc/hosts files. NameNode and JobTracker
are on the same machine.

Any help is appreciated
Thanks
- Aaron Kimball


Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Arun C Murthy
It's possible that the JobTracker is under duress and unable to  
respond to the TaskTrackers... what do the JobTracker logs say?


Arun

On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:


Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case that all 29
TaskTrackers will get disconnected from the JobTracker. All the tracker
daemon processes are still running on each machine; but the JobTracker will
say 0 nodes available on the web status screen. Restarting MapReduce fixes
this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.init(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS
((40-40+0)1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
THREADS: [EMAIL PROTECTED]:50060


The NameNode and DataNodes are completely fine. Can't be a DNS issue,
because all DNS is served through /etc/hosts files. NameNode and JobTracker
are on the same machine.

Any help is appreciated
Thanks
- Aaron Kimball




Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
The system load and memory consumption on the JT are both very close to
idle states -- it's not overworked, I don't think

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I think
the problem is that all the handler threads are dying one by one due to this
NPE.

Is this something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

 It's possible that the JobTracker is under duress and unable to respond to
 the TaskTrackers... what do the JobTracker logs say?

 Arun


 On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

  Hi all,

 I'm working with a 30 node Hadoop cluster that has just started
 demonstrating some weird behavior. It's run without incident for a few
 weeks.. and now:

 The cluster will run smoothly for 90--120 minutes or so, handling jobs
 continually during this time. Then suddenly it will be the case that all
 29
 TaskTrackers will get disconnected from the JobTracker. All the tracker
 daemon processes are still running on each machine; but the JobTracker
 will
 say 0 nodes available on the web status screen. Restarting MapReduce
 fixes
 this for another 90--120 minutes.

 This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
 but
 we're running on 0.18.1.

 I found this in a TaskTracker log:

 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
 exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.init(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at

 org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at
 org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
 Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at

 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at

 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at

 org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


 As well as a few of these warnings:
 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
 THREADS
 ((40-40+0)1) on [EMAIL PROTECTED]:50060
 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
 THREADS: [EMAIL PROTECTED]:50060



 The NameNode and DataNodes are completely fine. Can't be a DNS issue,
 because all DNS is served through /etc/hosts files. NameNode and
 JobTracker
 are on the same machine.

 Any help is appreciated
 Thanks
 - Aaron Kimball





Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
Could the version of Java being used matter? I just realized this cluster
runs IBM Java, not Sun:

java version 1.6.0
Java(TM) SE Runtime Environment (build pxi3260sr2-20080818_01(SR2))
IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux x86-32
jvmxi3260-20080816_22093 (JIT enabled, AOT enabled)
J9VM - 20080816_022093_lHdSMr
JIT  - r9_20080721_1330ifx2
GC   - 20080724_AA)
JCL  - 20080808_02

- Aaron

On Wed, Oct 29, 2008 at 2:43 PM, Aaron Kimball [EMAIL PROTECTED] wrote:

 The system load and memory consumption on the JT are both very close to
 idle states -- it's not overworked, I don't think

 I may have an idea of the problem, though. Digging back up a ways into the
 JT logs, I see this:

 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: 
 error: java.io.IOException: java.lang.NullPointerException

 java.io.IOException: java.lang.NullPointerException
   at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)

   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
   at java.lang.reflect.Method.invoke(Method.java:599)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



 This exception is then repeated for all the IPC server handlers. So I think
 the problem is that all the handler threads are dying one by one due to this
 NPE.

 This something I can fix myself, or is a patch available?

 - Aaron


 On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

 It's possible that the JobTracker is under duress and unable to respond to
 the TaskTrackers... what do the JobTracker logs say?

 Arun


 On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

  Hi all,

 I'm working with a 30 node Hadoop cluster that has just started
 demonstrating some weird behavior. It's run without incident for a few
 weeks.. and now:

 The cluster will run smoothly for 90--120 minutes or so, handling jobs
 continually during this time. Then suddenly it will be the case that all
 29
 TaskTrackers will get disconnected from the JobTracker. All the tracker
 daemon processes are still running on each machine; but the JobTracker
 will
 say 0 nodes available on the web status screen. Restarting MapReduce
 fixes
 this for another 90--120 minutes.

 This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
 but
 we're running on 0.18.1.

 I found this in a TaskTracker log:

 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:
 Caught
 exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.init(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at

 org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at
 org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
 Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at

 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at

 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at

 org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


 As well as a few of these warnings:
 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
 THREADS
 ((40-40+0)1) on [EMAIL PROTECTED]:50060
 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
 THREADS: [EMAIL PROTECTED]:50060



 The NameNode and DataNodes are completely fine. Can't be a DNS issue,
 because all DNS is served through /etc/hosts files. NameNode and
 

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Devaraj Das


On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:

 The system load and memory consumption on the JT are both very close to
 idle states -- it's not overworked, I don't think
 
 I may have an idea of the problem, though. Digging back up a ways into the
 JT logs, I see this:
 
 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 4 on 9001, call killJob(job_200810290855_0025) from
 10.1.143.245:48253: error: java.io.IOException:
 java.lang.NullPointerException
 java.io.IOException: java.lang.NullPointerException
 at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:599)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
 
 
 
 This exception is then repeated for all the IPC server handlers. So I think
 the problem is that all the handler threads are dying one by one due to this
 NPE.
 

This should not happen; the IPC handler catches Throwable and handles it.
Could you give more details, like the kind of jobs (long/short) you are
running, how many tasks they have, etc.?
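
A minimal sketch of that handler pattern, purely illustrative and not the actual
org.apache.hadoop.ipc.Server code: catching Throwable around each call is what
should keep a handler thread alive even when a call throws an NPE.

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  // Illustrative only: a handler thread that survives exceptions thrown by calls.
  public class HandlerSketch {
    public static void main(String[] args) throws Exception {
      final BlockingQueue<Runnable> callQueue = new LinkedBlockingQueue<Runnable>();
      Thread handler = new Thread() {
        public void run() {
          while (true) {
            try {
              callQueue.take().run();        // process one queued "RPC"
            } catch (Throwable t) {          // e.g. the NPE above
              System.err.println("call failed, handler keeps running: " + t);
            }
          }
        }
      };
      handler.setDaemon(true);
      handler.start();
      callQueue.put(new Runnable() { public void run() { throw new NullPointerException(); } });
      callQueue.put(new Runnable() { public void run() { System.out.println("next call still served"); } });
      Thread.sleep(500);                     // give the handler time to drain the queue
    }
  }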

 This something I can fix myself, or is a patch available?
 
 - Aaron
 
 On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
 
 It's possible that the JobTracker is under duress and unable to respond to
 the TaskTrackers... what do the JobTracker logs say?
 
 Arun
 
 
 On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
 
  Hi all,
 
 I'm working with a 30 node Hadoop cluster that has just started
 demonstrating some weird behavior. It's run without incident for a few
 weeks.. and now:
 
 The cluster will run smoothly for 90--120 minutes or so, handling jobs
 continually during this time. Then suddenly it will be the case that all
 29
 TaskTrackers will get disconnected from the JobTracker. All the tracker
 daemon processes are still running on each machine; but the JobTracker
 will
 say 0 nodes available on the web status screen. Restarting MapReduce
 fixes
 this for another 90--120 minutes.
 
 This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
 but
 we're running on 0.18.1.
 
 I found this in a TaskTracker log:
 
 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
 exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.init(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
 Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
 
 
 As well as a few of these warnings:
 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
 THREADS
 ((40-40+0)1) on [EMAIL PROTECTED]:50060
 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
 THREADS: [EMAIL PROTECTED]:50060
 
 
 
 The NameNode and DataNodes are completely fine. Can't be a DNS issue,
 because all DNS is served through /etc/hosts files. NameNode and
 JobTracker
 are on the same machine.
 
 Any help is appreciated
 Thanks
 - Aaron Kimball
 
 
 




Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
It's a cluster being used for a university course; there are 30 students 
all running code which (to be polite) probably tests the limits of 
Hadoop's failure recovery logic. :)


The current assignment is PageRank over Wikipedia; a 20 GB input corpus. 
Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 
reduce tasks.
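
(For reference, a sketch of how a driver asks for task counts like those through
the old mapred API; the class and job names are placeholders rather than the
actual course code, and setNumMapTasks() is only a hint, since the real map count
follows the input splits.)

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  // Placeholder driver; PageRankIteration stands in for the students' actual job class.
  public class PageRankDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(PageRankIteration.class);
      conf.setJobName("pagerank-iteration");
      conf.setNumMapTasks(300);      // hint only; actual map count follows the input splits
      conf.setNumReduceTasks(50);
      JobClient.runJob(conf);
    }
  }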


I wrote a patch to address the NPE in JobTracker.killJob() and compiled 
it against TRUNK. I've put this on the cluster and it's now been holding 
steady for the last hour or so.. so that plus whatever other differences 
there are between 18.1 and TRUNK may have fixed things. (I'll submit the 
patch to the JIRA as soon as it finishes cranking against the JUnit tests)
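
A hypothetical sketch of the kind of defensive check such a patch might add; the
actual JobTracker.killJob() code and the eventual JIRA patch may well differ.

  // Hypothetical sketch only -- not the real Hadoop source or the submitted patch.
  // Assumption: the NPE happens when the job id no longer maps to a JobInProgress,
  // so the lookup is guarded before calling kill(). 'jobs' and 'LOG' stand in for
  // existing JobTracker fields.
  public synchronized void killJob(String jobid) throws IOException {
    JobInProgress job = jobs.get(jobid);
    if (job == null) {
      LOG.info("killJob(): unknown or retired job " + jobid + ", ignoring");
      return;
    }
    job.kill();
  }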


- Aaron


Devaraj Das wrote:


On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:


The system load and memory consumption on the JT are both very close to
idle states -- it's not overworked, I don't think

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I think
the problem is that all the handler threads are dying one by one due to this
NPE.



This should not happen. IPC handler catches Throwable and handles that.
Could you give more details like the kind of jobs (long/short) you are
running, how many tasks they have, etc.


This something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:


It's possible that the JobTracker is under duress and unable to respond to
the TaskTrackers... what do the JobTracker logs say?

Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

 Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case that all
29
TaskTrackers will get disconnected from the JobTracker. All the tracker
daemon processes are still running on each machine; but the JobTracker
will
say 0 nodes available on the web status screen. Restarting MapReduce
fixes
this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call failed on local exception
  at java.lang.Throwable.init(Throwable.java:67)
  at org.apache.hadoop.ipc.Client.call(Client.java:718)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
  at sun.nio.ch.IOUtil.read(IOUtil.java:207)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
  at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
  at java.io.FilterInputStream.read(FilterInputStream.java:127)
  at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)

  at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
  at java.io.DataInputStream.readInt(DataInputStream.java:381)
  at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
  at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As 

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
Just as I wrote that, Murphy's law struck :) This did not fix the issue 
after all.


I think the problem is occurring because a huge amount of network 
bandwidth is being consumed by the jobs. What settings (timeouts, thread 
counts, etc), if any, ought I dial up to correct for this?
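
A hedged sketch of the settings that typically get raised in this situation; the
property names are from a 0.18-era hadoop-default.xml as best recalled, the values
are only illustrative, and they normally belong in hadoop-site.xml (the
Configuration API is used here just to keep the example compact), so please verify
them before relying on any of this:

  import org.apache.hadoop.conf.Configuration;

  public class TuningSketch {
    public static void main(String[] args) {
      // Illustrative values; double-check every property against your hadoop-default.xml.
      Configuration conf = new Configuration();
      conf.setInt("mapred.job.tracker.handler.count", 20);   // JobTracker RPC handler threads (default 10)
      conf.setInt("tasktracker.http.threads", 80);           // Jetty threads serving map output (default 40)
      conf.setInt("mapred.tasktracker.expiry.interval", 20 * 60 * 1000); // ms before a quiet TT is marked lost
      conf.setInt("ipc.client.connect.max.retries", 20);     // client connect retries to the JT
    }
  }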


Thanks,
- Aaron

Aaron Kimball wrote:
It's a cluster being used for a university course; there are 30 students 
all running code which (to be polite) probably tests the limits of 
Hadoop's failure recovery logic. :)


The current assignment is PageRank over Wikipedia; a 20 GB input corpus. 
Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 
reduce tasks.


I wrote a patch to address the NPE in JobTracker.killJob() and compiled 
it against TRUNK. I've put this on the cluster and it's now been holding 
steady for the last hour or so.. so that plus whatever other differences 
there are between 18.1 and TRUNK may have fixed things. (I'll submit the 
patch to the JIRA as soon as it finishes cranking against the JUnit tests)


- Aaron


Devaraj Das wrote:


On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:


The system load and memory consumption on the JT are both very close to
idle states -- it's not overworked, I don't think

I may have an idea of the problem, though. Digging back up a ways 
into the

JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I 
think
the problem is that all the handler threads are dying one by one due 
to this

NPE.



This should not happen. IPC handler catches Throwable and handles that.
Could you give more details like the kind of jobs (long/short) you are
running, how many tasks they have, etc.


This something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] 
wrote:


It's possible that the JobTracker is under duress and unable to 
respond to

the TaskTrackers... what do the JobTracker logs say?

Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

 Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case 
that all

29
TaskTrackers will get disconnected from the JobTracker. All the 
tracker

daemon processes are still running on each machine; but the JobTracker
will
say 0 nodes available on the web status screen. Restarting MapReduce
fixes
this for another 90--120 minutes.

This looks similar to 
https://issues.apache.org/jira/browse/HADOOP-1763,

but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: 
Caught

exception: java.io.IOException: Call failed on local exception
  at java.lang.Throwable.init(Throwable.java:67)
  at org.apache.hadoop.ipc.Client.call(Client.java:718)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
  at sun.nio.ch.IOUtil.read(IOUtil.java:207)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
  at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
  at java.io.FilterInputStream.read(FilterInputStream.java:127)
  at



Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Devaraj Das
 
 I wrote a patch to address the NPE in JobTracker.killJob() and compiled
 it against TRUNK. I've put this on the cluster and it's now been holding
 steady for the last hour or so.. so that plus whatever other differences
 there are between 18.1 and TRUNK may have fixed things. (I'll submit the
 patch to the JIRA as soon as it finishes cranking against the JUnit tests)
 

Aaron, I don't think this is a solution to the problem you are seeing. The
IPC handlers are tolerant of exceptions; in particular, they must not die in
the event of an exception during RPC processing. Could you please get a
stack trace of the JobTracker threads (without your patch) while the TTs are
unable to talk to it? Accessing the url http://jt-host:jt-info-port/stacks
will tell us what the handlers are up to.
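
If it is easier to grab from a script than from a browser, a small sketch along
these lines dumps the servlet output; the host and port are placeholders (50030 is
only the usual JT info port):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;

  // Fetch the JobTracker's /stacks page and print it to stdout.
  public class DumpStacks {
    public static void main(String[] args) throws Exception {
      URL url = new URL("http://jt-host:50030/stacks");   // placeholder host/port
      BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
      in.close();
    }
  }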

 - Aaron
 
 
 Devaraj Das wrote:
 
 On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:
 
 The system load and memory consumption on the JT are both very close to
 idle states -- it's not overworked, I don't think
 
 I may have an idea of the problem, though. Digging back up a ways into the
 JT logs, I see this:
 
 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 4 on 9001, call killJob(job_200810290855_0025) from
 10.1.143.245:48253: error: java.io.IOException:
 java.lang.NullPointerException
 java.io.IOException: java.lang.NullPointerException
 at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:599)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
 
 
 
 This exception is then repeated for all the IPC server handlers. So I think
 the problem is that all the handler threads are dying one by one due to this
 NPE.
 
 
 This should not happen. IPC handler catches Throwable and handles that.
 Could you give more details like the kind of jobs (long/short) you are
 running, how many tasks they have, etc.
 
 This something I can fix myself, or is a patch available?
 
 - Aaron
 
 On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
 
 It's possible that the JobTracker is under duress and unable to respond to
 the TaskTrackers... what do the JobTracker logs say?
 
 Arun
 
 
 On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
 
  Hi all,
 I'm working with a 30 node Hadoop cluster that has just started
 demonstrating some weird behavior. It's run without incident for a few
 weeks.. and now:
 
 The cluster will run smoothly for 90--120 minutes or so, handling jobs
 continually during this time. Then suddenly it will be the case that all
 29
 TaskTrackers will get disconnected from the JobTracker. All the tracker
 daemon processes are still running on each machine; but the JobTracker
 will
 say 0 nodes available on the web status screen. Restarting MapReduce
 fixes
 this for another 90--120 minutes.
 
 This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
 but
 we're running on 0.18.1.
 
 I found this in a TaskTracker log:
 
 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
 exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.init(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
 Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at