The system load and memory consumption on the JT are both very close to
idle levels -- I don't think it's overworked.

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:599)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

This exception is then repeated for each of the IPC server handlers, so I
think the handler threads are dying one by one due to this NPE.
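
If the trigger is a kill request for a job id the JobTracker no longer
tracks, a guard on the lookup would turn the NPE into a clean error back to
the client instead of killing the handler thread. Here's a minimal sketch of
that pattern -- this is hypothetical code, not the actual JobTracker source;
the JobRegistry and register names are made up for illustration:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the null-guard pattern, not real JobTracker code.
public class JobRegistry {
    private final Map<String, Object> jobs = new ConcurrentHashMap<>();

    public void register(String jobId) {
        jobs.put(jobId, new Object());
    }

    public void killJob(String jobId) throws IOException {
        Object job = jobs.get(jobId);
        if (job == null) {
            // Without this check, dereferencing 'job' throws a
            // NullPointerException, which the RPC layer wraps and
            // rethrows -- matching the trace above.
            throw new IOException("Unknown job: " + jobId);
        }
        jobs.remove(jobId);  // stand-in for the real kill logic
    }
}
```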

Is this something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> It's possible that the JobTracker is under duress and unable to respond to
> the TaskTrackers... what do the JobTracker logs say?
>
> Arun
>
>
> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>
>  Hi all,
>>
>> I'm working with a 30-node Hadoop cluster that has just started
>> demonstrating some weird behavior. It's run without incident for a few
>> weeks... and now:
>>
>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>> continually during this time. Then suddenly all 29 TaskTrackers will get
>> disconnected from the JobTracker. All the tracker daemon processes are
>> still running on each machine, but the JobTracker will say "0 nodes
>> available" on the web status screen. Restarting MapReduce fixes this for
>> another 90--120 minutes.
>>
>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>> but we're running on 0.18.1.
>>
>> I found this in a TaskTracker log:
>>
>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>> exception: java.io.IOException: Call failed on local exception
>>   at java.lang.Throwable.<init>(Throwable.java:67)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>> Caused by: java.io.IOException: Connection reset by peer
>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>
>>
>> As well as a few of these warnings:
>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>> THREADS
>> ((40-40+0)<1) on [EMAIL PROTECTED]:50060
>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>> THREADS: [EMAIL PROTECTED]:50060
>>
>>
>>
>> The NameNode and DataNodes are completely fine. It can't be a DNS issue,
>> because all name resolution goes through /etc/hosts files. The NameNode
>> and JobTracker are on the same machine.
>>
>> Any help is appreciated.
>> Thanks,
>> - Aaron Kimball
>>
>
>
