> 
> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
> it against TRUNK. I've put this on the cluster and it's now been holding
> steady for the last hour or so.. so that plus whatever other differences
> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
> patch to the JIRA as soon as it finishes cranking against the JUnit tests)
> 

Aaron, I don't think this is a solution to the problem you are seeing. The
IPC handlers are tolerant to exceptions. In particular, they must not die in
the event of an exception during RPC processing. Could you please get a
stack trace of the JobTracker threads (without your patch) when the TTs are
unable to talk to it. Access the url http://<jt-host>:<jt-info-port>/stacks
That will tell us what the handlers are up to.

> - Aaron
> 
> 
> Devaraj Das wrote:
>> 
>> On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:
>> 
>>> The system load and memory consumption on the JT are both very close to
>>> "idle" states -- it's not overworked, I don't think
>>> 
>>> I may have an idea of the problem, though. Digging back up a ways into the
>>> JT logs, I see this:
>>> 
>>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 4 on 9001, call killJob(job_200810290855_0025) from
>>> 10.1.143.245:48253: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>> at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45>>>
)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
>>> .j
>>> ava:37)
>>> at java.lang.reflect.Method.invoke(Method.java:599)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>> 
>>> 
>>> 
>>> This exception is then repeated for all the IPC server handlers. So I think
>>> the problem is that all the handler threads are dying one by one due to this
>>> NPE.
>>> 
>> 
>> This should not happen. IPC handler catches Throwable and handles that.
>> Could you give more details like the kind of jobs (long/short) you are
>> running, how many tasks they have, etc.
>> 
>>> This something I can fix myself, or is a patch available?
>>> 
>>> - Aaron
>>> 
>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>>> 
>>>> It's possible that the JobTracker is under duress and unable to respond to
>>>> the TaskTrackers... what do the JobTracker logs say?
>>>> 
>>>> Arun
>>>> 
>>>> 
>>>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>>> 
>>>>  Hi all,
>>>>> I'm working with a 30 node Hadoop cluster that has just started
>>>>> demonstrating some weird behavior. It's run without incident for a few
>>>>> weeks.. and now:
>>>>> 
>>>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>>>> continually during this time. Then suddenly it will be the case that all
>>>>> 29
>>>>> TaskTrackers will get disconnected from the JobTracker. All the tracker
>>>>> daemon processes are still running on each machine; but the JobTracker
>>>>> will
>>>>> say "0 nodes available" on the web status screen. Restarting MapReduce
>>>>> fixes
>>>>> this for another 90--120 minutes.
>>>>> 
>>>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>>>> but
>>>>> we're running on 0.18.1.
>>>>> 
>>>>> I found this in a TaskTracker log:
>>>>> 
>>>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>>>>> exception: java.io.IOException: Call failed on local exception
>>>>>   at java.lang.Throwable.<init>(Throwable.java:67)
>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>>>   at
>>>>> 
>>>>> 
>> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045>
>> >>
>> )
>>>>>   at
>>>>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>>>> Caused by: java.io.IOException: Connection reset by peer
>>>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>>>   at
>>>>> 
>>>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream
>>>>> .j
>>>>> ava:55)
>>>>>   at
>>>>> 
>>>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:14
>>>>> 0)
>>>>>   at
>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>>>   at
>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>>>   at
>>>>> 
>>>>> 
>> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272>
>> >>
>> )
>>>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>>>   at
>>>>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>>> 
>>>>> 
>>>>> As well as a few of these warnings:
>>>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>>>> THREADS
>>>>> ((40-40+0)<1) on [EMAIL PROTECTED]:50060
>>>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>>>> THREADS: [EMAIL PROTECTED]:50060
>>>>> 
>>>>> 
>>>>> 
>>>>> The NameNode and DataNodes are completely fine. Can't be a DNS issue,
>>>>> because all DNS is served through /etc/hosts files. NameNode and
>>>>> JobTracker
>>>>> are on the same machine.
>>>>> 
>>>>> Any help is appreciated
>>>>> Thanks
>>>>> - Aaron Kimball
>>>>> 
>>>> 
>> 
>> 


Reply via email to