Aaron, does this have to do with the cloud we got at SoftLayer, or is it something at the university like you said?
I think I provisioned 6 - 10 with 1G connections; let me log in and check really quick.. no, they're all correct: Speed: 1000Mb/s

On Wed, Oct 29, 2008 at 9:49 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote:

> Just as I wrote that, Murphy's law struck :) This did not fix the issue
> after all.
>
> I think the problem is occurring because a huge amount of network bandwidth
> is being consumed by the jobs. What settings (timeouts, thread counts, etc.),
> if any, ought I dial up to correct for this?
>
> Thanks,
> - Aaron
>
>
> Aaron Kimball wrote:
>
>> It's a cluster being used for a university course; there are 30 students
>> all running code which (to be polite) probably tests the limits of Hadoop's
>> failure recovery logic. :)
>>
>> The current assignment is PageRank over Wikipedia, a 20 GB input corpus.
>> Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50
>> reduce tasks.
>>
>> I wrote a patch to address the NPE in JobTracker.killJob() and compiled it
>> against TRUNK. I've put this on the cluster and it's now been holding steady
>> for the last hour or so.. so that, plus whatever other differences there are
>> between 18.1 and TRUNK, may have fixed things. (I'll submit the patch to the
>> JIRA as soon as it finishes cranking against the JUnit tests.)
>>
>> - Aaron
>>
>>
>> Devaraj Das wrote:
>>
>>> On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:
>>>
>>>> The system load and memory consumption on the JT are both very close to
>>>> "idle" states -- it's not overworked, I don't think.
>>>>
>>>> I may have an idea of the problem, though.
>>>> Digging back up a ways into the JT logs, I see this:
>>>>
>>>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>> handler 4 on 9001, call killJob(job_200810290855_0025) from
>>>> 10.1.143.245:48253: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>   at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>>>   at java.lang.reflect.Method.invoke(Method.java:599)
>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>>>
>>>> This exception is then repeated for all the IPC server handlers. So I
>>>> think the problem is that all the handler threads are dying one by one
>>>> due to this NPE.
>>>
>>> This should not happen. The IPC handler catches Throwable and handles that.
>>> Could you give more details, like the kind of jobs (long/short) you are
>>> running, how many tasks they have, etc.?
>>>
>>>> Is this something I can fix myself, or is a patch available?
>>>>
>>>> - Aaron
>>>>
>>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> It's possible that the JobTracker is under duress and unable to respond
>>>>> to the TaskTrackers... what do the JobTracker logs say?
>>>>>
>>>>> Arun
>>>>>
>>>>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm working with a 30 node Hadoop cluster that has just started
>>>>>> demonstrating some weird behavior. It's run without incident for a few
>>>>>> weeks..
>>>>>> and now:
>>>>>>
>>>>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>>>>> continually during this time. Then suddenly all 29 TaskTrackers will get
>>>>>> disconnected from the JobTracker. All the tracker daemon processes are
>>>>>> still running on each machine, but the JobTracker will say "0 nodes
>>>>>> available" on the web status screen. Restarting MapReduce fixes this for
>>>>>> another 90--120 minutes.
>>>>>>
>>>>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>>>>> but we're running on 0.18.1.
>>>>>>
>>>>>> I found this in a TaskTracker log:
>>>>>>
>>>>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:
>>>>>> Caught exception: java.io.IOException: Call failed on local exception
>>>>>>   at java.lang.Throwable.<init>(Throwable.java:67)
>>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>>>>> Caused by: java.io.IOException: Connection reset by peer
>>>>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>>>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>>>>   at
>>>>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>>>>   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>>>>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>>>>
>>>>>> As well as a few of these warnings:
>>>>>>
>>>>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>>>>> THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
>>>>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>>>>> THREADS: [EMAIL PROTECTED]:50060
>>>>>>
>>>>>> The NameNode and DataNodes are completely fine. It can't be a DNS issue,
>>>>>> because all DNS is served through /etc/hosts files. The NameNode and
>>>>>> JobTracker are on the same machine.
>>>>>>
>>>>>> Any help is appreciated.
>>>>>> Thanks,
>>>>>> - Aaron Kimball
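
On the "what settings ought I dial up" question in the quoted thread, here is a hedged sketch of hadoop-site.xml overrides that map to the symptoms, not a tested fix. Property names are from the 0.18-era hadoop-default.xml as best I recall, and the values are illustrative starting points only:

```xml
<!-- Hedged sketch only. Verify names against your hadoop-default.xml. -->
<configuration>
  <!-- The Jetty "LOW ON THREADS ((40-40+0)<1)" warning matches the default
       of 40 HTTP threads per TaskTracker serving map output to reducers;
       raising it may relieve the shuffle-heavy PageRank jobs. -->
  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
  <!-- More RPC handler threads on the JobTracker (port 9001), so that
       heartbeats are less likely to starve while handlers are busy. -->
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>20</value>
  </property>
</configuration>
```

Either way, if the real cause is the killJob() NPE taking handlers down one by one, more threads would only delay the lockup rather than cure it; the patch against TRUNK is the actual fix.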