Re: Please help me with heartbeat storm

Eremikhin Alexey Sat, 25 May 2013 12:28:09 -0700

Hi Roland

Here is my configuration:
SLES11 SP1
hadoop 1.0.4
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)


It seems nothing matches except the Hadoop version 8)

On 25.05.2013 19:44, Roland von Herget wrote:

Hi Alexey,

I don't know the solution to this problem, but I can second this; I'm seeing nearly the same: my TaskTrackers are flooding the JobTracker with heartbeats. This starts after the first mapred job and can be repaired by restarting the TaskTracker. The TT nodes have high system CPU usage; the JT is not suffering from this.


my environment:
debian 6.0.7
hadoop 1.0.4
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

What's your environment?

--Roland

On Fri, May 24, 2013 at 3:10 PM, Eremikhin Alexey wrote:


    Hi all,
    I have a 29-server Hadoop cluster in an almost default configuration.
    After installing Hadoop 1.0.4 I noticed that the JT and some TTs
    waste CPU.
    I straced them and found that some TTs send heartbeats without
    any limit, that is, hundreds per second.

    A daemon restart solves the issue, but even the simplest Hive MR
    job brings it back.

    Here is the filtered strace of the heartbeating process:

    hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write


    [pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
    [pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
    [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
    [pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
    [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
    [pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>


    Please help me stop this storm 8(
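
For anyone who wants to put a number on "hundreds per second", here is a rough Python sketch that estimates the heartbeat rate from the timestamps in a `strace -tt -f` trace like the one quoted above. The function names (`write_timestamps`, `mean_interval_ms`) are made up for this illustration, the sample lines are only the call headers from that trace with the payloads left out, and the script assumes the trace does not cross midnight.

```python
import re
from datetime import datetime

# Matches the start of a write() call in `strace -tt -f` output,
# e.g. "[pid  6065] 13:07:34.801106 write(70,". Resumed calls
# ("<... write resumed>") are deliberately not matched.
WRITE_START = re.compile(r"\[pid\s+(\d+)\]\s+(\d{2}:\d{2}:\d{2}\.\d{6})\s+write\(")


def write_timestamps(lines, pid):
    """Return the start time of every write() issued by the given thread."""
    stamps = []
    for line in lines:
        m = WRITE_START.search(line)
        if m and int(m.group(1)) == pid:
            stamps.append(datetime.strptime(m.group(2), "%H:%M:%S.%f"))
    return stamps


def mean_interval_ms(stamps):
    """Average gap between consecutive timestamps, in milliseconds."""
    if len(stamps) < 2:
        raise ValueError("need at least two timestamps")
    total = (stamps[-1] - stamps[0]).total_seconds()
    return 1000.0 * total / (len(stamps) - 1)


# Call headers copied from the quoted trace (payloads omitted).
sample = [
    "[pid  6065] 13:07:34.801106 write(70,",
    "[pid  6065] 13:07:34.807968 write(70,",
    "[pid  6065] 13:07:34.808080 <... write resumed> ) = 284",
    "[pid  6065] 13:07:34.814473 write(70,",
    "[pid  6065] 13:07:34.814595 <... write resumed> ) = 284",
    "[pid  6065] 13:07:34.820960 write(70,",
]

stamps = write_timestamps(sample, pid=6065)
gap = mean_interval_ms(stamps)
print(f"{len(stamps)} heartbeats, mean gap {gap:.2f} ms, "
      f"about {1000.0 / gap:.0f} per second")
```

On the four calls shown, this gives a mean gap of about 6.6 ms, roughly 150 heartbeats per second from a single TaskTracker. To run it against a live trace, feed the output of the strace command into `write_timestamps` instead of the `sample` list.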
