Re: Please help me with heartbeat storm

Eremikhin Alexey Mon, 27 May 2013 01:43:12 -0700

Hi!

Tried 5 seconds. Fewer nodes get into the storm, but some still do.
Additionally, updating the ntp service helped a little.

Initially almost 50% of the nodes started storming on each MR job. After the ntp update and increasing the heartbeat interval to 5 seconds, the level is around 10%.



On 26/05/13 10:43, murali adireddy wrote:

Hi,

Just try this one.

In the file "hdfs-site.xml", try adding the property "dfs.heartbeat.interval" shown below, with its value in seconds.

The default value is '3' seconds. In your case, increase the value.

<property>
 <name>dfs.heartbeat.interval</name>
 <value>3</value>
</property>

You can find more properties and their default values at the link below.

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml


Please let me know whether the above solution worked for you.

On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <[email protected]> wrote:


    Hi all,
    I have a 29-server Hadoop cluster in an almost default configuration.
    After installing Hadoop 1.0.4 I noticed that the JT and some TTs
    waste CPU.
    I started stracing their behaviour and found that some TTs send
    heartbeats at an unbounded rate,
    meaning hundreds per second.

    A daemon restart solves the issue, but even the simplest Hive MR job
    brings it back.

    Here is the filtered strace of the heartbeating process:

    hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write


    [pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
    [pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
    [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
    [pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
    [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
    [pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>


    Please help me to stop this storming 8(
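
For anyone wanting to put a number on the rate from a capture like the strace output quoted above, here is a minimal sketch. It assumes the strace output uses the `-tt` timestamp format shown in the thread and that all timestamps fall on the same day. The function name `heartbeat_rate` and the shortened sample lines are hypothetical, not part of Hadoop or strace.

```python
import re
from datetime import datetime

# Matches the start of a write() call in `strace -tt -f` output, e.g.
# "[pid  6065] 13:07:34.801106 write(70,"
# Lines such as "<... write resumed>" do not match and are ignored.
WRITE_START = re.compile(
    r"\[pid\s+(\d+)\]\s+(\d{2}:\d{2}:\d{2}\.\d{6})\s+write\("
)


def heartbeat_rate(lines, pid):
    """Return (count, writes_per_second) for write() calls started by pid.

    Hypothetical helper: it does not inspect the payload, so it assumes
    the input was already filtered down to the heartbeating thread.
    """
    stamps = []
    for line in lines:
        m = WRITE_START.search(line)
        if m and int(m.group(1)) == pid:
            stamps.append(datetime.strptime(m.group(2), "%H:%M:%S.%f"))
    if len(stamps) < 2:
        return len(stamps), 0.0
    span = (stamps[-1] - stamps[0]).total_seconds()
    if span <= 0:  # e.g. capture crossed midnight; not handled here
        return len(stamps), 0.0
    # N calls delimit N-1 intervals
    return len(stamps), (len(stamps) - 1) / span


# Timestamps taken from the capture in this thread; payloads shortened.
sample = [
    '[pid  6065] 13:07:34.801106 write(70, "...heartbeat...", 284) = 284',
    '[pid  6065] 13:07:34.807968 write(70, "...heartbeat...", 284 <unfinished ...>',
    "[pid  6065] 13:07:34.808080 <... write resumed> ) = 284",
    '[pid  6065] 13:07:34.814473 write(70, "...heartbeat...", 284 <unfinished ...>',
    "[pid  6065] 13:07:34.814595 <... write resumed> ) = 284",
    '[pid  6065] 13:07:34.820960 write(70, "...heartbeat...", 284 <unfinished ...>',
]

count, rate = heartbeat_rate(sample, 6065)
print(f"{count} heartbeat writes, about {rate:.0f} per second")
```

On the four writes quoted in the thread this gives roughly 151 writes per second, which is consistent with the "hundreds per second" description. Four samples are of course too few for a reliable figure; a longer capture saved to a file and passed in line by line would give a better estimate.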
