Re: Please help me with heartbeat storm

murali adireddy Sat, 25 May 2013 23:44:03 -0700

Hi,

Try this.

In the file "hdfs-site.xml", add the property "dfs.heartbeat.interval".
Its value is given in seconds.

The default value is 3 seconds. In your case, increase it.

<property>
 <name>dfs.heartbeat.interval</name>
 <value>3</value>
</property>
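
For example, here is a minimal sketch of a complete hdfs-site.xml that raises the interval to 10 seconds. The value 10 is only an illustrative assumption, not a recommendation, so pick one that suits your cluster. Note that, per hdfs-default.xml, this property sets the DataNode heartbeat interval, and as far as I know the daemons must be restarted to pick up the change.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- DataNode heartbeat interval, in seconds (default: 3). -->
  <!-- 10 is an illustrative value only. -->
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>10</value>
  </property>
</configuration>
```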

You can find more properties and their default values at the link below.

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml


Please let me know whether this works for you.




On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <[email protected]> wrote:

> Hi all,
> I have a 29-server Hadoop cluster in an almost default configuration.
> After installing Hadoop 1.0.4, I noticed that the JT and some TTs waste CPU.
> I straced them and found that some TTs send heartbeats at an unlimited
> rate, meaning hundreds per second.
>
> Restarting the daemon solves the issue, but even the simplest Hive MR job
> brings it back.
>
> Here is the filtered strace output of the heartbeating process:
>
> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>
>
> [pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
> [pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
> [pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
> [pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>
>
>
> Please help me stop this storm 8(
>
>
