This might be relevant: https://issues.apache.org/jira/browse/MAPREDUCE-4478
"There are two configuration items to control the TaskTracker's heartbeat interval. One is *mapreduce.tasktracker.outofband.heartbeat*. The other is* mapreduce.tasktracker.outofband.heartbeat.damper*. If we set * mapreduce.tasktracker.outofband.heartbeat* with true and set* mapreduce.tasktracker.outofband.heartbeat.damper* with default value (1000000), TaskTracker may send heartbeat without any interval." Philippe ------------------------------- *Philippe Signoret* On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan < rajesh.balamo...@gmail.com> wrote: > Default value of CLUSTER_INCREMENT is 100. Math.max(1000* 29/100, 3000) = > 3000 always. This is the reason why you are seeing so many heartbeats. *You > might want to set it to 1 or 5.* This would increase the time taken to > send the heartbeat from TT to JT. > > > ~Rajesh.B > > > On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey < > a.eremi...@corp.badoo.com> wrote: > >> Hi! >> >> Tried 5 seconds. Less number of nodes get into storm, but still they do. >> Additionaly update of ntp service helped a little. >> >> Initially almost 50% got into storming each MR job. But after ntp update >> and and increasing heart-beatto 5 seconds level is around 10%. >> >> >> On 26/05/13 10:43, murali adireddy wrote: >> >> Hi , >> >> Just try this one. >> >> in the file "hdfs-site.xml" try to add the below property >> "dfs.heartbeat.interval" and value in seconds. >> >> Default value is '3' seconds. In your case increase value. >> >> <property> >> <name>dfs.heartbeat.interval</name> >> <value>3</value> >> </property> >> >> You can find more properties and default values in the below link. >> >> >> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml >> >> >> Please let me know is the above solution worked for you ..? 
>>
>>
>> On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>> a.eremi...@corp.badoo.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a 29-server Hadoop cluster in an almost default configuration.
>>> After installing Hadoop 1.0.4, I noticed that the JT and some TTs waste
>>> CPU. I started stracing their behaviour and found that some TTs send
>>> heartbeats without any limit, meaning hundreds per second.
>>>
>>> Restarting the daemon solves the issue, but even the simplest Hive MR
>>> job brings it back.
>>>
>>> Here is the filtered strace of the heartbeating process:
>>>
>>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>>>
>>> [pid 6065] 13:07:34.801106 write(70,
>>> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>> 284) = 284
>>> [pid 6065] 13:07:34.807968 write(70,
>>> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>> 284 <unfinished ...>
>>> [pid 6065] 13:07:34.808080 <... write resumed> ) = 284
>>> [pid 6065] 13:07:34.814473 write(70,
>>> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>> 284 <unfinished ...>
>>> [pid 6065] 13:07:34.814595 <... write resumed> ) = 284
>>> [pid 6065] 13:07:34.820960 write(70,
>>> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>> 284 <unfinished ...>
>>>
>>> Please help me to stop this storming 8(
>>>
>
> --
> ~Rajesh.B
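[Editor's note] The interval calculation Rajesh quotes above can be sketched as follows. This is a reconstruction from the numbers in the thread, not the actual Hadoop 1.x source: the method name and the HEARTBEAT_INTERVAL_MIN label are my own; clusterIncrement stands in for CLUSTER_INCREMENT.

```java
// Sketch of the JobTracker-side heartbeat interval from the thread above.
// With 29 trackers and the default increment of 100, the 3000 ms floor
// always wins, which is why every TT heartbeats at the minimum interval.
public class HeartbeatInterval {
    static final int HEARTBEAT_INTERVAL_MIN = 3000; // 3-second floor, in ms

    // clusterIncrement plays the role of CLUSTER_INCREMENT (default 100)
    static int nextHeartbeatInterval(int clusterSize, int clusterIncrement) {
        return Math.max(1000 * clusterSize / clusterIncrement,
                        HEARTBEAT_INTERVAL_MIN);
    }

    public static void main(String[] args) {
        // Default increment: Math.max(290, 3000) -> the floor wins.
        System.out.println(nextHeartbeatInterval(29, 100)); // 3000
        // Rajesh's suggestion of 5 stretches the interval past the floor.
        System.out.println(nextHeartbeatInterval(29, 5));   // 5800
    }
}
```

Lowering the increment to 1, as also suggested, would give 29000 ms between heartbeats on this 29-node cluster.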
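[Editor's note] Taken together, the suggestions in this thread (keeping out-of-band heartbeats off, per MAPREDUCE-4478, and lowering the CLUSTER_INCREMENT property) would translate to roughly the following mapred-site.xml fragment for Hadoop 1.x. The property names are my reading of the 1.x configuration; treat the values as illustrative, not tested.

```xml
<!-- mapred-site.xml (Hadoop 1.x); illustrative values -->

<!-- Out-of-band heartbeats plus the default damper (1000000) can produce
     back-to-back heartbeats, as described in MAPREDUCE-4478. -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>

<!-- Backs CLUSTER_INCREMENT: lowering it from the default 100 stretches
     the interval, Math.max(1000 * clusterSize / increment, 3000) ms. -->
<property>
  <name>mapred.heartbeats.in.second</name>
  <value>5</value>
</property>
```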