Re: Please help me with heartbeat storm

Roland von Herget Thu, 30 May 2013 06:00:52 -0700

Hi Philippe,

thanks a lot, that's the solution. I've disabled
*mapreduce.tasktracker.outofband.heartbeat* and now everything is fine!
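
For reference, disabling the property would look roughly like the fragment
below. This is only a sketch: it assumes the setting lives in mapred-site.xml
on each TaskTracker node and that the daemons are restarted to pick it up.
Neither detail is stated in this thread.

```xml
<!-- Assumed location: mapred-site.xml on each TaskTracker node -->
<!-- "false" turns off out-of-band heartbeats -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>
```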


Thanks again,
Roland


On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret <
[email protected]> wrote:

> This might be relevant:
> https://issues.apache.org/jira/browse/MAPREDUCE-4478
>
> "There are two configuration items to control the TaskTracker's heartbeat
> interval. One is *mapreduce.tasktracker.outofband.heartbeat*. The other is
> *mapreduce.tasktracker.outofband.heartbeat.damper*. If we set
> *mapreduce.tasktracker.outofband.heartbeat* with true and set
> *mapreduce.tasktracker.outofband.heartbeat.damper* with default value
> (1000000), TaskTracker may send heartbeat without any interval."
>
>
> Philippe
>
> -------------------------------
> *Philippe Signoret*
>
>
> On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <
> [email protected]> wrote:
>
>> The default value of CLUSTER_INCREMENT is 100. With 29 nodes,
>> Math.max(1000 * 29 / 100, 3000) = Math.max(290, 3000) = 3000, always.
>> This is the reason why you are seeing so many heartbeats.
>> *You might want to set it to 1 or 5.* This would increase the time taken
>> to send the heartbeat from TT to JT.
>>
>>
>> ~Rajesh.B
>>
>>
>> On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <
>> [email protected]> wrote:
>>
>>>  Hi!
>>>
>>> I tried 5 seconds. Fewer nodes get into the storm, but some still do.
>>> Additionally, updating the ntp service helped a little.
>>>
>>> Initially almost 50% of the nodes started storming on each MR job. After
>>> the ntp update and increasing the heartbeat to 5 seconds, the level is
>>> around 10%.
>>>
>>>
>>> On 26/05/13 10:43, murali adireddy wrote:
>>>
>>> Hi ,
>>>
>>>  Just try this one.
>>>
>>>  In the file "hdfs-site.xml", try adding the property
>>> "dfs.heartbeat.interval" shown below, with its value in seconds.
>>>
>>>  The default value is '3' seconds. In your case, increase the value.
>>>
>>>  <property>
>>>    <name>dfs.heartbeat.interval</name>
>>>    <value>3</value>
>>>  </property>
>>>
>>>  You can find more properties and their default values at the link below.
>>>
>>>
>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>>
>>>
>>>  Please let me know if the above solution worked for you.
>>>
>>>
>>>
>>>
>>>  On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>> I have a 29-server Hadoop cluster in an almost default configuration.
>>>> After installing Hadoop 1.0.4 I noticed that the JT and some TTs waste
>>>> CPU.
>>>> I started stracing their behaviour and found that some TTs send
>>>> heartbeats without any limit.
>>>> That means hundreds per second.
>>>>
>>>> A daemon restart solves the issue, but even the simplest Hive MR job
>>>> brings the issue back.
>>>>
>>>> Here is the filtered strace of the heartbeating process:
>>>>
>>>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>>>>
>>>>
>>>> [pid  6065] 13:07:34.801106 write(70,
>>>> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>>> 284) = 284
>>>> [pid  6065] 13:07:34.807968 write(70,
>>>> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>>> 284 <unfinished ...>
>>>> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>>>> [pid  6065] 13:07:34.814473 write(70,
>>>> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>>> 284 <unfinished ...>
>>>> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>>>> [pid  6065] 13:07:34.820960 write(70,
>>>> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>>> 284 <unfinished ...>
>>>>
>>>>
>>>> Please help me to stop this storming 8(
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> ~Rajesh.B
>>
>
>
