On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
> Hello!
> 
> I've been seeing heartbeat cluster problems in Linux-based Vyatta and more 
> recent VyOS networking/router appliances.
> These are currently based on Debian Squeeze, and thus are using:
> 
> Package: heartbeat
> Version: 1:3.0.3-2

Please use 3.0.5:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
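
To check whether a node still runs a release older than 3.0.5, here is a minimal Python sketch. The helper names are hypothetical, and it assumes a purely numeric dotted upstream version with an optional Debian epoch and revision, as in "1:3.0.3-2" above. It is not a full Debian version comparison.

```python
def upstream_version(debian_version):
    """Strip an epoch ("1:") and a Debian revision ("-2"), return a tuple of ints.

    Assumption: the upstream part is purely numeric and dot-separated.
    """
    without_epoch = debian_version.split(":", 1)[-1]
    upstream = without_epoch.rsplit("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def needs_upgrade(installed, minimum="3.0.5"):
    """True if the installed upstream version is older than the minimum."""
    return upstream_version(installed) < upstream_version(minimum)


if __name__ == "__main__":
    # Version string taken from the package listing quoted above.
    print(needs_upgrade("1:3.0.3-2"))
```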

> VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
> 
> The problem is that when (unexpected) networking problems cause multicast 
> issues that disrupt inter-cluster communication, the heartbeat processes 
> die on the cluster nodes. That is bad, right? I assume heartbeat should 
> never die, especially not because of temporary networking issues.
> 
> I've also seen heartbeat die during temporary network maintenance breaks.
> 
> Basically, I first see messages like these:
> 
> Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
> Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
> Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
> Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
> Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
> Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
> Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
> Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
> Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
> Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
> 
> These seem normal for a networking problem. But then later:
> 
> Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
> Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
> Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
> Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
> Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
> Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
> Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
> 
> 
> The "hist queue" size keeps increasing, and when it reaches 500 messages, 
> bad things start happening:
> 
> 
> Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
> Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> 
> At this point clustering has failed, because the heartbeat 
> services/processes are no longer running.
> 
> Has anyone else seen this? 

It was fixed years ago ...

> It seems the bug gets triggered at 500 messages in the hist queue,
> and then I always see the "ERROR: lowseq cannnot be greater than ackseq" and 
> then heartbeat dies..
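
To spot this pattern in a syslog before or after heartbeat dies, here is a minimal Python sketch. The function name and the default limit of 500 are assumptions taken from the log excerpt quoted above, not from heartbeat internals. The message strings are copied verbatim from that excerpt, including the "cannnot" spelling.

```python
import re

# Patterns copied from the log excerpt in this thread.
QUEUE_RE = re.compile(
    r"Message hist queue is filling up \((\d+) messages in queue\)"
)
FATAL_MARKERS = (
    "lowseq cannnot be greater than ackseq",
    "Emergency Shutdown",
)


def summarize_heartbeat_log(lines, queue_limit=500):
    """Return (peak queue size, limit reached?, list of fatal lines).

    queue_limit=500 is an assumption based on the observed logs.
    """
    peak = 0
    fatal = []
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            peak = max(peak, int(match.group(1)))
        if any(marker in line for marker in FATAL_MARKERS):
            fatal.append(line.rstrip("\n"))
    return peak, peak >= queue_limit, fatal


SAMPLE = [
    "Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)",
    "Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)",
    "Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq",
    "Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.",
]

if __name__ == "__main__":
    peak, limit_reached, fatal = summarize_heartbeat_log(SAMPLE)
    print(peak, limit_reached, len(fatal))
```

To run it against a real log, pass an open file object (for example the syslog file on the node) instead of SAMPLE, since the function only iterates over lines.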

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
