On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
> Hello!
>
> I've been seeing heartbeat cluster problems in Linux-based Vyatta and more
> recent VyOS networking/router appliances.
> These are currently based on Debian Squeeze, and thus are using:
>
> Package: heartbeat
> Version: 1:3.0.3-2
Please use 3.0.5:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2

> VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
>
> The problem is that when there are (unexpected) networking problems causing
> multicast issues, which cause problems in the inter-cluster communications,
> the heartbeat processes will die on the cluster nodes,
> which is bad, right? I assume heartbeat should never die, especially not
> because of temporary networking issues..
>
> I've also seen heartbeat dying because of temporary network maintenance
> breaks..
>
> Basically, first I'm seeing messages like these:
>
> Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
> Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
> Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
> Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
> Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
> Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
> Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
> Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
> Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
> Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
>
> Which seems normal in the case of a networking problem..
> But then later:
>
> Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
> Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
> Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
> Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
> Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
> Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
> Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
>
>
> The "hist queue" size keeps increasing, and when it gets to 500 messages bad
> things start happening..
>
>
> Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
> Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
> Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
>
> At this point clustering has failed, because the heartbeat services/processes
> aren't running anymore..
>
> Has anyone else seen this?

It was fixed years ago ...
> It seems the bug gets triggered at 500 messages in the hist queue,
> and then I always see the "ERROR: lowseq cannnot be greater than ackseq" and
> then heartbeat dies..

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
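
For anyone wanting to check whether their own nodes show the same failure signature as the logs quoted in this thread (hist queue count climbing, then the lowseq error, then the emergency shutdown), a short script can scan a syslog file for it. This is only a minimal sketch: the function name `scan_heartbeat_log` and the summary keys are hypothetical, and the match strings are copied from the log excerpts above rather than from the heartbeat sources.

```python
import re

# Match strings copied from the log excerpts quoted in this thread
# (assumption: other heartbeat versions may word these differently).
HIST_QUEUE = re.compile(
    r"Message hist queue is filling up \((\d+) messages in queue\)"
)
LOWSEQ = "lowseq cannnot be greater than ackseq"  # spelled as in the log
SHUTDOWN = "Emergency Shutdown"


def scan_heartbeat_log(lines):
    """Summarize the failure signature found in an iterable of log lines.

    Hypothetical helper for illustration; not part of heartbeat.
    """
    summary = {
        "max_hist_queue": 0,
        "lowseq_error": False,
        "emergency_shutdown": False,
    }
    for line in lines:
        match = HIST_QUEUE.search(line)
        if match:
            count = int(match.group(1))
            summary["max_hist_queue"] = max(summary["max_hist_queue"], count)
        if LOWSEQ in line:
            summary["lowseq_error"] = True
        if SHUTDOWN in line:
            summary["emergency_shutdown"] = True
    return summary


if __name__ == "__main__":
    sample = [
        "Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist "
        "queue is filling up (499 messages in queue)",
        "Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist "
        "queue is filling up (500 messages in queue)",
        "Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot "
        "be greater than ackseq",
        "Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency "
        "Shutdown: Master Control process died.",
    ]
    print(scan_heartbeat_log(sample))
```

To run it against a real file, pass an open file object instead of the sample list, for example `scan_heartbeat_log(open(path))` with `path` pointing at wherever your system writes heartbeat messages.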