Lars,

I found a post from someone with almost the same problem as mine. They claim to have tracked it down to a multicast problem, and that shutting down and rebooting both systems resolved it. As this system is in production, I cannot try this until either a scheduled outage in the coming month or an unexpected failure of the current (and only) active node ;-)

I will update this thread at the end of June to let everyone know how it went.

Thanks,
Chris Wilson

On May 21, 2013, at 6:14 PM, "Lars Ellenberg" <lars.ellenb...@linbit.com> wrote:

> On Fri, May 17, 2013 at 03:17:26PM +0000, Wilson, Christopher (IT) wrote:
>> Lars,
>>
>> Thank you for the response. Unfortunately I am in a situation where
>> upgrading Heartbeat is not an option as this cluster is a currently
>> unsupported black box lustre environment from HP.
>
> :-(
>
>> All nodes are locked into a specific HP branded heartbeat RPM package
>> at that revision.
>
> :-((
>
>> The directory does indeed exist and has the correct
>> ownership and permissions. The curious thing is that the strace on
>> heartbeat never mentions these sockets. Essentially this not happening
>> seems to be the cause of CRM and associated processes failing to start
>> because of the socket file /var/run/heartbeat/register not existing.
>
> If you start heartbeat,
> after some initial timeouts,
> it should log "Comm_now_up(): updating status to active",
> immediately after that it will try to unlink -- if exist --
> and recreate those sockets.
> If it fails to create the sockets, it will abort.
>
> So if you have a running heartbeat master control process,
> it has been able to create those sockets during its startup.
>
> Unless, well, "your" heartbeat is different than "my" heartbeat.
> In which case you need to either use my heartbeat,
> or go to those that screwed up yours ;-/
>
>
> Maybe you have masked the sockets by mounting a tmpfs over?
>
> If you start heartbeat first, then mount -t tmpfs tmpfs /var/run/
> later, obviously the sockets will no longer be found...
>
> Or a later "cleanup" by some init script or misguided daemon
> did rm -rf /var/run/* ?
>
> What does "lsof -p <pid of heartbeat master control process>" say?
>
> Lars
>
>> On May 17, 2013, at 11:02 AM, "Lars Ellenberg" <lars.ellenb...@linbit.com>
>> wrote:
>>
>>> On Thu, May 16, 2013 at 08:05:39PM +0000, Wilson, Christopher (IT) wrote:
>>>> I have a heartbeat 2.1.3-1 cluster and it was running fine until a recent
>>>> network outage. Since then one node has been getting errors such as
>>>
>>> You do realize that there is heartbeat 3 and pacemaker?
>>>
>>>> heartbeat: [3824]: ERROR: Message hist queue is filling up (500 messages
>>>> in queue)
>>>
>>> I don't think this ^^^ message has anything to do with
>>> those "missing sockets" below.
>>>
>>>> I have looked through other mailing lists on the internet and have found
>>>> that it most likely stems from missing sockets in /var/run/heartbeat
>>>> (notably /var/run/heartbeat/register)
>>>> I have uninstalled the rpm and re-installed it, rebooted the machine and
>>>> run an strace on the heartbeat process to no avail.
>>>> It appears that heartbeat does not try to create the socket files if they
>>>> are missing.
>>>>
>>>> Could someone help me understand which component of heartbeat is
>>>> responsible for creating socket files?
>>>
>>> Heartbeat (the core process itself) is creating those sockets.
>>> It does not (in that version, anyways) create the *directory*
>>> /var/run/heartbeat.
>>> So you need to put a mkdir in your init script, if you have /var/run on
>>> tmpfs or similar.
>>>
>>> heartbeat 3 has that covered, btw.
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
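
For anyone following along, here is a minimal sketch of the checks suggested in the quoted reply: whether the directory and the register socket exist, and whether something such as a tmpfs has been mounted over /var/run after heartbeat started. The function names check_register_socket and fstype_at are hypothetical helpers, not part of heartbeat, and the sketch assumes a Linux system whose mount table is readable at /proc/mounts. It only reports state; the "lsof -p" check from the thread still has to be run by hand against the master control process.

```sh
#!/bin/sh
# Hypothetical diagnostic helpers; nothing here ships with heartbeat.

# Print the state of the register socket under the given directory
# (the thread's path is /var/run/heartbeat).
check_register_socket() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing-directory"   # heartbeat 2.1.3 does not create it itself
    elif [ -S "$dir/register" ]; then
        echo "socket-present"
    else
        echo "socket-missing"
    fi
}

# Print the filesystem type of the most recent mount at exactly this
# mount point, or "none" if nothing is mounted there.
# $2 is a mount table in /proc/mounts format (default: /proc/mounts);
# the last matching entry is used because later mounts mask earlier ones.
fstype_at() {
    awk -v p="$1" '$2 == p { t = $3 } END { print (t == "" ? "none" : t) }' \
        "${2:-/proc/mounts}"
}

# Example use on the affected node (paths taken from the thread):
#   check_register_socket /var/run/heartbeat
#   fstype_at /var/run
# In an init script, before heartbeat starts, the quoted advice amounts to:
#   mkdir -p /var/run/heartbeat
```

If fstype_at reports tmpfs for /var/run while heartbeat is already running and the socket shows as missing, that would fit the "masked by a later mount" explanation; ownership and permissions on the directory still need to be checked separately.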

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems