On Fri, May 17, 2013 at 03:17:26PM +0000, Wilson, Christopher (IT) wrote:
> Lars,
> 
> Thank you for the response. Unfortunately I am in a situation where
> upgrading Heartbeat is not an option as this cluster is a currently
> unsupported black box lustre environment from HP.

 :-(

> All nodes are locked into a specific HP branded heartbeat RPM package
> at that revision.

:-((

> The directory does indeed exist and has the correct
> ownership and permissions. The curious thing is that the strace on
> heartbeat never mentions these sockets. Essentially this not happening
> seems to be the cause of CRM and associated processes failing to start
> because of the socket file /var/run/heartbeat/register not existing.

If you start heartbeat, then after some initial timeouts
it should log "Comm_now_up(): updating status to active".
Immediately after that, it will try to unlink those sockets
(if they exist) and recreate them.
If it fails to create the sockets, it will abort.

So if you have a running heartbeat master control process,
it has been able to create those sockets during its startup.

Unless, well, "your" heartbeat is different than "my" heartbeat.
In which case you need to either use my heartbeat,
or go to those that screwed up yours ;-/


Maybe you have masked the sockets by mounting a tmpfs over them?

If you start heartbeat first, then mount -t tmpfs tmpfs /var/run/
later, obviously the sockets will no longer be found...
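
One way to check for that is to list every mount that sits at or above
the socket directory. This is only a sketch: mounts_covering is a made-up
helper name, and it assumes a Linux-style /proc/mounts table (the second
argument is there just so it can be pointed at a saved copy instead).

```shell
# Print every mount whose mount point is the given path or one of its parents.
# Hypothetical helper; the table argument defaults to /proc/mounts.
mounts_covering() {
    target=$1
    table=${2:-/proc/mounts}
    while read -r dev mnt fstype rest; do
        # Compare with trailing slashes so /var/run does not match /var/running.
        case "$target/" in
            "${mnt%/}/"*) printf '%s %s %s\n' "$dev" "$mnt" "$fstype" ;;
        esac
    done < "$table"
}
# Example: mounts_covering /var/run/heartbeat
```

If /var/run shows up more than once, or with a tmpfs you did not expect,
the sockets are probably hidden underneath.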

Or maybe a later "cleanup" by some init script or misguided daemon
did rm -rf /var/run/*?

What does "lsof -p <pid of heartbeat master control process>" say?
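
To compare that with what is actually visible in the file system, something
like the following could help. A rough sketch, not a heartbeat tool:
check_sockets is a made-up name, and "register" is simply the one socket
name mentioned in this thread. If lsof shows the process holding a socket
that this reports as missing, the path has most likely been masked or
removed after startup.

```shell
# Report, for each name, whether a socket is visible at dir/name.
# Hypothetical helper; returns non-zero if any name is not a socket.
check_sockets() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ -S "$dir/$name" ]; then
            echo "ok: $dir/$name"
        elif [ -e "$dir/$name" ]; then
            echo "not a socket: $dir/$name"
            missing=1
        else
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}
# Example: check_sockets /var/run/heartbeat register
```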

        Lars

> On May 17, 2013, at 11:02 AM, "Lars Ellenberg" <lars.ellenb...@linbit.com> 
> wrote:
> 
> > On Thu, May 16, 2013 at 08:05:39PM +0000, Wilson, Christopher (IT) wrote:
> >> I have a heartbeat 2.1.3-1 cluster and it was running fine until a recent 
> >> network outage. Since then one node has been getting errors such as
> > 
> > You do realize that there is heartbeat 3 and pacemaker?
> > 
> >> heartbeat: [3824]: ERROR: Message hist queue is filling up (500 messages in queue)
> > 
> > I don't think this ^^^ message has anything to do with
> > those "missing sockets" below.
> > 
> >> I have looked through other mailing lists on the internet and have found 
> >> that it most likely stems from missing sockets in /var/run/heartbeat 
> >> (notably /var/run/heartbeat/register)
> >> I have uninstalled the rpm and re-installed it, rebooted the machine and 
> >> run an strace on the heartbeat process to no avail.
> >> It appears that heartbeat does not try to create the socket files if they 
> >> are missing.
> >> 
> >> Could someone help me understand which component of heartbeat is 
> >> responsible for creating socket files?
> > 
> > Heartbeat (the core process itself) is creating those sockets.
> > It does not (in that version, anyways) create the *directory* 
> > /var/run/heartbeat.
> > So you need to put a mkdir in your init script, if you have /var/run on 
> > tmpfs or similar.
> > 
> > heartbeat 3 has that covered, btw.
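
For reference, the mkdir mentioned in the quoted advice could look roughly
like this near the start action of the init script. A sketch only:
ensure_rundir is a made-up name, and the mode is an assumption; match
ownership and permissions to whatever your heartbeat package expects.

```shell
# Make sure the socket directory exists before heartbeat is started.
# Hypothetical helper; the default path is the one discussed in this thread.
ensure_rundir() {
    dir=${1:-/var/run/heartbeat}
    # mkdir -p succeeds without changes if the directory already exists.
    mkdir -p "$dir" || return 1
    # 0755 is an assumption, not taken from the heartbeat package.
    chmod 0755 "$dir"
}
# In the init script, before starting heartbeat: ensure_rundir
```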

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
