
I found a post from someone with almost the same problem as mine. They claim 
to have tracked it down to a multicast problem, and a shutdown and reboot of 
both systems resolved it. As this system is in production, I am unable 
to try this until either a scheduled outage in the coming month or an 
unexpected failure of the current and only active node ;-) I will update this 
thread at the end of June to let everyone know how it went.


Chris Wilson

On May 21, 2013, at 6:14 PM, "Lars Ellenberg" <> wrote:

> On Fri, May 17, 2013 at 03:17:26PM +0000, Wilson, Christopher (IT) wrote:
>> Lars,
>> Thank you for the response. Unfortunately, I am in a situation where
>> upgrading Heartbeat is not an option, as this cluster is a currently
>> unsupported black-box Lustre environment from HP.
> :-(
>> All nodes are locked into a specific HP branded heartbeat RPM package
>> at that revision.
> :-((
>> The directory does indeed exist and has the correct
>> ownership and permissions. The curious thing is that the strace on
>> heartbeat never mentions these sockets. The sockets never being created
>> seems to be why CRM and its associated processes fail to start:
>> the socket file /var/run/heartbeat/register does not exist.
> If you start heartbeat,
> after some initial timeouts,
> it should log "Comm_now_up(): updating status to active",
> immediately after that it will try to unlink -- if they exist --
> and recreate those sockets.
> If it fails to create the sockets, it will abort.
> So if you have a running heartbeat master control process,
> it has been able to create those sockets during its startup.
> Unless, well, "your" heartbeat is different than "my" heartbeat.
> In which case you need to either use my heartbeat,
> or go to those that screwed up yours ;-/
> Maybe you have masked the sockets by mounting a tmpfs over them?
> If you start heartbeat first, then mount -t tmpfs tmpfs /var/run/
> later, obviously the sockets will no longer be found...
> Or a later "cleanup" by some init script or misguided daemon
> did rm -rf /var/run/* ?
> What does "lsof -p <pid of heartbeat master control process>" say?
>    Lars
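
The checks Lars suggests above can be sketched as a small shell helper. This
is only a sketch: check_socket is a hypothetical name, the socket path
/var/run/heartbeat/register is the one from this thread, and $pid stands for
the PID of the heartbeat master control process, which you have to look up
yourself.

```shell
#!/bin/sh
# Sketch: check whether a heartbeat socket is visible at the path clients use.
# check_socket is a hypothetical helper name, not part of heartbeat.
check_socket() {
    # [ -S ] succeeds only if the path exists and is a UNIX domain socket.
    if [ -S "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Usage on the affected node (run as root):
#   check_socket /var/run/heartbeat/register
#   grep tmpfs /proc/mounts    # is a tmpfs mounted over /var/run?
#   lsof -p "$pid"             # $pid = heartbeat master control process
```

If lsof still lists the sockets for the running process while check_socket
reports them missing, that would point to them being masked or removed after
heartbeat started, as described above.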
>> On May 17, 2013, at 11:02 AM, "Lars Ellenberg" <> 
>> wrote:
>>> On Thu, May 16, 2013 at 08:05:39PM +0000, Wilson, Christopher (IT) wrote:
>>>> I have a heartbeat 2.1.3-1 cluster and it was running fine until a recent 
>>>> network outage. Since then one node has been getting errors such as
>>> You do realize that there is heartbeat 3 and pacemaker?
>>>> heartbeat: [3824]: ERROR: Message hist queue is filling up (500 messages 
>>>> in queue)
>>> I don't think this ^^^ message has anything to do with
>>> those "missing sockets" below.
>>>> I have looked through other mailing lists on the internet and have found 
>>>> that it most likely stems from missing sockets in /var/run/heartbeat 
>>>> (notably /var/run/heartbeat/register)
>>>> I have uninstalled the rpm and re-installed it, rebooted the machine and 
>>>> run an strace on the heartbeat process to no avail.
>>>> It appears that heartbeat does not try to create the socket files if they 
>>>> are missing.
>>>> Could someone help me understand which component of heartbeat is 
>>>> responsible for creating socket files?
>>> Heartbeat (the core process itself) is creating those sockets.
>>> It does not (in that version, anyways) create the *directory* 
>>> /var/run/heartbeat.
>>> So you need to put a mkdir in your init script, if you have /var/run on 
>>> tmpfs or similar.
>>> heartbeat 3 has that covered, btw.
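
For anyone stuck on the old version with /var/run on tmpfs, the mkdir Lars
mentions could look roughly like this in the init script. It is a minimal
sketch: ensure_rundir is a hypothetical helper, and the 0755 mode is an
assumption, so copy the real ownership and permissions from a healthy node.

```shell
#!/bin/sh
# Sketch: recreate the heartbeat runtime directory before the daemon starts.
# ensure_rundir is a hypothetical helper; the default mode is an assumption.
ensure_rundir() {
    dir=$1
    mode=${2:-0755}
    # mkdir -p succeeds whether or not the directory already exists.
    mkdir -p "$dir" || return 1
    chmod "$mode" "$dir" || return 1
}

# In the start) branch of the init script, before launching heartbeat:
#   ensure_rundir /var/run/heartbeat
```

This only creates the directory. Heartbeat itself still creates the sockets
inside it during startup.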
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> See also: