Lars,

I found a post from someone with almost the same problem as mine. They claim 
to have tracked it down to a multicast problem, and a shutdown and reboot of 
both systems resolved it. As this system is in production, I am unable 
to try this until either a scheduled outage in the coming month or an 
unexpected failure of the current and only active node ;-) I will update this 
thread at the end of June to let everyone know how it went.

Thanks

Chris Wilson

On May 21, 2013, at 6:14 PM, "Lars Ellenberg" <lars.ellenb...@linbit.com> wrote:

> On Fri, May 17, 2013 at 03:17:26PM +0000, Wilson, Christopher (IT) wrote:
>> Lars,
>> 
>> Thank you for the response. Unfortunately I am in a situation where
>> upgrading Heartbeat is not an option as this cluster is a currently
>> unsupported black box lustre environment from HP.
> 
> :-(
> 
>> All nodes are locked into a specific HP branded heartbeat RPM package
>> at that revision.
> 
> :-((
> 
>> The directory does indeed exist and has the correct
>> ownership and permissions. The curious thing is that the strace on
>> heartbeat never mentions these sockets. Heartbeat not creating them
>> seems to be why CRM and the associated processes fail to start:
>> the socket file /var/run/heartbeat/register does not exist.
> 
> If you start heartbeat,
> after some initial timeouts,
> it should log "Comm_now_up(): updating status to active",
> immediately after that it will try to unlink -- if they exist --
> and recreate those sockets.
> If it fails to create the sockets, it will abort.
> 
> So if you have a running heartbeat master control process,
> it has been able to create those sockets during its startup.
> 
> Unless, well, "your" heartbeat is different than "my" heartbeat.
> In which case you need to either use my heartbeat,
> or go to those that screwed up yours ;-/
> 
> 
> Maybe you have masked the sockets by mounting a tmpfs over them?
> 
> If you start heartbeat first, then mount -t tmpfs tmpfs /var/run/
> later, obviously the sockets will no longer be found...
> 
> Or a later "cleanup" by some init script or misguided daemon
> did rm -rf /var/run/* ?
> 
> What does "lsof -p <pid of heartbeat master control process>" say?
> 
>    Lars
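
The checks suggested above (a tmpfs mounted over /var/run, a missing socket
file, and lsof on the master control process) could be scripted roughly as
follows. This is only a sketch: the function names are made up for
illustration, and the pid file location /var/run/heartbeat.pid is an
assumption that should be checked against the installed package.

```shell
#!/bin/sh
# Hypothetical helpers; only the paths under /var/run/heartbeat come from this thread.

# Print the filesystem type mounted exactly at directory $1, if any.
# $2 is a mounts table in /proc/mounts format (defaults to /proc/mounts).
# If several filesystems are stacked on the same directory, the last one wins.
mount_fstype() {
    awk -v dir="$1" '$2 == dir { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

# Report whether $1 exists and is a socket file.
check_socket() {
    if [ -S "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

# List the UNIX domain sockets held open by the process with pid $1.
# -a ANDs the selections, -U selects UNIX domain sockets.
list_unix_sockets() {
    lsof -a -p "$1" -U
}

# Example use (assumed pid file path, adjust as needed):
#   mount_fstype /var/run
#   mount_fstype /var/run/heartbeat
#   check_socket /var/run/heartbeat/register
#   list_unix_sockets "$(cat /var/run/heartbeat.pid)"
```

If mount_fstype prints tmpfs for /var/run while lsof still shows heartbeat
holding the sockets open, the sockets exist but are hidden under the later
mount.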
> 
>> On May 17, 2013, at 11:02 AM, "Lars Ellenberg" <lars.ellenb...@linbit.com> wrote:
>> 
>>> On Thu, May 16, 2013 at 08:05:39PM +0000, Wilson, Christopher (IT) wrote:
>>>> I have a heartbeat 2.1.3-1 cluster and it was running fine until a recent 
>>>> network outage. Since then one node has been getting errors such as
>>> 
>>> You do realize that there is heartbeat 3 and pacemaker?
>>> 
>>>> heartbeat: [3824]: ERROR: Message hist queue is filling up (500 messages 
>>>> in queue)
>>> 
>>> I don't think this ^^^ message has anything to do with
>>> those "missing sockets" below.
>>> 
>>>> I have looked through other mailing lists on the internet and have found 
>>>> that it most likely stems from missing sockets in /var/run/heartbeat 
>>>> (notably /var/run/heartbeat/register)
>>>> I have uninstalled the rpm and re-installed it, rebooted the machine and 
>>>> run an strace on the heartbeat process to no avail.
>>>> It appears that heartbeat does not try to create the socket files if they 
>>>> are missing.
>>>> 
>>>> Could someone help me understand which component of heartbeat is 
>>>> responsible for creating socket files?
>>> 
>>> Heartbeat (the core process itself) is creating those sockets.
>>> It does not (in that version, anyways) create the *directory* 
>>> /var/run/heartbeat.
>>> So you need to put a mkdir in your init script, if you have /var/run on 
>>> tmpfs or similar.
>>> 
>>> heartbeat 3 has that covered, btw.
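
For the mkdir in the init script mentioned above, a minimal sketch might look
like this. The function name is hypothetical, and the 0755 mode is an
assumption; ownership is deliberately left alone here and should be copied
from a node where heartbeat starts cleanly.

```shell
# Hypothetical init-script helper: make sure the runtime directory exists
# before heartbeat starts. $1 is the directory, $2 the mode (assumed 0755).
ensure_hb_rundir() {
    dir=${1:-/var/run/heartbeat}
    # Create the directory only if it is missing; fail if mkdir fails.
    [ -d "$dir" ] || mkdir -p "$dir" || return 1
    chmod "${2:-0755}" "$dir"
}

# Example use near the top of the init script's start action:
#   ensure_hb_rundir /var/run/heartbeat || exit 1
```

This only matters when /var/run is recreated empty at boot (tmpfs or
similar); heartbeat itself still creates the socket files inside the
directory.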
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
