On 7 Jan 2014, at 10:52 am, Tracy Reed <tr...@ultraviolet.org> wrote:
> On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
>> Is it possible that it's a coincidence of log rotation after patching? In
>> certain circumstances I've had library replacement or subsequent prelink
>> activity on libraries lead to a crash of some services during log rotation.
>> This hasn't happened to me with pacemaker/cman/corosync, but it might
>> conceivably explain why it only happens to you once in a while.
>
> I just caught the cluster in the middle of crashing again and noticed it had
> a system load of 9. Although it isn't clear why.

See my other reply: "Consider though, what effect 63 IPaddr monitor operations
running at the same time might have on your system." (There is a sketch of
staggering those monitor intervals at the end of this mail.)

> A backup was running, but after the cluster failed over the backup continued
> and the load went to very nearly zero. So it doesn't seem like the backup was
> causing the issue. But the system was noticeably performance-impacted. I've
> never noticed this situation before.
>
> One thing I really need to learn more about is how the cluster knows when
> something has failed and it needs to fail over.

We ask the resources by calling their script with $action=monitor.
For node-level failures, corosync tells us.
(See the monitor example at the end of this mail.)

> I first set up a linux-ha firewall cluster back around 2001 and we used
> simple heartbeat and some scripts to pass around IP addresses and start/stop
> the firewall. It would ping its upstream gateway and communicate with its
> partner via a serial cable. If the active node couldn't ping its upstream,
> it killed the local heartbeat and the partner took over. If the active node
> wasn't sending heartbeats, the passive node took over. Once working it
> stayed working and was much, much simpler than the current arrangement.
>
> I have no idea how the current system actually communicates or what the
> criteria for failover really are.
>
> What are the chances that the box gets overloaded and drops a packet and the
> partner takes over?
>
> What if I had an IP conflict with another box on the network and one of my
> VIP IP addresses didn't behave as expected?
>
> What would any of these look like in the logs? One of my biggest
> difficulties in diagnosing this is that the logs are huge and noisy. It is
> hard to tell what is normal, what is an error, and what is the actual test
> that failed and caused the failover.

(Some log-triage pointers for this are at the end of this mail.)

>> You might take a look at the pacct data in /var/account/ for the time of
>> the crash; it should indicate exit status for the dying process as well as
>> what other processes were started around the same time.
>
> Process accounting wasn't running but /var/log/audit/audit.log is, which has
> the same info. What dying process are we talking about here? I haven't been
> able to identify any processes which died.

I think there was an assumption that your resources were long-running daemons.

>> Yes, you're supposed to switch to cman. Not sure if it's related to your
>> problem, tho.
>
> I suspect the cman issue is unrelated

Quite likely.

> so I'm not going to mess with it until I get the current issue figured out.
> I've had two more crashes since I started this thread: one around 3am and
> one just this afternoon around 1pm. A backup was running, but after the
> cluster failed over the backup kept running and the load returned to normal
> (practically zero).
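
To make the monitor mechanism above concrete, here is a minimal sketch of the
call Pacemaker's lrmd makes on each monitor interval, run by hand. The agent
path is the stock location on RHEL/CentOS; the IP value is made up for
illustration:

    # Invoke the IPaddr agent's monitor action directly, the same call
    # the cluster makes on every monitor interval (hypothetical VIP):
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_ip=192.0.2.10 \
    /usr/lib/ocf/resource.d/heartbeat/IPaddr monitor
    echo $?   # 0 = running, 7 = cleanly not running, anything else = failure

A non-zero/non-7 exit status here is exactly what the cluster counts as a
resource failure and reacts to.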
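
On the 63-monitors point: if every VIP was created with the same monitor
interval, the operations tend to fire in lockstep, which could produce the
kind of load spike you saw. A sketch, with hypothetical resource names and
addresses, of giving each primitive a slightly different interval via the
crm shell:

    # Staggered intervals keep the IPaddr monitors from all running
    # in the same second (names and IPs are made up):
    crm configure primitive vip1 ocf:heartbeat:IPaddr \
        params ip=192.0.2.1 op monitor interval=30s timeout=20s
    crm configure primitive vip2 ocf:heartbeat:IPaddr \
        params ip=192.0.2.2 op monitor interval=31s timeout=20s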
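
Re the log-noise question: rather than reading the raw logs end to end, start
from the failure counters Pacemaker keeps per resource and work backwards to
the matching timestamps. Both commands are stock tools; the log path may
differ on your install:

    crm_mon -1 -f    # one-shot cluster status with per-resource fail counts
    grep -iE 'error|fail' /var/log/messages   # then read around those times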
> --
> Tracy Reed
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems