On 7 Jan 2014, at 10:52 am, Tracy Reed <tr...@ultraviolet.org> wrote:
> On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
>> Is it possible that it's a coincidence of log rotation after patching? In
>> certain circumstances I've had library replacement or subsequent prelink
>> activity on libraries lead to a crash of some services during log rotation.
>> This hasn't happened to me with pacemaker/cman/corosync, but it might
>> conceivably explain why it only happens to you once in a while.
>
> I just caught the cluster in the middle of crashing again and noticed it had
> a system load of 9. Although it isn't clear why.

See my other reply: "Consider though, what effect 63 IPaddr monitor operations
running at the same time might have on your system." (There is a sketch of
staggering those monitor intervals at the end of this mail.)

> A backup was running, but after the cluster failed over the backup continued
> and the load went to very nearly zero. So it doesn't seem like the backup was
> causing the issue. But the system was noticeably performance-impacted. I've
> never noticed this situation before.
>
> One thing I really need to learn more about is how the cluster knows when
> something has failed and it needs to fail over.

We ask the resources by calling their script with $action=monitor.
For node-level failures, corosync tells us.
(See the monitor example at the end of this mail.)

> I first set up a linux-ha firewall cluster back around 2001 and we used
> simple heartbeat and some scripts to pass around IP addresses and start/stop
> the firewall. It would ping its upstream gateway and communicate with its
> partner via a serial cable. If the active node couldn't ping its upstream,
> it killed the local heartbeat and the partner took over. If the active node
> wasn't sending heartbeats, the passive node took over. Once working it
> stayed working and was much, much simpler than the current arrangement.
>
> I have no idea how the current system actually communicates or what the
> criteria for failover really are.
>
> What are the chances that the box gets overloaded and drops a packet and the
> partner takes over?
>
> What if I had an IP conflict with another box on the network and one of my
> VIP IP addresses didn't behave as expected?
>
> What would any of these look like in the logs? One of my biggest
> difficulties in diagnosing this is that the logs are huge and noisy. It is
> hard to tell what is normal, what is an error, and what is the actual test
> that failed and caused the failover.

(Some log-triage pointers for this are at the end of this mail.)

>> You might take a look at the pacct data in /var/account/ for the time of
>> the crash; it should indicate exit status for the dying process as well as
>> what other processes were started around the same time.
>
> Process accounting wasn't running but /var/log/audit/audit.log is, which has
> the same info. What dying process are we talking about here? I haven't been
> able to identify any processes which died.

I think there was an assumption that your resources were long-running daemons.

>> Yes, you're supposed to switch to cman. Not sure if it's related to your
>> problem, tho.
>
> I suspect the cman issue is unrelated

Quite likely.

> so I'm not going to mess with it until I get the current issue figured out.
> I've had two more crashes since I started this thread: one around 3am and
> one just this afternoon around 1pm. A backup was running, but after the
> cluster failed over the backup kept running and the load returned to normal
> (practically zero).
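
To make the monitor mechanism above concrete, here is a minimal sketch of the
call Pacemaker's lrmd makes on each monitor interval, run by hand. The agent
path is the stock location on RHEL/CentOS; the IP value is made up for
illustration:

    # Invoke the IPaddr agent's monitor action directly, the same call
    # the cluster makes on every monitor interval (hypothetical VIP):
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_ip=192.0.2.10 \
    /usr/lib/ocf/resource.d/heartbeat/IPaddr monitor
    echo $?   # 0 = running, 7 = cleanly not running, anything else = failure

A non-zero/non-7 exit status here is exactly what the cluster counts as a
resource failure and reacts to.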
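
On the 63-monitors point: if every VIP was created with the same monitor
interval, the operations tend to fire in lockstep, which could produce the
kind of load spike you saw. A sketch, with hypothetical resource names and
addresses, of giving each primitive a slightly different interval via the
crm shell:

    # Staggered intervals keep the IPaddr monitors from all running
    # in the same second (names and IPs are made up):
    crm configure primitive vip1 ocf:heartbeat:IPaddr \
        params ip=192.0.2.1 op monitor interval=30s timeout=20s
    crm configure primitive vip2 ocf:heartbeat:IPaddr \
        params ip=192.0.2.2 op monitor interval=31s timeout=20s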
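
Re the log-noise question: rather than reading the raw logs end to end, start
from the failure counters Pacemaker keeps per resource and work backwards to
the matching timestamps. Both commands are stock tools; the log path may
differ on your install:

    crm_mon -1 -f    # one-shot cluster status with per-resource fail counts
    grep -iE 'error|fail' /var/log/messages   # then read around those times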
> --
> Tracy Reed
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems