Re: [Linux-HA] FW cluster fails at 4am
On 7 Jan 2014, at 10:52 am, Tracy Reed wrote:

> On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
>> Is it possible that it's a coincidence of log rotation after patching? In
>> certain circumstances i've had library replacement or subsequent prelink
>> activity on libraries lead to a crash of some services during log rotation.
>> This hasn't happened to me with pacemaker/cman/corosync, but it might
>> conceivably explain why it only happens to you once in a while.
>
> I just caught the cluster in the middle of crashing again and noticed it had a
> system load of 9, although it isn't clear why.

See my other reply: "Consider though, what effect 63 IPaddr monitor operations
running at the same time might have on your system."

> A backup was running, but after the cluster failed over the backup continued
> and the load went to very nearly zero. So it doesn't seem like the backup was
> causing the issue. But the system was noticeably performance impacted. I've
> never noticed this situation before.
>
> One thing I really need to learn more about is how the cluster knows when
> something has failed and it needs to fail over.

We ask the resources by calling their script with $action=monitor.
For node-level failures, corosync tells us.

> I first set up a linux-ha firewall cluster back around 2001 and we used simple
> heartbeat and some scripts to pass around IP addresses and start/stop the
> firewall. It would ping its upstream gateway and communicate with its partner
> via a serial cable. If the active node couldn't ping its upstream it killed
> the local heartbeat and the partner took over. If the active node wasn't
> sending heartbeats the passive node took over. Once working it stayed working
> and was much, much simpler than the current arrangement.
>
> I have no idea how the current system actually communicates or what the
> criteria for failover really are.
>
> What are the chances that the box gets overloaded and drops a packet and the
> partner takes over?
>
> What if I had an IP conflict with another box on the network and one of my VIP
> IP addresses didn't behave as expected?
>
> What would any of these look like in the logs? One of my biggest difficulties
> in diagnosing this is that the logs are huge and noisy. It is hard to tell
> what is normal, what is an error, and what is the actual test that failed and
> caused the failover.
>
>> You might take a look at the pacct data in /var/account/ for the time of the
>> crash; it should indicate exit status for the dying process as well as what
>> other processes were started around the same time.
>
> Process accounting wasn't running, but /var/log/audit/audit.log is, which has
> the same info. What dying process are we talking about here? I haven't been
> able to identify any processes which died.

I think there was an assumption that your resources were long-running daemons.

>> Yes, you're supposed to switch to cman. Not sure if it's related to your
>> problem, tho.
>
> I suspect the cman issue is unrelated

Quite likely.

> so I'm not going to mess with it until I get the current issue figured out.
> I've had two more crashes since I started this thread: one around 3am and one
> just this afternoon around 1pm. A backup was running, but after the cluster
> failed over the backup kept running and the load returned to normal
> (practically zero).
>
> --
> Tracy Reed
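For readers wondering what "calling their script with $action=monitor" looks
like in practice, here is a rough sketch; the 192.0.2.1 address and eth0 are
illustrative placeholders, not values taken from this cluster:

  # LSB resource (shorewall): a monitor is essentially the init script's
  # status action; 0 means running, 3 means stopped.
  /etc/init.d/shorewall status; echo "rc=$?"

  # OCF resource (IPaddr2): the agent is invoked with "monitor" as its action,
  # with its parameters passed as OCF_RESKEY_* environment variables, roughly:
  OCF_ROOT=/usr/lib/ocf OCF_RESKEY_ip=192.0.2.1 OCF_RESKEY_nic=eth0 \
      /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor; echo "rc=$?"
  # rc 0 = address present, 7 = not running; anything else counts as a failure.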
Re: [Linux-HA] FW cluster fails at 4am
On 28 Dec 2013, at 3:34 pm, Tracy Reed wrote:

> Hello all,
>
> First, thanks in advance for any help anyone may provide. I've been battling
> this problem off and on for months and it is driving me mad:
>
> Once every week or two my cluster fails. For reasons unknown it seems to
> initiate a failover, and then the shorewall service (lsb) does not get started
> (or is stopped). The majority of the time it happens just after 4am, although
> it has happened at other times, much less frequently. Tonight I am going to
> have to be up at 4am to poke around on the cluster and observe what is
> happening, if anything.
>
> One theory is some sort of resource starvation such as CPU, but I've stress
> tested it and run a backup and a big file copy through the firewall at the
> same time and never get more than 1 core of CPU (almost all due to the backup)
> out of 4 utilized, and nothing interesting happening to pacemaker/resources.
>
> My setup is a bit complicated in that I have 63 IPaddr2 resources plus the
> shorewall resource, plus order and colocation rules to make sure it all sticks
> together and the IPs come up before shorewall.
>
> I am running the latest RHEL/CentOS RPMs in CentOS 6.5:
>
> [root@new-fw1 shorewall]# rpm -qa |grep -i corosync
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64
> [root@new-fw1 shorewall]# rpm -qa |grep -i pacemaker
> pacemaker-1.1.10-14.el6.x86_64
> pacemaker-libs-1.1.10-14.el6.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6.x86_64
> pacemaker-cli-1.1.10-14.el6.x86_64
>
> I am a little concerned about how pacemaker manages the shorewall resource. It
> usually fails to bring up shorewall after a failover event. Shorewall could
> fail to start if the IP addresses shorewall is expecting to be on the
> interfaces are not there yet, but I have dependencies to prevent this from
> ever happening, such as:
>
> order shorewall-after-dmz-gw inf: dmz-gw shorewall
>
> I also wonder if the shorewall init script is properly LSB compatible. It
> wasn't out of the box and I had to make a minor change. But now it does seem
> to be LSB compatible:
>
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
>
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:14 PST 2013
>
> Shorewall is running
> State:Started (Fri Dec 27 04:11:14 PST 2013) from /etc/shorewall/
>
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall stop ; echo "result: $?"
> Shutting down shorewall: [ OK ]
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:48 PST 2013
>
> Shorewall is stopped
> State:Stopped (Fri Dec 27 16:57:47 PST 2013)
>
> result: 3
> [root@new-fw2 ~]# /etc/init.d/shorewall start ; echo "result: $?"
> Starting shorewall:Shorewall is already running
> [ OK ]
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:58:04 PST 2013
>
> Shorewall is running
> State:Started (Fri Dec 27 16:57:53 PST 2013) from /etc/shorewall/
>
> result: 0
>
> So it shouldn't be an LSB issue at this point...
>
> I have a very hard time making heads or tails of the
> /var/log/cluster/corosync.log log files.
> For example, I just had this appear in the log files:
>
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   spider2-eth0-40       (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   spider2-eth0-41       (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   corpsites             (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   dbrw                  (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   mjhdev                (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   datapass1-ssl-eth0-2  (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   datapass2-ssl-eth0-2  (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   datapass2-ssl-eth0-1  (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   datapass2-ssl-eth0    (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com    pengine:   info: LogActions:  Leave   rrdev2                (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.myd
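As an aside on the constraint setup quoted above: with 63 VIPs, one way to
avoid 63 separate order/colocation pairs is to put the IPaddr2 resources into
a group and hang shorewall off the group. A hedged sketch in crm shell syntax,
reusing the resource names visible in the log excerpt (the full member list is
elided, and the constraint names are made up for illustration):

  group fw-vips dmz-gw spider2-eth0-40 spider2-eth0-41
  # ...add the remaining IPaddr2 resources as further group members
  order shorewall-after-vips inf: fw-vips shorewall
  colocation shorewall-with-vips inf: shorewall fw-vips

Note that a group also implies ordering and colocation between the VIPs
themselves, which may or may not be desirable here; this is a sketch of the
idea rather than a drop-in replacement for the existing per-resource rules.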
Re: [Linux-HA] FW cluster fails at 4am
On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
> Is it possible that it's a coincidence of log rotation after patching? In
> certain circumstances i've had library replacement or subsequent prelink
> activity on libraries lead to a crash of some services during log rotation.
> This hasn't happened to me with pacemaker/cman/corosync, but it might
> conceivably explain why it only happens to you once in a while.

I just caught the cluster in the middle of crashing again and noticed it had a
system load of 9, although it isn't clear why. A backup was running, but after
the cluster failed over the backup continued and the load went to very nearly
zero. So it doesn't seem like the backup was causing the issue. But the system
was noticeably performance impacted. I've never noticed this situation before.

One thing I really need to learn more about is how the cluster knows when
something has failed and it needs to fail over. I first set up a linux-ha
firewall cluster back around 2001 and we used simple heartbeat and some scripts
to pass around IP addresses and start/stop the firewall. It would ping its
upstream gateway and communicate with its partner via a serial cable. If the
active node couldn't ping its upstream it killed the local heartbeat and the
partner took over. If the active node wasn't sending heartbeats the passive
node took over. Once working it stayed working and was much, much simpler than
the current arrangement.

I have no idea how the current system actually communicates or what the
criteria for failover really are.

What are the chances that the box gets overloaded and drops a packet and the
partner takes over?

What if I had an IP conflict with another box on the network and one of my VIP
IP addresses didn't behave as expected?

What would any of these look like in the logs? One of my biggest difficulties
in diagnosing this is that the logs are huge and noisy. It is hard to tell what
is normal, what is an error, and what is the actual test that failed and caused
the failover.

> You might take a look at the pacct data in /var/account/ for the time of the
> crash; it should indicate exit status for the dying process as well as what
> other processes were started around the same time.

Process accounting wasn't running, but /var/log/audit/audit.log is, which has
the same info. What dying process are we talking about here? I haven't been
able to identify any processes which died.

> Yes, you're supposed to switch to cman. Not sure if it's related to your
> problem, tho.

I suspect the cman issue is unrelated, so I'm not going to mess with it until I
get the current issue figured out. I've had two more crashes since I started
this thread: one around 3am and one just this afternoon around 1pm. A backup
was running, but after the cluster failed over the backup kept running and the
load returned to normal (practically zero).

--
Tracy Reed
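On the dropped-packet question: node-level failure detection is governed by
the totem timers in /etc/corosync/corosync.conf, so a single lost packet is
retransmitted rather than triggering a failover by itself. The values below
are illustrative only, not the settings from this cluster:

  totem {
          version: 2
          # time (ms) without the token before a node is declared failed;
          # the token is retransmitted several times within this window
          token: 5000
          token_retransmits_before_loss_const: 10
          # membership consensus timeout, conventionally >= 1.2 * token
          consensus: 6000
  }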
Re: [Linux-HA] FW cluster fails at 4am
On 2013-12-28 06:13, Tracy Reed wrote:
> On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:
>> Log rotation tends to run around that time on Red Hat. Check your logrotate
>> configuration. Maybe something is rotating corosync logs and using the wrong
>> signal to start a new log file.
>
> That was actually the first thing I looked at! I found
> /etc/logrotate.d/shorewall and removed it. But that seems to have had no
> effect on the problem. That file has been gone for 3 weeks, the machines
> rebooted (not that it should matter), and the problem has happened several
> times since then. I've searched all over and can't find anything. And it
> doesn't even happen every morning, just every week or two. Hard to nail down
> a real pattern other than usually (not always) 4am.

Is it possible that it's a coincidence of log rotation after patching? In
certain circumstances i've had library replacement or subsequent prelink
activity on libraries lead to a crash of some services during log rotation.
This hasn't happened to me with pacemaker/cman/corosync, but it might
conceivably explain why it only happens to you once in a while.

You might take a look at the pacct data in /var/account/ for the time of the
crash; it should indicate exit status for the dying process as well as what
other processes were started around the same time.

>> Or, if not that, could it be some other cronned task?
>
> These firewall machines are standard CentOS boxes. The stock crons (logrotate
> etc) and a 5 minute nagios passive check are the only things on them as far
> as I can tell. Although I haven't quite figured out what causes logrotate to
> run at 4am. I know it is in /etc/cron.daily/logrotate, but what runs this at
> 4am? Is 4am some special hard-coded time in crond?
>
> I just noticed that there is an /etc/logrotate.d/cman which rotates
> /var/log/cluster/*log. Could this somehow be an issue? I'm running pacemaker
> and corosync but I'm not running cman:
>
> # /etc/init.d/cman status
> cman is not running
>
> Should I be? I don't think it is necessary for this particular kind of
> cluster... But since it isn't running it shouldn't matter.

Yes, you're supposed to switch to cman. Not sure if it's related to your
problem, tho.
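For anyone following up on the pacct suggestion, a rough sketch of enabling
and reading process accounting on CentOS; the package, service, and file names
are the usual psacct defaults and should be treated as assumptions rather than
verified details of this setup:

  yum install psacct
  service psacct start                          # begins writing /var/account/pacct
  lastcomm --file /var/account/pacct corosync   # per-process records; flags show core dumps/signals
  sa -u /var/account/pacct | less               # per-user summary of commands run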
Re: [Linux-HA] FW cluster fails at 4am
On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:
> Log rotation tends to run around that time on Red Hat. Check your logrotate
> configuration. Maybe something is rotating corosync logs and using the wrong
> signal to start a new log file.

That was actually the first thing I looked at! I found
/etc/logrotate.d/shorewall and removed it. But that seems to have had no effect
on the problem. That file has been gone for 3 weeks, the machines rebooted (not
that it should matter), and the problem has happened several times since then.
I've searched all over and can't find anything. And it doesn't even happen
every morning, just every week or two. Hard to nail down a real pattern other
than usually (not always) 4am.

> Or, if not that, could it be some other cronned task?

These firewall machines are standard CentOS boxes. The stock crons (logrotate
etc) and a 5 minute nagios passive check are the only things on them as far as
I can tell. Although I haven't quite figured out what causes logrotate to run
at 4am. I know it is in /etc/cron.daily/logrotate, but what runs this at 4am?
Is 4am some special hard-coded time in crond?

I just noticed that there is an /etc/logrotate.d/cman which rotates
/var/log/cluster/*log. Could this somehow be an issue? I'm running pacemaker
and corosync but I'm not running cman:

# /etc/init.d/cman status
cman is not running

Should I be? I don't think it is necessary for this particular kind of
cluster... But since it isn't running it shouldn't matter.

Oddly, I just noticed this in my tail -f of the logs (no idea what triggered
it, but I did run /etc/init.d/cman status on the other node) which actually
mentions cman:

Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11103 id=b507c867-cbde-4508-8813-1439720f9c6b
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crm_mon/2, version=0.391.5)
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: crm_compress_string: Compressed 201733 bytes into 12221 (ratio 16:1) in 69ms
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: crm_client_destroy: Destroying 0 events
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11105 id=7e95fe48-ca73-4f29-b2b7-e43596fab588
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/cibadmin/2, version=0.391.5)
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: crm_compress_string: Compressed 201732 bytes into 12220 (ratio 16:1) in 61ms
Dec 27 21:55:05 [1541] new-fw2.mydomain.com    cib:   info: crm_client_destroy: Destroying 0 events
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 27 21:55:07 corosync [pcmk ] info: process_ais_conf: Reading configure
Dec 27 21:55:07 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Dec 27 21:55:07 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
Dec 27 21:55:07 corosync [pcmk ] info: config_find_init: Local handle: 7178156903111852040 for logging
Dec 27 21:55:07 corosync [pcmk ] info: config_find_next: Processing additional logging options...
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_logfile
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found '/var/log/cluster/corosync.log' for option: logfile
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
Dec 27 21:55:07 corosync [pcmk ] info: config_find_init: Local handle: 5773499849093677065 for quorum
Dec 27 21:55:07 corosync [pcmk ] info: config_find_next: No additional configuration supplied for: quorum
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: No default for option: provider
Dec 27 21:55:07 corosync [pcmk ] info: config_find_init: Local handle: 7711695921217536010 for service
Dec 27 21:55:07 corosync [pcmk ] info: config_find_next: Processing additional service options...
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found '1' for option: ver
Dec 27 21:55:07 corosync [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Defaulting to
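On the question above of what runs the daily jobs at 4am: on CentOS 5 the
times were literally hard-coded in /etc/crontab (02 4 * * * root run-parts
/etc/cron.daily), while on CentOS 6 cronie-anacron drives them from
/etc/anacrontab. The stock file looks roughly like the sketch below (quoted
from memory as an assumption, so worth checking against the actual file):

  # /etc/anacrontab (approximate stock CentOS 6 contents)
  RANDOM_DELAY=45
  START_HOURS_RANGE=3-22
  #period  delay(min)  job-identifier  command
  1        5           cron.daily      nice run-parts /etc/cron.daily
  7        25          cron.weekly     nice run-parts /etc/cron.weekly
  @monthly 45          cron.monthly    nice run-parts /etc/cron.monthly

With those defaults the daily run, and therefore logrotate, starts somewhere
in the 3-4am window, which would line up with failures seen at 3am and
"usually 4am".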
Re: [Linux-HA] FW cluster fails at 4am
On 2013-12-28 04:34, Tracy Reed wrote:
> First, thanks in advance for any help anyone may provide. I've been battling
> this problem off and on for months and it is driving me mad:
>
> Once every week or two my cluster fails. For reasons unknown it seems to
> initiate a failover and then the shorewall service (lsb) does not get started
> (or is stopped). The majority of the time it happens just after 4am. Although
> it has happened at other times, although much less frequently. Tonight I am
> going to have to be up at 4am to poke around on the cluster and observe what
> is happening, if anything.
[snip]

Log rotation tends to run around that time on Red Hat. Check your logrotate
configuration. Maybe something is rotating corosync logs and using the wrong
signal to start a new log file.

Or, if not that, could it be some other cronned task?
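If a rotation of the cluster logs does turn out to be involved, one way to
take the "wrong signal" possibility out of play is to have logrotate copy and
truncate the file in place instead of moving it and notifying the daemon. A
hedged sketch of such a stanza; the file name is hypothetical and the glob
simply mirrors the /etc/logrotate.d/cman entry mentioned elsewhere in this
thread:

  # /etc/logrotate.d/cluster -- illustrative only
  /var/log/cluster/*log {
      weekly
      missingok
      notifempty
      compress
      copytruncate    # copy, then truncate in place; no signal sent to corosync
  }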