Re: [Linux-HA] FW cluster fails at 4am

2014-01-07 Thread Andrew Beekhof

On 7 Jan 2014, at 10:52 am, Tracy Reed wrote:

> On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
>> Is it possible that it's a coincidence of log rotation after patching? In
>> certain circumstances i've had library replacement or subsequent prelink
>> activity on libraries lead to a crash of some services during log rotation.
>> This hasn't happened to me with pacemaker/cman/corosync, but it might
>> conceivably explain why it only happens to you once in a while.
> 
> I just caught the cluster in the middle of crashing again and noticed it had a
> system load of 9, though it isn't clear why.

See my other reply:

"Consider though, what effect 63 IPaddr monitor operations running at the same 
time might have on your system."
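
If those simultaneous monitors turn out to be the culprit, one mitigation is to
give the address resources a longer monitor interval so they are not all polled
at the default cadence. A rough sketch in crm shell syntax; the resource name,
address, netmask, nic and interval below are placeholders, not taken from the
actual configuration:

# Sketch only: define a VIP with a 60s monitor instead of the default.
crm configure primitive example-vip ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.40 cidr_netmask=24 nic=eth0 \
    op monitor interval=60s timeout=20s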

> A backup was running, but after
> the cluster failed over the backup continued and the load went to very nearly
> zero, so it doesn't seem like the backup was causing the issue. But system
> performance was noticeably impacted. I've never noticed this situation before.
> 
> One thing I really need to learn more about is how the cluster knows when
> something has failed and it needs to fail over.

We ask the resources by calling their script with $action=monitor.
For node-level failures, corosync tells us.
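
For an LSB resource such as shorewall that monitor is simply the init script's
status action; for an OCF agent the action name is the first argument and the
parameters arrive as OCF_RESKEY_* environment variables. A hand-run
approximation of what the cluster executes (stock CentOS paths; the IP below is
a placeholder):

# LSB resource: status exit code 0 means running, 3 means stopped.
/etc/init.d/shorewall status; echo "exit: $?"

# OCF resource (e.g. IPaddr2): action as argument, parameters via environment.
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_ip=192.0.2.40 \
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor; echo "exit: $?"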

> I first set up a linux-ha firewall cluster back around 2001 and we used simple
> heartbeat and some scripts to pass around IP addresses and start/stop the
> firewall. It would ping its upstream gateway and communicate with its partner
> via a serial cable. If the active node couldn't ping its upstream, it killed
> the local heartbeat and the partner took over. If the active node wasn't
> sending heartbeats, the passive node took over. Once working it stayed
> working, and was much, much simpler than the current arrangement.
> 
> I have no idea how the current system actually communicates or what the
> criteria for failover really are.
> 
> What are the chances that the box gets overloaded and drops a packet and the
> partner takes over?
> 
> What if I had an IP conflict with another box on the network and one of my VIP
> IP addresses didn't behave as expected?
> 
> What would any of these look like in the logs? One of my biggest difficulties
> in diagnosing this is that the logs are huge and noisy. It is hard to tell
> what is normal, what is an error, and what is the actual test that failed and
> caused the failover.
> 
>> You might take a look at the pacct data in /var/account/ for the time of the
>> crash; it should indicate exit status for the dying process as well as what
>> other processes were started around the same time.
> 
> Process accounting wasn't running, but /var/log/audit/audit.log is, which has
> the same info. What dying process are we talking about here? I haven't been
> able to identify any processes which died.

I think there was an assumption that your resources were long-running daemons.

> 
>> Yes, you're supposed to switch to cman. Not sure if it's related to your
>> problem, tho.
> 
> I suspect the cman issue is unrelated

Quite likely.

> so I'm not going to mess with it until I
> get the current issue figured out. I've had two more crashes since I started
> this thread: one around 3am and one just this afternoon around 1pm. A backup
> was running, but after the cluster failed over the backup kept running and the
> load returned to normal (practically zero).
> 
> -- 
> Tracy Reed

Re: [Linux-HA] FW cluster fails at 4am

2014-01-07 Thread Andrew Beekhof

On 28 Dec 2013, at 3:34 pm, Tracy Reed wrote:

> Hello all,
> 
> First, thanks in advance for any help anyone may provide. I've been battling
> this problem off and on for months and it is driving me mad:  
> 
> Once every week or two my cluster fails. For reasons unknown it seems to
> initiate a failover, and then the shorewall service (lsb) does not get started
> (or is stopped). The majority of the time it happens just after 4am, though it
> has happened at other times, much less frequently. Tonight I am going to have
> to be up at 4am to poke around on the cluster and observe what is happening,
> if anything.
> 
> One theory is some sort of resource starvation, such as CPU, but I've
> stress-tested it, running a backup and a big file copy through the firewall at
> the same time, and never got more than 1 of 4 cores utilized (almost all due
> to the backup), with nothing interesting happening to pacemaker/resources.
> 
> My setup is a bit complicated in that I have 63 IPaddr2 resources plus the
> shorewall resource, plus order and colocation rules to make sure it all sticks
> together and the IPs come up before shorewall.
> 
> I am running the latest RHEL/CentOS RPMs in CentOS 6.5:
> 
> [root@new-fw1 shorewall]# rpm -qa |grep -i corosync
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64
> [root@new-fw1 shorewall]# rpm -qa |grep -i pacemaker
> pacemaker-1.1.10-14.el6.x86_64
> pacemaker-libs-1.1.10-14.el6.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6.x86_64
> pacemaker-cli-1.1.10-14.el6.x86_64
> 
> I am a little concerned about how pacemaker manages the shorewall resource. It
> usually fails to bring up shorewall after a failover event. Shorewall could
> fail to start if the IP addresses shorewall is expecting to be on the
> interfaces are not there yet. But I have dependencies to prevent this from
> ever happening, such as:
> 
> order shorewall-after-dmz-gw inf: dmz-gw shorewall
> 
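
For what it's worth, the matching colocation rule would look roughly like this
in crm shell syntax (a sketch; only the dmz-gw and shorewall resource names come
from the configuration quoted above, the constraint id is made up):

# Keep shorewall on whichever node holds the dmz-gw address; together with the
# order rule quoted above this makes shorewall start only after the VIP is up.
crm configure colocation shorewall-with-dmz-gw inf: shorewall dmz-gw
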
> I also wonder if the shorewall init script is properly LSB compatible. It
> wasn't out of the box and I had to make a minor change. But now it does seem
> to be LSB compatible:
> 
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:14 PST 2013
> 
> Shorewall is running
> State: Started (Fri Dec 27 04:11:14 PST 2013) from /etc/shorewall/
> 
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall stop ; echo "result: $?"
> Shutting down shorewall:   [  OK  ]
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:48 PST 2013
> 
> Shorewall is stopped
> State: Stopped (Fri Dec 27 16:57:47 PST 2013)
> 
> result: 3
> [root@new-fw2 ~]# /etc/init.d/shorewall start ; echo "result: $?"
> Starting shorewall: Shorewall is already running
> [  OK  ]
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:58:04 PST 2013
> 
> Shorewall is running
> State: Started (Fri Dec 27 16:57:53 PST 2013) from /etc/shorewall/
> 
> result: 0
> 
> So it shouldn't be an LSB issue at this point...
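
A quick way to keep double-checking the exit codes the cluster relies on is a
small loop over the LSB actions (a sketch; LSB requires status to return 0 while
running and 3 once stopped, and start/stop to return 0):

#!/bin/sh
# Minimal LSB behaviour check for the shorewall init script.
svc=/etc/init.d/shorewall

check() {
    "$svc" "$1" >/dev/null 2>&1
    echo "$1 -> exit $?"
}

check start     # expect 0 (also 0 if already running)
check status    # expect 0 while running
check stop      # expect 0
check status    # expect 3 once stopped
check start     # restore the service; expect 0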
> 
> I have a very hard time making heads or tails of the
> /var/log/cluster/corosync.log log file. For example, I just had this appear
> in the log:
> 
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   spider2-eth0-40 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   spider2-eth0-41 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   corpsites (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   dbrw (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   mjhdev (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   datapass1-ssl-eth0-2 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   datapass2-ssl-eth0-2 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   datapass2-ssl-eth0-1 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   datapass2-ssl-eth0 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave   rrdev2 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.myd

Re: [Linux-HA] FW cluster fails at 4am

2014-01-06 Thread Tracy Reed
On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
> Is it possible that it's a coincidence of log rotation after patching? In
> certain circumstances i've had library replacement or subsequent prelink
> activity on libraries lead to a crash of some services during log rotation.
> This hasn't happened to me with pacemaker/cman/corosync, but it might
> conceivably explain why it only happens to you once in a while.

I just caught the cluster in the middle of crashing again and noticed it had a
system load of 9, though it isn't clear why. A backup was running, but after
the cluster failed over the backup continued and the load went to very nearly
zero, so it doesn't seem like the backup was causing the issue. But system
performance was noticeably impacted. I've never noticed this situation before.

One thing I really need to learn more about is how the cluster knows when
something has failed and it needs to fail over. I first set up a linux-ha
firewall cluster back around 2001 and we used simple heartbeat and some scripts
to pass around IP addresses and start/stop the firewall. It would ping its
upstream gateway and communicate with its partner via a serial cable. If the
active node couldn't ping its upstream, it killed the local heartbeat and the
partner took over. If the active node wasn't sending heartbeats, the passive
node took over. Once working it stayed working, and was much, much simpler than
the current arrangement.

I have no idea how the current system actually communicates or what the
criteria for failover really are.

What are the chances that the box gets overloaded and drops a packet and the
partner takes over?

What if I had an IP conflict with another box on the network and one of my VIP
IP addresses didn't behave as expected?

What would any of these look like in the logs? One of my biggest difficulties
in diagnosing this is that the logs are huge and noisy. It is hard to tell what
is normal, what is an error, and what is the actual test that failed and
caused the failover.
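
One way to cut through the noise is to ask pacemaker directly for failures and
then filter the log for the lines that matter (a sketch; paths are the stock
CentOS 6 locations):

# Current resource state plus fail counts and recent failed operations.
crm_mon --one-shot --failcounts

# Only errors, warnings and transition lines from the big log.
grep -E 'error|warning|Failed|Transition' /var/log/cluster/corosync.log | less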

> You might take a look at the pacct data in /var/account/ for the time of the
> crash; it should indicate exit status for the dying process as well as what
> other processes were started around the same time.

Process accounting wasn't running, but /var/log/audit/audit.log is, which has
the same info. What dying process are we talking about here? I haven't been able
to identify any processes which died.
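
Process accounting is cheap to enable for the next occurrence (a sketch for
CentOS 6, where the psacct package provides it; the command names queried at the
end are just examples):

yum install -y psacct
chkconfig psacct on
service psacct start
# After the next failure, list the commands recorded around that time,
# either everything or a specific command name.
lastcomm | head -50
lastcomm crmd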

> Yes, you're supposed to switch to cman. Not sure if it's related to your
> problem, tho.

I suspect the cman issue is unrelated, so I'm not going to mess with it until I
get the current issue figured out. I've had two more crashes since I started
this thread: one around 3am and one just this afternoon around 1pm. A backup
was running, but after the cluster failed over the backup kept running and the
load returned to normal (practically zero).

-- 
Tracy Reed



Re: [Linux-HA] FW cluster fails at 4am

2013-12-28 Thread Jefferson Ogata

On 2013-12-28 06:13, Tracy Reed wrote:

> On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:
>> Log rotation tends to run around that time on Red Hat. Check your logrotate
>> configuration. Maybe something is rotating corosync logs and using the wrong
>> signal to start a new log file.
> 
> That was actually the first thing I looked at! I found
> /etc/logrotate.d/shorewall and removed it. But that seems to have had no effect
> on the problem. That file has been gone for 3 weeks, the machines rebooted (not
> that it should matter), and the problem has happened several times since then.
> 
> I've searched all over and can't find anything. And it doesn't even happen
> every morning, just every week or two. Hard to nail down a real pattern other
> than usually (not always) 4am.

Is it possible that it's a coincidence of log rotation after patching?
In certain circumstances I've had library replacement or subsequent
prelink activity on libraries lead to a crash of some services during
log rotation. This hasn't happened to me with pacemaker/cman/corosync,
but it might conceivably explain why it only happens to you once in a while.

You might take a look at the pacct data in /var/account/ for the time of
the crash; it should indicate exit status for the dying process as well
as what other processes were started around the same time.

>> Or, if not that, could it be some other cronned task?
> 
> These firewall machines are standard CentOS boxes. The stock crons (logrotate
> etc.) and a 5 minute nagios passive check are the only things on them as far as
> I can tell, although I haven't quite figured out what causes logrotate to run
> at 4am. I know it is run from /etc/cron.daily/logrotate, but what runs that at
> 4am? Is 4am some special hard-coded time in crond?
> 
> I just noticed that there is an /etc/logrotate.d/cman which rotates
> /var/log/cluster/*log. Could this somehow be an issue? I'm running pacemaker
> and corosync but I'm not running cman:
> 
> # /etc/init.d/cman status
> cman is not running
> 
> Should I be? I don't think it is necessary for this particular kind of
> cluster... But since it isn't running it shouldn't matter.

Yes, you're supposed to switch to cman. Not sure if it's related to your
problem, tho.
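
On CentOS 6 the switch is roughly the recipe from chapter 8 of 'Clusters from
Scratch' (a sketch only; the cluster name is a placeholder, the node names are
taken from the logs in this thread, and the old pacemaker plugin stanza in
corosync.conf is no longer used once cman starts corosync):

yum install -y cman ccs
# Describe the two nodes and redirect fencing to pacemaker (fence_pcmk).
ccs -f /etc/cluster/cluster.conf --createcluster fwcluster
ccs -f /etc/cluster/cluster.conf --addnode new-fw1.mydomain.com
ccs -f /etc/cluster/cluster.conf --addnode new-fw2.mydomain.com
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect new-fw1.mydomain.com
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect new-fw2.mydomain.com
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk new-fw1.mydomain.com pcmk-redirect port=new-fw1.mydomain.com
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk new-fw2.mydomain.com pcmk-redirect port=new-fw2.mydomain.com
# Two-node cluster: don't block waiting for quorum at startup.
echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
chkconfig cman on; chkconfig pacemaker on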



Re: [Linux-HA] FW cluster fails at 4am

2013-12-27 Thread Tracy Reed
On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:
> Log rotation tends to run around that time on Red Hat. Check your logrotate
> configuration. Maybe something is rotating corosync logs and using the wrong
> signal to start a new log file.

That was actually the first thing I looked at! I found
/etc/logrotate.d/shorewall and removed it. But that seems to have had no effect
on the problem. That file has been gone for 3 weeks, the machines rebooted (not
that it should matter), and the problem has happened several times since then.

I've searched all over and can't find anything. And it doesn't even happen
every morning, just every week or two. Hard to nail down a real pattern other
than usually (not always) 4am.

> Or, if not that, could it be some other cronned task?

These firewall machines are standard CentOS boxes. The stock crons (logrotate
etc.) and a 5 minute nagios passive check are the only things on them as far as
I can tell, although I haven't quite figured out what causes logrotate to run
at 4am. I know it is run from /etc/cron.daily/logrotate, but what runs that at
4am? Is 4am some special hard-coded time in crond?
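
On a stock CentOS 6 install the daily cron jobs are not hard-wired to 4am:
/etc/cron.d/0hourly runs the hourly jobs, one of which kicks off anacron, and
/etc/anacrontab decides when the cron.daily jobs such as logrotate actually run
(some time after its start window opens, plus a random delay, which can easily
land around 4am). A quick way to see the schedule on a given box (a sketch,
assuming the stock cronie/cronie-anacron packages):

# Where the daily run actually comes from on CentOS 6.
cat /etc/cron.d/0hourly
cat /etc/anacrontab
grep -r cron.daily /etc/crontab /etc/cron.d/ /etc/anacrontab 2>/dev/null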

I just noticed that there is an /etc/logrotate.d/cman which rotates
/var/log/cluster/*log. Could this somehow be an issue? I'm running pacemaker and
corosync but I'm not running cman:

# /etc/init.d/cman status
cman is not running

Should I be? I don't think it is necessary for this particular kind of
cluster... But since it isn't running it shouldn't matter.

Oddly, I just noticed this in my tail -f of the logs (no idea what triggered
it, but I did run /etc/init.d/cman status on the other node) which actually
mentions cman:

Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11103 id=b507c867-cbde-4508-8813-1439720f9c6b
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crm_mon/2, version=0.391.5)
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_compress_string: Compressed 201733 bytes into 12221 (ratio 16:1) in 69ms
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_destroy: Destroying 0 events
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11105 id=7e95fe48-ca73-4f29-b2b7-e43596fab588
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/cibadmin/2, version=0.391.5)
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_compress_string: Compressed 201732 bytes into 12220 (ratio 16:1) in 61ms
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_destroy: Destroying 0 events
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 27 21:55:07 corosync [pcmk  ] info: process_ais_conf: Reading configure
Dec 27 21:55:07 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Dec 27 21:55:07 corosync [pcmk  ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
Dec 27 21:55:07 corosync [pcmk  ] info: config_find_init: Local handle: 7178156903111852040 for logging
Dec 27 21:55:07 corosync [pcmk  ] info: config_find_next: Processing additional logging options...
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Found 'off' for option: debug
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Found 'yes' for option: to_logfile
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Found '/var/log/cluster/corosync.log' for option: logfile
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Found 'yes' for option: to_syslog
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
Dec 27 21:55:07 corosync [pcmk  ] info: config_find_init: Local handle: 5773499849093677065 for quorum
Dec 27 21:55:07 corosync [pcmk  ] info: config_find_next: No additional configuration supplied for: quorum
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: No default for option: provider
Dec 27 21:55:07 corosync [pcmk  ] info: config_find_init: Local handle: 7711695921217536010 for service
Dec 27 21:55:07 corosync [pcmk  ] info: config_find_next: Processing additional service options...
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Found '1' for option: ver
Dec 27 21:55:07 corosync [pcmk  ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
Dec 27 21:55:07 corosync [pcmk  ] info: get_config_opt: Defaulting to 

Re: [Linux-HA] FW cluster fails at 4am

2013-12-27 Thread Jefferson Ogata

On 2013-12-28 04:34, Tracy Reed wrote:

> First, thanks in advance for any help anyone may provide. I've been battling
> this problem off and on for months and it is driving me mad:
> 
> Once every week or two my cluster fails. For reasons unknown it seems to
> initiate a failover, and then the shorewall service (lsb) does not get started
> (or is stopped). The majority of the time it happens just after 4am, though it
> has happened at other times, much less frequently. Tonight I am going to have
> to be up at 4am to poke around on the cluster and observe what is happening,
> if anything.
> 
> [snip]

Log rotation tends to run around that time on Red Hat. Check your 
logrotate configuration. Maybe something is rotating corosync logs and 
using the wrong signal to start a new log file.


Or, if not that, could it be some other cronned task?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems