Hello all,

I'm seeing unexpected CARP failovers upon rebooting what's expected to
be the backup firewall in a pair, which seems to have been introduced in
OpenBSD 5.1. A description of my setup follows...

enrhfw1 and enrhfw2 are a pair of OpenBSD 5.1-RELEASE systems (eg:
OpenBSD enrhfw2.datcon.co.uk 5.1 GENERIC.MP#207 amd64) using CARP and
PFSync, with preemption enabled, and the pfsync syncdev being a
dedicated crossover cable between the two (recommended by the FAQ as
being best-practice, and therefore presumably a common setup). All CARP
interfaces on enrhfw1 have a configured advskew of 0, and on enrhfw2
they have an advskew of 100 (so under normal conditions, enrhfw1 will be
master and enrhfw2 will be backup).

Here's what happens when I reboot enrhfw2, since upgrading them both to
OpenBSD 5.1. These are the relevant logs from both, with
net.inet.carp.log=3:

[Looks like the syncdev interface status on enrhfw1 cycles 3 times while
the server being rebooted has its BIOS splash screen up. The first of
these triggers enrhfw1 to request a pfsync bulk update from its
currently-rebooting peer]:

May 28 14:02:48 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 1 
(pfsyncdev)
May 28 14:02:48 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 1 
(pfsyncdev)
May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 
(pfsync 
bulk start)
May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 
(pfsync bulk start)
May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 
(pfsyncdev)
May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 
(pfsyncdev)
May 28 14:02:54 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 
(pfsyncdev)
May 28 14:02:54 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 
(pfsyncdev)
May 28 14:02:57 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 
(pfsyncdev)
May 28 14:02:57 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 
(pfsyncdev)
May 28 14:03:03 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 
(pfsyncdev)
May 28 14:03:03 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 
(pfsyncdev)
May 28 14:03:05 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 
(pfsyncdev)
May 28 14:03:05 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 
(pfsyncdev)

[This last cycle appears to be when the OpenBSD kernel loads and the
syncdev on enrhfw2 is then initialized by netstart.sh]:

May 28 14:03:40 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 
(pfsyncdev)
May 28 14:03:40 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 
(pfsyncdev)
May 28 14:03:45 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 
(pfsyncdev)
May 28 14:03:45 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 
(pfsyncdev)

[Here the rc scripts on enrhfw2 initialize all its interfaces, and
request a pfsync bulk update from enrhfw1 (although note that at this
stage enrhfw1 is still in the process of attempting a bulk update in the
reverse direction!). The rc script's "carp interlock" is in place at
this stage]:

May 28 14:03:45 enrhfw2 /bsd: carp: carp0 demoted group carp by 1 to 129 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp1 demoted group carp by 1 to 130 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp2 demoted group carp by 1 to 131 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp3 demoted group carp by 1 to 132 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp4 demoted group carp by 1 to 133 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp5 demoted group carp by 1 to 134 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: pfsync0 demoted group carp by 1 to 135 
(pfsync bulk start)
May 28 14:03:45 enrhfw2 /bsd: carp: pfsync0 demoted group pfsync by 1 to 1 
(pfsync bulk start)
May 28 14:03:45 enrhfw2 /bsd: carp: carp5 demoted group carp by -1 to 134 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp5 demoted group smartph by -1 to 0 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp0 demoted group carp by -1 to 133 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp0 demoted group int by -1 to 0 (carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp1 demoted group carp by -1 to 132 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp1 demoted group pubdmz by -1 to 0 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp3 demoted group carp by -1 to 131 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp3 demoted group extmgmt by -1 to 0 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp4 demoted group carp by -1 to 130 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp4 demoted group trext by -1 to 0 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp2 demoted group carp by -1 to 129 
(carpdev)
May 28 14:03:45 enrhfw2 /bsd: carp: carp2 demoted group ext by -1 to 0 (carpdev)
May 28 14:03:55 enrhfw2 /bsd: carp: pfsync0 demoted group carp by -1 to 128 
(pfsync bulk done)
May 28 14:03:55 enrhfw2 /bsd: carp: pfsync0 demoted group pfsync by -1 to 0 
(pfsync bulk done)

[Presumably the rc scripts on enrhfw2 complete at 14:04:03, such that
the carp interlock is removed, this results in the expected backup
system assuming master status, as enrhfw1 is still demoted due to the
bulk update]:

May 28 14:04:03 enrhfw2 /bsd: carp0: state transition: BACKUP -> MASTER
May 28 14:04:03 enrhfw2 /bsd: carp5: state transition: BACKUP -> MASTER
May 28 14:04:03 enrhfw2 /bsd: carp2: state transition: BACKUP -> MASTER
May 28 14:04:03 enrhfw2 /bsd: carp3: state transition: BACKUP -> MASTER
May 28 14:04:03 enrhfw2 /bsd: carp4: state transition: BACKUP -> MASTER
May 28 14:04:03 enrhfw2 /bsd: carp1: state transition: BACKUP -> MASTER

May 28 14:04:03 enrhfw1 /bsd: carp0: state transition: MASTER -> BACKUP
May 28 14:04:03 enrhfw1 /bsd: carp5: state transition: MASTER -> BACKUP
May 28 14:04:03 enrhfw1 /bsd: carp2: state transition: MASTER -> BACKUP
May 28 14:04:03 enrhfw1 /bsd: carp3: state transition: MASTER -> BACKUP
May 28 14:04:03 enrhfw1 /bsd: carp1: state transition: MASTER -> BACKUP
May 28 14:04:03 enrhfw1 /bsd: carp4: state transition: MASTER -> BACKUP

[Finally the bulk update on enrhfw1 completes many minutes later, and
it un-demotes itself and returns to master status]:

May 28 14:07:16 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 0 
(pfsync bulk done)
May 28 14:07:16 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 0 
(pfsync bulk done)
May 28 14:07:16 enrhfw1 /bsd: carp0: state transition: BACKUP -> MASTER
May 28 14:07:16 enrhfw1 /bsd: carp5: state transition: BACKUP -> MASTER
May 28 14:07:16 enrhfw1 /bsd: carp2: state transition: BACKUP -> MASTER
May 28 14:07:16 enrhfw1 /bsd: carp3: state transition: BACKUP -> MASTER
May 28 14:07:16 enrhfw1 /bsd: carp4: state transition: BACKUP -> MASTER
May 28 14:07:16 enrhfw1 /bsd: carp1: state transition: BACKUP -> MASTER

So essentially, the backup system becomes master shortly after it is
rebooted, and stays that way for a couple of minutes. That doesn't seem
like expected behaviour to me (and I don't think it did that prior to
upgrading from OpenBSD 5.0 to 5.1). My guess is this was introduced by
the changes to if_pfsync.c in revision 1.180. Is this in fact the
intended behaviour now, or am I missing some required configuration to
ensure this doesn't happen?

This doesn't actually cause any problems by itself, other than not being
what I'd expect. However on one of my other pairs of OpenBSD systems
(which differs from the pair described here only slightly), the
secondary often seems to take over before it's in a suitable state to do
so. That's quite probably an entirely different issue (which I'm only
partway to tracking down), but the fact that a failover event now always
occurs upon reboot of the secondary system really exacerbates that. I've
deliberately not detailed that other problem for now, to avoid confusing
matters regarding this first observation, but will do so afterwards.
Also note that I do have a workaround for both these problems (namely,
artificially extending the CARP interlock by sleeping in /etc/rc.local
until having detected that the peer is not demoted).

Thanks for any insight,
Alex

Reply via email to