Hello all, I'm seeing unexpected CARP failovers upon rebooting what's expected to be the backup firewall in a pair, which seems to have been introduced in OpenBSD 5.1. A description of my setup follows...
enrhfw1 and enrhfw2 are a pair of OpenBSD 5.1-RELEASE systems (eg: OpenBSD enrhfw2.datcon.co.uk 5.1 GENERIC.MP#207 amd64) using CARP and PFSync, with preemption enabled, and the pfsync syncdev being a dedicated crossover cable between the two (recommended by the FAQ as being best-practice, and therefore presumably a common setup). All CARP interfaces on enrhfw1 have a configured advskew of 0, and on enrhfw2 they have an advskew of 100 (so under normal conditions, enrhfw1 will be master and enrhfw2 will be backup). Here's what happens when I reboot enrhfw2, since upgrading them both to OpenBSD 5.1. These are the relevant logs from both, with net.inet.carp.log=3: [Looks like the syncdev interface status on enrhfw1 cycles 3 times while the server being rebooted has its BIOS splash screen up. The first of these triggers enrhfw1 to request a pfsync bulk update from its currently-rebooting peer]: May 28 14:02:48 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 1 (pfsyncdev) May 28 14:02:48 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 1 (pfsyncdev) May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 (pfsync bulk start) May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 (pfsync bulk start) May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 (pfsyncdev) May 28 14:02:50 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 (pfsyncdev) May 28 14:02:54 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 (pfsyncdev) May 28 14:02:54 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 (pfsyncdev) May 28 14:02:57 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 (pfsyncdev) May 28 14:02:57 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 (pfsyncdev) May 28 14:03:03 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 (pfsyncdev) May 28 14:03:03 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 (pfsyncdev) May 28 14:03:05 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 (pfsyncdev) May 28 14:03:05 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 (pfsyncdev) [This last cycle appears to be when the OpenBSD kernel loads and the syncdev on enrhfw2 is then initialized by netstart.sh]: May 28 14:03:40 enrhfw1 /bsd: carp: pfsync0 demoted group carp by 1 to 2 (pfsyncdev) May 28 14:03:40 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by 1 to 2 (pfsyncdev) May 28 14:03:45 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 1 (pfsyncdev) May 28 14:03:45 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 1 (pfsyncdev) [Here the rc scripts on enrhfw2 initialize all its interfaces, and request a pfsync bulk update from enrhfw1 (although note that at this stage enrhfw1 is still in the process of attempting a bulk update in the reverse direction!). The rc script's "carp interlock" is in place at this stage]: May 28 14:03:45 enrhfw2 /bsd: carp: carp0 demoted group carp by 1 to 129 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp1 demoted group carp by 1 to 130 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp2 demoted group carp by 1 to 131 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp3 demoted group carp by 1 to 132 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp4 demoted group carp by 1 to 133 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp5 demoted group carp by 1 to 134 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: pfsync0 demoted group carp by 1 to 135 (pfsync bulk start) May 28 14:03:45 enrhfw2 /bsd: carp: pfsync0 demoted group pfsync by 1 to 1 (pfsync bulk start) May 28 14:03:45 enrhfw2 /bsd: carp: carp5 demoted group carp by -1 to 134 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp5 demoted group smartph by -1 to 0 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp0 demoted group carp by -1 to 133 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp0 demoted group int by -1 to 0 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp1 demoted group carp by -1 to 132 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp1 demoted group pubdmz by -1 to 0 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp3 demoted group carp by -1 to 131 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp3 demoted group extmgmt by -1 to 0 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp4 demoted group carp by -1 to 130 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp4 demoted group trext by -1 to 0 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp2 demoted group carp by -1 to 129 (carpdev) May 28 14:03:45 enrhfw2 /bsd: carp: carp2 demoted group ext by -1 to 0 (carpdev) May 28 14:03:55 enrhfw2 /bsd: carp: pfsync0 demoted group carp by -1 to 128 (pfsync bulk done) May 28 14:03:55 enrhfw2 /bsd: carp: pfsync0 demoted group pfsync by -1 to 0 (pfsync bulk done) [Presumably the rc scripts on enrhfw2 complete at 14:04:03, such that the carp interlock is removed, this results in the expected backup system assuming master status, as enrhfw1 is still demoted due to the bulk update]: May 28 14:04:03 enrhfw2 /bsd: carp0: state transition: BACKUP -> MASTER May 28 14:04:03 enrhfw2 /bsd: carp5: state transition: BACKUP -> MASTER May 28 14:04:03 enrhfw2 /bsd: carp2: state transition: BACKUP -> MASTER May 28 14:04:03 enrhfw2 /bsd: carp3: state transition: BACKUP -> MASTER May 28 14:04:03 enrhfw2 /bsd: carp4: state transition: BACKUP -> MASTER May 28 14:04:03 enrhfw2 /bsd: carp1: state transition: BACKUP -> MASTER May 28 14:04:03 enrhfw1 /bsd: carp0: state transition: MASTER -> BACKUP May 28 14:04:03 enrhfw1 /bsd: carp5: state transition: MASTER -> BACKUP May 28 14:04:03 enrhfw1 /bsd: carp2: state transition: MASTER -> BACKUP May 28 14:04:03 enrhfw1 /bsd: carp3: state transition: MASTER -> BACKUP May 28 14:04:03 enrhfw1 /bsd: carp1: state transition: MASTER -> BACKUP May 28 14:04:03 enrhfw1 /bsd: carp4: state transition: MASTER -> BACKUP [Finally the bulk update on enrhfw1 completes many minutes later, and it un-demotes itself and returns to master status]: May 28 14:07:16 enrhfw1 /bsd: carp: pfsync0 demoted group carp by -1 to 0 (pfsync bulk done) May 28 14:07:16 enrhfw1 /bsd: carp: pfsync0 demoted group pfsync by -1 to 0 (pfsync bulk done) May 28 14:07:16 enrhfw1 /bsd: carp0: state transition: BACKUP -> MASTER May 28 14:07:16 enrhfw1 /bsd: carp5: state transition: BACKUP -> MASTER May 28 14:07:16 enrhfw1 /bsd: carp2: state transition: BACKUP -> MASTER May 28 14:07:16 enrhfw1 /bsd: carp3: state transition: BACKUP -> MASTER May 28 14:07:16 enrhfw1 /bsd: carp4: state transition: BACKUP -> MASTER May 28 14:07:16 enrhfw1 /bsd: carp1: state transition: BACKUP -> MASTER So essentially, the backup system becomes master shortly after it is rebooted, and stays that way for a couple of minutes. That doesn't seem like expected behaviour to me (and I don't think it did that prior to upgrading from OpenBSD 5.0 to 5.1). My guess is this was introduced by the changes to if_pfsync.c in revision 1.180. Is this in fact the intended behaviour now, or am I missing some required configuration to ensure this doesn't happen? This doesn't actually cause any problems by itself, other than not being what I'd expect. However on one of my other pairs of OpenBSD systems (which differs from the pair described here only slightly), the secondary often seems to take over before it's in a suitable state to do so. That's quite probably an entirely different issue (which I'm only partway to tracking down), but the fact that a failover event now always occurs upon reboot of the secondary system really exacerbates that. I've deliberately not detailed that other problem for now, to avoid confusing matters regarding this first observation, but will do so afterwards. Also note that I do have a workaround for both these problems (namely, artificially extending the CARP interlock by sleeping in /etc/rc.local until having detected that the peer is not demoted). Thanks for any insight, Alex