Hi list! I've run into a situation with PF which I don't quite understand.
The situation is as follows: I have 2 OpenBSD firewalls connected to an upstream provider which forwards traffic to us via equal cost multi path routing (ECMP). The firewalls are connected via a crossover cable over which pfsync is configured. On the inside the firewalls are each connected with 2 cables (with LACP) to 2 different switches which are in an MLAG configuration (so these 2 switches function as 1 switch). The OpenBSD firewalls are running OpenBSD 6.0 with all patches applied.

It looks like this (public IPs changed):

                   OUTSIDE / UPSTREAM

      GW: 192.168.116.21         GW: 192.168.216.21
                    +               ^
                    |               |
           vlan1604 |               | vlan2604
     192.168.116.22 |               | 192.168.216.22
                    |               |
                +---v---+      +----+--+
                | FW 1  +------+ FW 2  |
                +---+---+      +----+--+
           vlan1003 |               ^ vlan1003
       17.214.19.49 |               | 17.214.19.50
                    +---------------+

                         INSIDE

Now on both firewalls I have this really simple ruleset:

-------------------------
# cat /etc/pf.conf
set skip on lo0

# Interface connected with crossover cable to other firewall for
# pfsync.
set skip on em1

block log
pass log quick proto tcp to port 22
-------------------------

Which results in the following PF rules:

-------------------------
# pfctl -sr
block drop log all
pass log quick proto tcp from any to any port = 22 flags S/SA
-------------------------

Now when I SSH from the outside world to 17.214.19.50, the traffic flows as indicated in the diagram (although it's ECMP, upstream seems to prefer FW 1, so traffic always ends up there):

[Internet] Me (62.187.45.178)
    |
    V
[FW1]vlan1604
    |
    V
[FW1]vlan1003
    |
    V
[FW2]vlan1003
    |
    V
[FW2]vlan2604
    |
    V
[Internet] Me

And this works. However, after about 30 seconds I lose the connection to the 17.214.19.50 host because PF can't match the traffic on FW1 vlan1003 to the established state. I'm typing random stuff into the SSH session to keep it active and then it just hangs.
The pflog output looks like this (public IPs changed):

-------------------------
# tcpdump -nettti pflog0 port 22 and host 17.214.19.50
tcpdump: WARNING: snaplen raised from 116 to 160
tcpdump: listening on pflog0, link-type PFLOG
Oct 20 10:30:11.299997 rule 1/(match) pass in on vlan1604: 62.187.45.178.64072 > 17.214.19.50.22: S 4112726507:4112726507(0) win 29200 <mss 1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)
Oct 20 10:30:11.300026 rule 1/(match) pass out on vlan1003: 62.187.45.178.64072 > 17.214.19.50.22: S 4112726507:4112726507(0) win 29200 <mss 1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)
Oct 20 10:30:44.330002 rule 0/(match) block out on vlan1003: 62.187.45.178.64072 > 17.214.19.50.22: P 4112740387:4112740427(40) ack 2507834833 win 594 <nop,nop,timestamp 6484253 2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.425886 rule 0/(match) block out on vlan1003: 62.187.45.178.64072 > 17.214.19.50.22: P 40:80(40) ack 1 win 594 <nop,nop,timestamp 6484349 2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.436021 rule 0/(match) block out on vlan1003: 62.187.45.178.64072 > 17.214.19.50.22: P 40:80(40) ack 1 win 594 <nop,nop,timestamp 6484359 2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.514107 rule 0/(match) block out on vlan1003: 62.187.45.178.64072 > 17.214.19.50.22: P 80:120(40) ack 1 win 594 <nop,nop,timestamp 6484437 2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.618079 rule 0/(match) block out on vlan1003: 62.187.45.178.64072 > 17.214.19.50.22: P 120:160(40) ack 1 win 594 <nop,nop,timestamp 6484541 2782905123> (DF) [tos 0x10]
-------------------------

It seems that PF all of a sudden no longer treats the SSH traffic as part of the established connection.
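As a rough check of the "about 30 seconds" figure, the gap between the first passed SYN and the first blocked packet can be computed from the two timestamps in the capture. This is only a minimal Python sketch: the function name seconds_between is made up for illustration, and it assumes both timestamps fall on the same day, so only the time-of-day field is compared.

```python
from datetime import datetime

# Timestamps copied from the pflog capture.
FIRST_PASS = "Oct 20 10:30:11.299997"   # SYN passed in on vlan1604
FIRST_BLOCK = "Oct 20 10:30:44.330002"  # first packet blocked out on vlan1003


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps on the same day.

    Only the last whitespace-separated token (HH:MM:SS.ffffff) is parsed;
    the month and day are ignored, which is an assumption of this sketch.
    """
    fmt = "%H:%M:%S.%f"
    t0 = datetime.strptime(start.split()[-1], fmt)
    t1 = datetime.strptime(end.split()[-1], fmt)
    return (t1 - t0).total_seconds()


gap = seconds_between(FIRST_PASS, FIRST_BLOCK)
print(f"{gap:.6f} seconds between first pass and first block")
```

The result is a little over 33 seconds, which matches the observation that the session hangs roughly half a minute after it is established.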
The PF state table shows that the state was correctly added, synced between the firewalls, and is still there:

-----------------------------------
# pfctl -ss
all carp 17.214.19.49 -> 17.214.19.50 SINGLE:NO_TRAFFIC
all carp 10.100.0.2 -> 10.100.0.3 SINGLE:NO_TRAFFIC
all carp 10.100.2.2 -> 10.100.2.3 SINGLE:NO_TRAFFIC
all tcp 17.214.19.49:22 <- 62.187.45.178:65149 ESTABLISHED:CLOSING
all tcp 17.214.19.49:22 <- 62.187.45.178:58883 ESTABLISHED:CLOSING
all tcp 17.214.19.49:22 <- 62.187.45.178:59505 ESTABLISHED:ESTABLISHED
all tcp 17.214.19.49:22 <- 62.187.45.178:63889 ESTABLISHED:FIN_WAIT_2
all tcp 17.214.19.49:22 <- 62.187.45.178:63963 ESTABLISHED:ESTABLISHED
all tcp 17.214.19.49:22 <- 62.187.45.178:63235 ESTABLISHED:ESTABLISHED
all tcp 17.214.19.50:22 <- 62.187.45.178:54705 FIN_WAIT_2:FIN_WAIT_2
all tcp 17.214.19.50:22 <- 62.187.45.178:64072 ESTABLISHED:ESTABLISHED
all tcp 17.214.19.50:22 <- 119.249.54.68:38527 TIME_WAIT:TIME_WAIT
all tcp 17.214.19.49:22 <- 221.194.47.224:60327 TIME_WAIT:TIME_WAIT
all tcp 17.214.19.50:22 <- 221.194.47.224:53897 TIME_WAIT:TIME_WAIT
-----------------------------------

The relevant PF state here (identified in the pflog tcpdump as the SSH session that disconnected) is:

all tcp 17.214.19.50:22 <- 62.187.45.178:64072 ESTABLISHED:ESTABLISHED

which seems okay.

What I also find odd is that PF allows the packet to traverse the vlan1604 (external) interface and then decides that it can't traverse the vlan1003 (internal) interface. Why isn't it a problem for the vlan1604 interface?

It should be noted that the vlan1003 interface sits on a trunk interface (trunk0, configured as LACP). I don't see how, but this might be related.

I'm at a loss here as I really can't explain the PF behavior I'm seeing. Am I missing something? Could this be a bug?

Regards,

Jasper