Hi list!

I've run into a situation with PF which I don't quite understand.

The situation is as follows: I have 2 OpenBSD firewalls connected to an
upstream provider which forwards traffic to us via equal-cost multi-path
routing (ECMP). The firewalls are connected via a crossover cable
over which pfsync is configured. On the inside, the firewalls are each
connected with 2 cables (with LACP) to 2 different switches which
are in an MLAG configuration (so these 2 switches function as 1 switch).
The firewalls are running OpenBSD 6.0 with all patches applied.

It looks like this (public IPs changed):

                 OUTSIDE / UPSTREAM                

  GW: 192.168.116.21      GW: 192.168.216.21
               +               ^
               |               |
      vlan1604 |               | vlan2604
192.168.116.22 |               | 192.168.216.22
               |               |
           +---v---+      +----+--+
           | FW 1  +------+ FW 2  |
           +---+---+      +----+--+
     vlan1003  |               ^   vlan1003
 17.214.19.49  |               |   17.214.19.50
               +---------------+

                    INSIDE

Now on both firewalls I have this really simple ruleset:

-------------------------
# cat /etc/pf.conf
set skip on lo0
# Interface connected with crossover cable to other firewall for
# pfsync.
set skip on em1

block log

pass log quick proto tcp to port 22
-------------------------

Which results in the following PF rules:
-------------------------
# pfctl -sr
block drop log all
pass log quick proto tcp from any to any port = 22 flags S/SA
-------------------------

Now when I SSH from the outside world to 17.214.19.50, the traffic flows
as indicated in the diagram (although it is ECMP, upstream seems to prefer
FW 1, so traffic always ends up there):

[Internet] Me (62.187.45.178)
     |
     V
[FW1]vlan1604 
     |
     V
[FW1]vlan1003
     |
     V
[FW2]vlan1003 
     |
     V
[FW2]vlan2604 
     |
     V
[Internet] Me 

And this works. However, after about 30 seconds I lose the connection to
the 17.214.19.50 host because PF can't match the traffic on FW1 vlan1003
to the established state. I type random stuff into the SSH session
to keep it active, and then it just hangs. It looks like this
(public IPs changed):

-------------------------
# tcpdump -nettti pflog0 port 22 and host 17.214.19.50 
tcpdump: WARNING: snaplen raised from 116 to 160
tcpdump: listening on pflog0, link-type PFLOG
Oct 20 10:30:11.299997 rule 1/(match) pass in on vlan1604: 62.187.45.178.64072 >
17.214.19.50.22: S 4112726507:4112726507(0) win 29200 <mss 1460,sackOK,timestamp
6451222 0,nop,wscale 7> (DF)
Oct 20 10:30:11.300026 rule 1/(match) pass out on vlan1003: 62.187.45.178.64072
> 17.214.19.50.22: S 4112726507:4112726507(0) win 29200 <mss
1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)



Oct 20 10:30:44.330002 rule 0/(match) block out on vlan1003: 62.187.45.178.64072
> 17.214.19.50.22: P 4112740387:4112740427(40) ack 2507834833 win 594
<nop,nop,timestamp 6484253 2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.425886 rule 0/(match) block out on vlan1003: 62.187.45.178.64072
> 17.214.19.50.22: P 40:80(40) ack 1 win 594 <nop,nop,timestamp 6484349
2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.436021 rule 0/(match) block out on vlan1003: 62.187.45.178.64072
> 17.214.19.50.22: P 40:80(40) ack 1 win 594 <nop,nop,timestamp 6484359
2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.514107 rule 0/(match) block out on vlan1003: 62.187.45.178.64072
> 17.214.19.50.22: P 80:120(40) ack 1 win 594 <nop,nop,timestamp 6484437
2782905123> (DF) [tos 0x10]
Oct 20 10:30:44.618079 rule 0/(match) block out on vlan1003: 62.187.45.178.64072
> 17.214.19.50.22: P 120:160(40) ack 1 win 594 <nop,nop,timestamp 6484541
2782905123> (DF) [tos 0x10]
-------------------------
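
To put a number on "about 30 seconds", here is a small sketch that computes
the gap between the passed SYN and the first blocked packet, plus how far
the client's sequence number had advanced by then. All values are copied
from the log above; the variable and function names are just ones I made up
for this sketch, and I don't know whether either number is significant.

```shell
#!/bin/sh
# Timestamps copied from the pflog output (same day, HH:MM:SS.usec).
syn_time="10:30:11.299997"     # SYN passed in on vlan1604
block_time="10:30:44.330002"   # first packet blocked out on vlan1003

# Convert HH:MM:SS.usec to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.6f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed time between the SYN and the first blocked packet.
gap=$(awk -v a="$(to_seconds "$syn_time")" \
          -v b="$(to_seconds "$block_time")" \
          'BEGIN { printf "%.2f", b - a }')

# Sequence numbers copied from the same two log entries. awk is used so
# the arithmetic does not depend on the shell's integer width.
isn=4112726507           # initial sequence number in the SYN
blocked_seq=4112740387   # sequence number of the first blocked segment
seq_offset=$(awk -v a="$isn" -v b="$blocked_seq" \
    'BEGIN { printf "%d", b - a }')

echo "seconds until first block: $gap"
echo "sequence offset at first block: $seq_offset"
```

So the first block shows up roughly 33 seconds after the SYN, at an offset
of 13880 from the initial sequence number.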

In the log above, rule 1 is the pass rule and rule 0 is the block rule,
so it seems that PF all of a sudden doesn't see the SSH traffic as part
of the established connection anymore. The state table of PF shows that
the state was correctly added and synced between the firewalls, and it
is also still there:

-----------------------------------
# pfctl -ss
all carp 17.214.19.49 -> 17.214.19.50           SINGLE:NO_TRAFFIC
all carp 10.100.0.2 -> 10.100.0.3               SINGLE:NO_TRAFFIC
all carp 10.100.2.2 -> 10.100.2.3               SINGLE:NO_TRAFFIC
all tcp 17.214.19.49:22 <- 62.187.45.178:65149  ESTABLISHED:CLOSING
all tcp 17.214.19.49:22 <- 62.187.45.178:58883  ESTABLISHED:CLOSING
all tcp 17.214.19.49:22 <- 62.187.45.178:59505  ESTABLISHED:ESTABLISHED
all tcp 17.214.19.49:22 <- 62.187.45.178:63889  ESTABLISHED:FIN_WAIT_2
all tcp 17.214.19.49:22 <- 62.187.45.178:63963  ESTABLISHED:ESTABLISHED
all tcp 17.214.19.49:22 <- 62.187.45.178:63235  ESTABLISHED:ESTABLISHED
all tcp 17.214.19.50:22 <- 62.187.45.178:54705  FIN_WAIT_2:FIN_WAIT_2
all tcp 17.214.19.50:22 <- 62.187.45.178:64072  ESTABLISHED:ESTABLISHED
all tcp 17.214.19.50:22 <- 119.249.54.68:38527  TIME_WAIT:TIME_WAIT
all tcp 17.214.19.49:22 <- 221.194.47.224:60327 TIME_WAIT:TIME_WAIT
all tcp 17.214.19.50:22 <- 221.194.47.224:53897 TIME_WAIT:TIME_WAIT
-----------------------------------

The relevant PF state here is (as identified in the pflog tcpdump
as the SSH session that disconnected):

all tcp 17.214.19.50:22 <- 62.187.45.178:64072  ESTABLISHED:ESTABLISHED

which seems okay. 

What I also find odd is that PF allows the packet to
traverse the vlan1604 (external) interface and then decides that it
can't traverse the vlan1003 (internal) interface. Why isn't it a
problem for the vlan1604 interface? It should be noted that the
vlan1003 interface sits on a trunk interface (trunk0, configured as
LACP). I don't see how, but this might be related.

I'm at a loss here, as I really can't explain the behavior I'm seeing
from PF. Am I missing something? Could this be a bug?

Regards,

Jasper
