Re: IP6 redirects through relayd no longer working reliably

2023-06-28 Thread Markus Wernig
Just for the record: the problem was caused by a malfunctioning upstream
gateway, which no longer responded properly to neighbor solicitation
requests.


The SYN-ACK from the server was dropped because the firewall had already
removed the state created by the SYN.
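
To illustrate why a late SYN-ACK gets dropped, here is a toy model of a
stateful filter: a reply is only accepted while the state created by the
outgoing SYN still exists. This is only a sketch, not how pf is
implemented. The class name ToyStateTable and the 30-second lifetime are
assumptions chosen for illustration and do not correspond to any pf
setting.

```python
class ToyStateTable:
    """Toy model of a stateful packet filter (not pf's implementation)."""

    def __init__(self, timeout):
        self.timeout = timeout   # hypothetical state lifetime in seconds
        self.states = {}         # (src, sport, dst, dport) -> creation time

    def outbound_syn(self, now, src, sport, dst, dport):
        # Passing the SYN creates a state for this flow.
        self.states[(src, sport, dst, dport)] = now

    def inbound_reply(self, now, src, sport, dst, dport):
        # A reply is matched against the state of the reversed flow.
        key = (dst, dport, src, sport)
        created = self.states.get(key)
        if created is None or now - created > self.timeout:
            self.states.pop(key, None)   # state expired or never existed
            return "block"
        return "pass"


fw = ToyStateTable(timeout=30.0)
fw.outbound_syn(0.0, "client", 50210, "server", 443)
fw.outbound_syn(0.0, "client", 50211, "server", 443)
print(fw.inbound_reply(0.1, "server", 443, "client", 50210))    # prompt reply
print(fw.inbound_reply(31.4, "server", 443, "client", 50211))   # late reply
```

In this model the prompt reply is passed and the reply arriving after the
state lifetime is blocked, which matches the pattern seen in pflog.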


On 6/23/23 22:51, Markus Wernig wrote:

pflog shows that the IPv6 SYN-ACK replies from the backend servers are 
being dropped by pf. But weirdly the blocks are logged over 30 seconds 
after the SYN is allowed through:


IP6 redirects through relayd no longer working reliably

2023-06-23 Thread Markus Wernig

Hi all

(Sorry for flooding, this seems related to the question I asked earlier. 
Please bear with me.)


I am using relayd on 7.3-release as an IP load balancer in front of some
dual-stack backend hosts. This setup has worked for some years now.


After upgrading to 7.3 about 4 weeks ago, I noticed a steady decline in
IPv6 sessions reaching the backend servers, to the point where none have
arrived at all for the past 2 days.


Users are now complaining that their connections to the servers
(public IP) either time out or are established only after a very
long delay (usually the TCP connect timeout, after which the client
falls back from IPv6 to IPv4). IPv4 connections succeed immediately.


pflog shows that the IPv6 SYN-ACK replies from the backend servers are 
being dropped by pf. But weirdly the blocks are logged over 30 seconds 
after the SYN is allowed through:



Jun 20 14:12:49.489707 rule 2/(match) [uid 0, pid 85766] pass out on vlanX: [Client.IP6].50210 > [Server.IP6].443: S 2508622700:2508622700(0) win 64800 <[|tcp]> [flowlabel 0xd4400] (len 32, hlim 52)
Jun 20 14:12:49.493267 rule 2/(match) [uid 0, pid 85766] pass out on vlanX: [Client.IP6].50211 > [Server.IP6].443: S 806421981:806421981(0) win 64800 <[|tcp]> [flowlabel 0x162e5] (len 32, hlim 52)
Jun 20 14:12:49.507508 rule 2/(match) [uid 0, pid 85766] pass out on vlanX: [Client.IP6].50212 > [Server.IP6].443: S 3945655871:3945655871(0) win 64800 <[|tcp]> [flowlabel 0x8abc6] (len 32, hlim 52)
Jun 20 14:12:49.517783 rule 2/(match) [uid 0, pid 85766] pass out on vlanX: [Client.IP6].50213 > [Server.IP6].443: S 1191028748:1191028748(0) win 64800 <[|tcp]> [flowlabel 0xa7d6] (len 32, hlim 52)


Jun 20 14:13:20.943370 rule 2/(match) [uid 0, pid 85766] block in on vlanX: [Server.IP6].443 > [Client.IP6].50213: S 3650589557:3650589557(0) ack 209077342 win 64800 <[|tcp]> [flowlabel 0xd922c] (len 32, hlim 64)
Jun 20 14:13:20.943433 rule 2/(match) [uid 0, pid 85766] block in on vlanX: [Server.IP6].443 > [Client.IP6].50212: S 2068945110:2068945110(0) ack 2313561433 win 64800 <[|tcp]> [flowlabel 0xf8c9c] (len 32, hlim 64)
Jun 20 14:13:20.943476 rule 2/(match) [uid 0, pid 85766] block in on vlanX: [Server.IP6].443 > [Client.IP6].50211: S 3395939328:3395939328(0) ack 1849611325 win 64800 <[|tcp]> [flowlabel 0xb519e] (len 32, hlim 64)
Jun 20 14:13:20.943518 rule 2/(match) [uid 0, pid 85766] block in on vlanX: [Server.IP6].443 > [Client.IP6].50210: S 106368970:106368970(0) ack 1534267447 win 64800 <[|tcp]> [flowlabel 0xca19a] (len 32, hlim 64)


(The rule 2 that is logged is the rule number of the relayd/* anchor.)
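
For anyone wanting to quantify the gap, here is a minimal sketch that
pairs each passed SYN with the blocked SYN-ACK for the same client port
and prints the delay. It assumes each pflog entry has been joined onto a
single line; the sample entries are two of the connections from the
excerpt above, truncated after the TCP flag. The function name
synack_delays and the regular expression are illustrative and not part
of any tool.

```python
import re
from datetime import datetime

# Two connections from the pflog excerpt, truncated after the "S" flag.
LOG = """\
Jun 20 14:12:49.489707 rule 2/(match) [uid 0, pid 85766] pass out on vlanX: [Client.IP6].50210 > [Server.IP6].443: S
Jun 20 14:12:49.517783 rule 2/(match) [uid 0, pid 85766] pass out on vlanX: [Client.IP6].50213 > [Server.IP6].443: S
Jun 20 14:13:20.943370 rule 2/(match) [uid 0, pid 85766] block in on vlanX: [Server.IP6].443 > [Client.IP6].50213: S
Jun 20 14:13:20.943518 rule 2/(match) [uid 0, pid 85766] block in on vlanX: [Server.IP6].443 > [Client.IP6].50210: S
"""

# Timestamp, action, direction, then source and destination ports.
ENTRY = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d\.\d+) .*?"
    r"\b(?P<action>pass|block) (?P<dir>in|out) on \S+ "
    r"\S+\.(?P<sport>\d+) > \S+\.(?P<dport>\d+):"
)


def synack_delays(log_text):
    """Map client port -> seconds between passed SYN and blocked reply."""
    syn_seen = {}
    delays = {}
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%b %d %H:%M:%S.%f")
        if m["action"] == "pass" and m["dir"] == "out":
            # Outgoing SYN: the client port is the source port.
            syn_seen[int(m["sport"])] = ts
        elif m["action"] == "block" and m["dir"] == "in":
            # Blocked reply: the client port is the destination port.
            port = int(m["dport"])
            if port in syn_seen:
                delays[port] = (ts - syn_seen[port]).total_seconds()
    return delays


for port, delay in sorted(synack_delays(LOG).items()):
    print(f"client port {port}: reply blocked {delay:.3f} s after the SYN")
```

For the two sample connections the delay comes out at a little over 31
seconds each.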

tcpdump on vlanX shows the backend server sends the SYN-ACK immediately.

The IPv4 addresses are NATed from public to RFC 1918 space and work.

For IPv6, the address of backend server.A is used as the public IP
(service.pub). Packets are redirected to server.B only if server.A
becomes unavailable.
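
The intended failover behaviour can be sketched as follows. This only
models what I expect from the setup (primary table first, backup table
only when no primary host is up); it is not relayd's code, and the
function name select_table and the health dictionary are hypothetical.

```python
def select_table(tables, health):
    """Return (table_name, up_hosts) for the first table with a host up.

    tables: ordered list of (name, [hosts]), primary first.
    health: mapping host -> True if its check currently succeeds.
    """
    for name, hosts in tables:
        up = [h for h in hosts if health.get(h, False)]
        if up:
            return name, up
    return None, []


tables = [("server.A", ["Server.A.IP6"]), ("server.B", ["Server.B.IP6"])]

# Both hosts healthy: traffic stays on server.A.
print(select_table(tables, {"Server.A.IP6": True, "Server.B.IP6": True}))
# server.A down: traffic moves to server.B.
print(select_table(tables, {"Server.A.IP6": False, "Server.B.IP6": True}))
```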


relayd.conf:
...
table  {
   Server.A.IP6 retry 2
}
table  {
   Server.B.IP6 retry 2
}
redirect "service.pub.80.v6" {
  listen on Server.A.IP6 tcp port 80 interface trunk0
  forward to  port 80 \
    check http "/" host "server.A" code 200
  forward to  port 80 \
    check http "/" host "server.B" code 200
}
redirect "service.pub.443.v6" {
  listen on Server.A.IP6 tcp port 443 interface trunk0
  forward to  port 443 \
    check https "/" host "server.A" code 200
  forward to  port 443 \
    check https "/" host "server.B" code 200
}

I am not 100% sure that the IPv6 failover actually worked before, but 
the connections to Server.A.IP6 were definitely working.

I do see the http and https checks succeed on both backend servers.

I've tried flushing the states and rebooting the firewall, to no avail.

relayctl shows all redirects/tables as active and all hosts as up:

2   redirect  service.pub.80.v6              active
3   table     server.A:80                    active (1 hosts)
3   host      Server.A.IP6          100.00%  up
4   table     server.B:80                    active (1 hosts)
4   host      Server.B.IP6          100.00%  up

3   redirect  service.pub.443.v6             active
5   table     server.A:443                   active (1 hosts)
5   host      Server.A.IP6          100.00%  up
6   table     server.B:443                   active (1 hosts)
6   host      Server.B.IP6          100.00%  up


Now I'm out of ideas on how to debug this further.

Has anyone been experiencing something similar?
Has something fundamental changed in relayd or pf that could cause this?
Does anybody spot an error in my configuration?

Thanks for any pointers!

Best regards
Markus