Hi, thanks for responding. It appeared as though my DMARC policy had caused problems with this mailing list, but I guess not.

Our backup system running 7.4 has not panicked or experienced the ix failure recently. This may be related to an STP issue we addressed on our switches that we discovered while testing a trunk failure. I don't know why that would be the cause, but that is the only thing that has changed with the network since the panics stopped. That said, since it is in a backup role, it is not experiencing much traffic other than what it generates itself.

Our gateways are configured as follows. Note that x and y are used where values differ between the 2 gateways. We have many vlan/carp/wg interfaces, but they all follow the same configuration patterns. All physical interfaces are listed below, but for the sake of brevity, I have only included one example of each pattern for vlan/carp/wg. If necessary, I can provide the literal config but would prefer to abstract some of the values if possible.

em0 is only used for pxe booting and ipmi failover:

/etc/hostname.em0
    down

em1 is used for pfsync:

/etc/hostname.em1
    up
    inet 169.254.0.x/30

/etc/hostname.pfsync0
    syncdev em1 defer
    up

veb1 is for management interfaces on the switches and one link connects the 2 gateways:

/etc/hostname.veb1
    add vport1
    add em2
    add em3
    add em4
    add em5
    up

/etc/hostname.em2-5
    up

/etc/hostname.vport1
    up
    inet 10.1.1.x/24

/etc/hostname.carp1
    vhid 1 pass $(cat /etc/carp/carp1.pk) carpdev vport1 advskew y
    inet 10.1.1.3/24

All VLANs run over trunk0 as well as a native network:

/etc/hostname.trunk0
    trunkproto failover trunkport ix0 trunkport ix1
    up
    inet 10.1.2.x/24

/etc/hostname.ix0-1
    up

/etc/hostname.carp2
    vhid 2 pass $(cat /etc/carp/carp2.pk) carpdev trunk0 advskew y
    up
    inet 10.1.2.3/24

There are several internal vlans, but they all follow the same pattern (each with carp):

/etc/hostname.vlan10
    vnetid 10 parent trunk0
    up
    inet 10.1.10.x/24

/etc/hostname.carp10
    vhid 10 pass $(cat /etc/carp/carp10.pk) carpdev vlan10 advskew y
    up
    inet 10.1.10.3/24

WAN is delivered over VLAN as well.
The configuration is essentially the same as the internal VLANs:

/etc/hostname.vlan3
    vnetid 3 parent trunk0
    up
    inet redacted/29

/etc/hostname.carp3
    vhid 3 pass $(cat /etc/carp/carp3.pk) carpdev vlan3 advskew y
    inet redacted/29

This site serves as a remote hub, so there are several wireguard interfaces which connect other sites and some work-from-home staff. They follow one of 2 configuration patterns.

Sites (note that some sites do not have static IPs, so endpoint is only specified on the remote side):

/etc/hostname.wg11
    wgkey $(cat /etc/wg/wg11.pk)
    wgpeer 5bsVLcnGnecuioC3KGlwiBiYHhOiGb9x9j37bYFu2SY= wgaip 10.1.11.20/32 wgaip 10.2.0.0/16 wgpka 5
    wgpeer OjxxFGgfl2Wv1lgrdbJ0CuGIaTiRlnNSwvl9x8VhoUs= wgaip 10.1.11.30/32 wgaip 10.3.0.0/16 wgpka 5
    wgpeer /m2pSzXs2m/IQi3FsedqBek6uAf59rnxWW5oNN7wDWc= wgaip 10.1.11.40/32 wgaip 10.4.0.0/16 wgpka 5
    wgport 60011
    up
    inet 10.1.11.1/24
    !route add 10.2.0.0/16 10.1.11.20
    !route add 10.3.0.0/16 10.1.11.30
    !route add 10.4.0.0/16 10.1.11.40

Work-from-home staff:

/etc/hostname.wg111
    wgkey $(cat /etc/wg/wg111.pk)
    wgpeer V9odqDEecoJVh8QJk9Erdfj937C16qS3ZcEW3GcLUR8= wgaip 10.1.111.100/32
    wgpeer 9fKKiAdLfrd4IaJ6rWTmmVres/0Me1pSrW1Wee2FYGg= wgaip 10.1.111.101/32
    wgpeer JLtkxpZjpdkyPuBmBEXjFcp4kK8VvtKzRv4fIp1+V10= wgaip 10.1.111.102/32
    wgpeer DmVK49Sv6t8U9th+24Vq+pYqW7rikuWPpB0tguKFHTY= wgaip 10.1.111.103/32
    wgpeer Gy1v60ZhY9MU2l3DxqApQceYrOCCag9YOlBUtoiB/Fk= wgaip 10.1.111.104/32
    wgpeer K67vsSswUOHCF81wnoLrZT6tcUDslcjyDzPxGtWwy3I= wgaip 10.1.111.105/32
    wgpeer f1GmX++jQZvZMd0zXXdG8GxWuK0F7+A1nZc9PJhZfG0= wgaip 10.1.111.106/32
    wgport 60111
    up
    inet 10.1.111.1/24

The switches are Mikrotik. All interfaces are pvid 1 and tagged with the vlans configured on the gateways except for the WAN uplinks which are untagged/pvid 3. The STP issue we experienced was that in the event of a trunk0 failover, all internal VLANs would work but WAN would not.
I am not sure of the mechanics of this failure, but it was remedied by manually configuring the uplinks as edge ports. Note that since failover trunks in OpenBSD do not utilize gratuitous ARP to notify the switches of the failover, we use ifstated to ping in the event of either ix interface going down (which works).

if ix0.link.down || ix0.link.up || ix1.link.down || ix1.link.up {
    run "ping -q -c 10 -w 1 10.1.0.x"
    run "logger -t ifstated a 'link state changed in trunk0'"
}

A curious detail of this failure was that, while WAN was unreachable, pinging out from the gateway was not even detected locally with tcpdump on vlan3. It was as if the VLAN was paralyzed by whatever was going on with STP on the switches. Again, I have no idea if that is related to the panics, but it was odd. I can reproduce this if it's helpful.

On Tue, Jan 2, 2024 at 12:45 AM Jan Klemkow <j.klem...@wemelug.de> wrote:

> On Thu, Dec 14, 2023 at 11:22:27PM +0000, John Clendenen wrote:
> > >Synopsis: Panics occurring about once every 24 hours since 7.4 upgrade.
> >     Appears to be network related. Disabling LRO on ix devices helps.
> > >Category: amd64
> > >Environment:
> >     System      : OpenBSD 7.4
> >     Details     : OpenBSD 7.4 (GENERIC.MP) #2: Fri Dec 8 15:39:04 MST 2023
> >         r...@syspatch-74-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> >     Architecture: OpenBSD.amd64
> >     Machine     : amd64
> > >Description:
> > We have 2 Supermicro 5018D-FN8T systems serving as HA gateways at a
> > colocation. They run a moderately complex network stack including veb,
> > trunk, vlan, carp and wireguard interfaces across ix and em hardware. This
> > configuration was deployed on 7.2 and upgraded to 7.3 soon after its
> > release. The configuration has not changed much since and the systems have
> > been very reliable. After upgrading to 7.4, we see kernel panics about once
> > a day. We have re-imaged OpenBSD 7.3 on one of the units while we continue
> > to test 7.4 on the 2nd.
> > Note that the panics are not consistent. The most
> > recent one involves memcpy, pf and wireguard, but previous ones have
> > involved the ix driver. This is the first bug report I've had to file for
> > OpenBSD, so please forgive my inexperience. I have screenshots of the most
> > recent panic and trace. Happy to provide more info as needed.
>
> What combination of em(4), ix(4), veb(4), trunk(4), vlan(4), carp(4) and
> wg(4) do you use?
>
> Could you provide your /etc/hostname.* files?
>
> > I couldn't determine exactly where to run the objdump per the ddb
> > documentation.
> >
> > >How-To-Repeat:
> > Not strictly repeatable but will occur roughly once a day while under
> > moderate to heavy network load.
> >
> > >Fix:
> > The first panics involved the ix driver so our first idea was to disable
> > the TSO sysctl and LRO on the ix interfaces since those were changed in 7.4
> > (note that we had not manually enabled either). This did have a positive
> > impact in that panics stopped, but there were still errant behaviors. In
> > the first case, traffic routed through the ix trunk was unreliable (about
> > 50% ping), however traffic outside of ix interfaces was fine and local
> > traffic to/from hosts on the ix networks was also fine. In the second case,
> > all traffic appeared normal, but the ix interfaces suddenly changed to no
> > carrier (at about the same frequency as the panics). The no carrier status
> > appeared to be in error as the switches still showed the interfaces as up.
> > The no carrier status also persisted through reboots but would clear on
> > power cycle. The 2 behaviors were seen in the 2 different systems (system 1
> > exhibited the first behavior and system 2 exhibited the second). This was
> > especially confusing because the hardware is identical, and the
> > configuration is as similar as you'd expect in a CARP failover situation.
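
The workaround described under >Fix: (disabling the TSO sysctl and LRO on the ix interfaces) could be sketched roughly as follows. This is an illustrative sketch, not the exact commands from the report: it assumes the ifconfig "-tcplro" option and the net.inet.tcp.tso sysctl as documented for 7.4, and the offload_cmds helper and APPLY variable are hypothetical names used only here. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch of the LRO/TSO workaround from the >Fix: section.
# Assumptions (not taken from the report): ifconfig's "-tcplro" option and
# the net.inet.tcp.tso sysctl, as documented for OpenBSD 7.4.
# offload_cmds and APPLY are hypothetical names for this example only.

offload_cmds() {
    # Emit one command per line: the TSO sysctl first, then one
    # "disable LRO" command for each interface given as an argument.
    echo "sysctl net.inet.tcp.tso=0"
    for ifn in "$@"; do
        echo "ifconfig $ifn -tcplro"
    done
}

# Dry run by default; commands are executed only when APPLY=1 is set.
offload_cmds ix0 ix1 | while read -r cmd; do
    if [ "${APPLY:-0}" = 1 ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
done
```

To keep such settings across reboots, the same "-tcplro" option could presumably be placed in /etc/hostname.ix0 and /etc/hostname.ix1, and the sysctl in /etc/sysctl.conf; that placement is an assumption and was not stated in the report.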