Hi, thanks for responding. It appeared as though my DMARC policy had caused
problems with this mailing list, but I guess not.

Our backup system running 7.4 has not panicked or experienced the ix
failure recently. This may be related to an STP issue we addressed on our
switches that we discovered while testing a trunk failure. I don't know why
that would be the cause, but that is the only thing that has changed with
the network since the panics stopped. That said, since it is in a backup
role, it is not experiencing much traffic other than what it generates
itself.

Our gateways are configured as follows. Note x and y are used where values
differ between the 2 gateways. We have many vlan/carp/wg interfaces, but
they all follow the same configuration patterns. All physical interfaces
are listed below, but for the sake of brevity, I have only included one
example of each pattern for vlan/carp/wg. If necessary, I can provide the
literal config but would prefer to abstract some of the values if possible.

em0 is only used for pxe booting and ipmi failover:

/etc/hostname.em0
  down

em1 is used for pfsync:

/etc/hostname.em1
  up
  inet 169.254.0.x/30

/etc/hostname.pfsync0
  syncdev em1 defer
  up

veb1 is for management interfaces on the switches and one link connects the
2 gateways:

/etc/hostname.veb1
  add vport1
  add em2
  add em3
  add em4
  add em5
  up

/etc/hostname.em2-5
  up

/etc/hostname.vport1
  up
  inet 10.1.1.x/24

/etc/hostname.carp1
  vhid 1 pass $(cat /etc/carp/carp1.pk) carpdev vport1 advskew y
  inet 10.1.1.3/24

All VLANs run over trunk0 as well as a native network:

/etc/hostname.trunk0
  trunkproto failover
  trunkport ix0
  trunkport ix1
  up
  inet 10.1.2.x/24

/etc/hostname.ix0-1
  up

/etc/hostname.carp2
  vhid 2 pass $(cat /etc/carp/carp2.pk) carpdev trunk0 advskew y
  up
  inet 10.1.2.3/24

There are several internal vlans, but they all follow the same pattern
(each with carp):

/etc/hostname.vlan10
  vnetid 10 parent trunk0
  up
  inet 10.1.10.x/24

/etc/hostname.carp10
  vhid 10 pass $(cat /etc/carp/carp10.pk) carpdev vlan10 advskew y
  up
  inet 10.1.10.3/24
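
To spell out how x and y vary between the 2 gateways, here is a minimal sh
sketch that prints the vlan/carp pair for one VLAN. The function name
render_vlan_pair, its arguments and the sample values (host octet 1,
advskew 0) are hypothetical and only for illustration; it simply restates
the pattern above:

```sh
#!/bin/sh
# Hypothetical helper: prints the hostname.vlanN / hostname.carpN pair
# for one VLAN, following the pattern shown above.
#   $1 = VLAN id (also used as vhid and as the third octet)
#   $2 = host octet of this gateway (the "x" above)
#   $3 = advskew of this gateway (the "y" above)
render_vlan_pair() {
	id=$1 host=$2 skew=$3
	printf '# /etc/hostname.vlan%s\n' "$id"
	printf 'vnetid %s parent trunk0\n' "$id"
	printf 'up\n'
	printf 'inet 10.1.%s.%s/24\n' "$id" "$host"
	printf '# /etc/hostname.carp%s\n' "$id"
	# Single quotes keep $(cat ...) literal in the printed line.
	printf 'vhid %s pass $(cat /etc/carp/carp%s.pk) carpdev vlan%s advskew %s\n' \
	    "$id" "$id" "$id" "$skew"
	printf 'up\n'
	printf 'inet 10.1.%s.3/24\n' "$id"
}

# Example values only: VLAN 10, host octet 1, advskew 0.
render_vlan_pair 10 1 0
```

Calling it with a different host octet and advskew gives the other
gateway's files; everything else stays the same.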

WAN is delivered over VLAN as well. The configuration is essentially the
same as the internal VLANs:

/etc/hostname.vlan3
  vnetid 3 parent trunk0
  up
  inet redacted/29

/etc/hostname.carp3
  vhid 3 pass $(cat /etc/carp/carp3.pk) carpdev vlan3 advskew y
  inet redacted/29

This site serves as a remote hub, so there are several wireguard interfaces
which connect other sites and some work-from-home staff. They follow one of
2 configuration patterns.

Sites (note that some sites do not have static IPs, so endpoint is only
specified on the remote side):

/etc/hostname.wg11
  wgkey $(cat /etc/wg/wg11.pk)
  wgpeer 5bsVLcnGnecuioC3KGlwiBiYHhOiGb9x9j37bYFu2SY= wgaip 10.1.11.20/32 wgaip 10.2.0.0/16 wgpka 5
  wgpeer OjxxFGgfl2Wv1lgrdbJ0CuGIaTiRlnNSwvl9x8VhoUs= wgaip 10.1.11.30/32 wgaip 10.3.0.0/16 wgpka 5
  wgpeer /m2pSzXs2m/IQi3FsedqBek6uAf59rnxWW5oNN7wDWc= wgaip 10.1.11.40/32 wgaip 10.4.0.0/16 wgpka 5
  wgport 60011
  up
  inet 10.1.11.1/24
  !route add 10.2.0.0/16 10.1.11.20
  !route add 10.3.0.0/16 10.1.11.30
  !route add 10.4.0.0/16 10.1.11.40
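
The relationship between each site's wgaip entries and its static route can
be sketched in sh as follows. render_site_peer and the PUBKEY_PLACEHOLDER
value are hypothetical names for illustration; the addresses are the
abstracted ones from above:

```sh
#!/bin/sh
# Hypothetical helper: prints the wgpeer line and the matching static
# route for one remote site, following the wg11 pattern above.
#   $1 = peer public key
#   $2 = peer tunnel address (inside 10.1.11.0/24)
#   $3 = LAN prefix behind that peer
render_site_peer() {
	key=$1 tun=$2 lan=$3
	# Both the tunnel address and the site LAN are listed as allowed IPs.
	printf 'wgpeer %s wgaip %s/32 wgaip %s wgpka 5\n' "$key" "$tun" "$lan"
	# The site LAN is routed via the peer's tunnel address.
	printf '!route add %s %s\n' "$lan" "$tun"
}

# Example values only.
render_site_peer PUBKEY_PLACEHOLDER 10.1.11.20 10.2.0.0/16
```

In the actual file the route lines sit after the inet line, as shown above;
the sketch prints them together only to make the pairing visible.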

Work from home staff:

/etc/hostname.wg111
  wgkey $(cat /etc/wg/wg111.pk)
  wgpeer V9odqDEecoJVh8QJk9Erdfj937C16qS3ZcEW3GcLUR8= wgaip 10.1.111.100/32
  wgpeer 9fKKiAdLfrd4IaJ6rWTmmVres/0Me1pSrW1Wee2FYGg= wgaip 10.1.111.101/32
  wgpeer JLtkxpZjpdkyPuBmBEXjFcp4kK8VvtKzRv4fIp1+V10= wgaip 10.1.111.102/32
  wgpeer DmVK49Sv6t8U9th+24Vq+pYqW7rikuWPpB0tguKFHTY= wgaip 10.1.111.103/32
  wgpeer Gy1v60ZhY9MU2l3DxqApQceYrOCCag9YOlBUtoiB/Fk= wgaip 10.1.111.104/32
  wgpeer K67vsSswUOHCF81wnoLrZT6tcUDslcjyDzPxGtWwy3I= wgaip 10.1.111.105/32
  wgpeer f1GmX++jQZvZMd0zXXdG8GxWuK0F7+A1nZc9PJhZfG0= wgaip 10.1.111.106/32
  wgport 60111
  up
  inet 10.1.111.1/24

The switches are Mikrotik. All interfaces are pvid 1 and tagged with the
vlans configured on the gateways except for the WAN uplinks which are
untagged/pvid 3. The STP issue we experienced was that in the event of a
trunk0 failover, all internal VLANs would work but WAN would not. I am not
sure of the mechanics of this failure, but it was remedied by manually
configuring the uplinks as edge ports. Note that since failover trunks in
OpenBSD do not use gratuitous ARP to notify the switches of a failover, we
use ifstated to send a burst of pings whenever either ix interface changes
link state (which works):

  if ix0.link.down || ix0.link.up || ix1.link.down || ix1.link.up {
          run "ping -q -c 10 -w 1 10.1.0.x"
          run "logger -t ifstated a 'link state changed in trunk0'"
  }

A curious detail of this failure was that, while WAN was unreachable,
pinging out from the gateway was not even detected locally with tcpdump on
vlan3. It was as if the VLAN was paralyzed by whatever was going on with
STP on the switches. Again, I have no idea if that is related to the
panics, but it was odd. I can reproduce this if it's helpful.

On Tue, Jan 2, 2024 at 12:45 AM Jan Klemkow <j.klem...@wemelug.de> wrote:

> On Thu, Dec 14, 2023 at 11:22:27PM +0000, John Clendenen wrote:
> > >Synopsis: Panics occurring about once every 24 hours since 7.4 upgrade.
> > Appears to be network related. Disabling LRO on ix devices helps.
> > >Category: amd64
> > >Environment:
> > System      : OpenBSD 7.4
> > Details     : OpenBSD 7.4 (GENERIC.MP) #2: Fri Dec  8 15:39:04 MST 2023
> > r...@syspatch-74-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> > Architecture: OpenBSD.amd64
> > Machine     : amd64
> > >Description:
> > We have 2 Supermicro 5018D-FN8T systems serving as HA gateways at a
> > colocation. They run a moderately complex network stack including veb,
> > trunk, vlan, carp and wireguard interfaces across ix and em hardware.
> > This configuration was deployed on 7.2 and upgraded to 7.3 soon after
> > its release. The configuration has not changed much since and the
> > systems have been very reliable. After upgrading to 7.4, we see kernel
> > panics about once a day. We have re-imaged OpenBSD 7.3 on one of the
> > units while we continue to test 7.4 on the 2nd. Note that the panics are
> > not consistent. The most recent one involves memcpy, pf and wireguard,
> > but previous ones have involved the ix driver. This is the first bug
> > report I've had to file for OpenBSD, so please forgive my inexperience.
> > I have screenshots of the most recent panic and trace. Happy to provide
> > more info as needed.
>
> What combination of em(4), ix(4), veb(4), trunk(4), vlan(4), carp(4) and
> wg(4) do you use?
>
> Could you provide your /etc/hostname.* files?
>
> > I couldn't determine exactly where to run the objdump per the ddb
> > documentation.
> >
> > >How-To-Repeat:
> > Not strictly repeatable but will occur roughly once a day while under
> > moderate to heavy network load.
> >
> > >Fix:
> > The first panics involved the ix driver so our first idea was to disable
> > the TSO sysctl and LRO on the ix interfaces since those were changed in
> > 7.4 (note that we had not manually enabled either). This did have a
> > positive impact in that panics stopped, but there were still errant
> > behaviors. In the first case, traffic routed through the ix trunk was
> > unreliable (about 50% ping), however traffic outside of ix interfaces
> > was fine and local traffic to/from hosts on the ix networks was also
> > fine. In the second case, all traffic appeared normal, but the ix
> > interfaces suddenly changed to no carrier (at about the same frequency
> > as the panics). The no carrier status appeared to be in error as the
> > switches still showed the interfaces as up. The no carrier status also
> > persisted through reboots but would clear on power cycle. The 2
> > behaviors were seen in the 2 different systems (system 1 exhibited the
> > first behavior and system 2 exhibited the second). This was especially
> > confusing because the hardware is identical, and the configuration is as
> > similar as you'd expect in a CARP failover situation.
>
