On Thu, Dec 14, 2023 at 11:22:27PM +0000, John Clendenen wrote:
> >Synopsis: Panics occurring about once every 24 hours since 7.4 upgrade.
> Appears to be network related. Disabling LRO on ix devices helps.
> >Category: amd64
> >Environment:
> System : OpenBSD 7.4
> Details : OpenBSD 7.4 (GENERIC.MP) #2: Fri Dec 8 15:39:04 MST 2023
> r...@syspatch-74-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/
> GENERIC.MP
>
> Architecture: OpenBSD.amd64
> Machine : amd64
> >Description:
> We have 2 Supermicro 5018D-FN8T systems serving as HA gateways at a
> colocation. They run a moderately complex network stack including veb,
> trunk, vlan, carp and wireguard interfaces across ix and em hardware. This
> configuration was deployed on 7.2 and upgraded to 7.3 soon after its
> release. The configuration has not changed much since and the systems have
> been very reliable. After upgrading to 7.4, we see kernel panics about once
> a day. We have re-imaged OpenBSD 7.3 on one of the units while we continue
> to test 7.4 on the 2nd. Note that the panics are not consistent. The most
> recent one involves memcpy, pf and wireguard, but previous ones have
> involved the ix driver. This is the first bug report I've had to file for
> OpenBSD, so please forgive my inexperience. I have screenshots of the most
> recent panic and trace. Happy to provide more info as needed.

What combination of em(4), ix(4), veb(4), trunk(4), vlan(4), carp(4) and
wg(4) do you use? Could you provide your /etc/hostname.* files?

> I couldn't determine exactly where to run the objdump per the ddb
> documentation.
>
> >How-To-Repeat:
> Not strictly repeatable but will occur roughly once a day while under
> moderate to heavy network load.
>
> >Fix:
> The first panics involved the ix driver so our first idea was to disable
> the TSO sysctl and LRO on the ix interfaces since those were changed in 7.4
> (note that we had not manually enabled either). This did have a positive
> impact in that panics stopped, but there were still errant behaviors. In
> the first case, traffic routed through the ix trunk was unreliable (about
> 50% ping), however traffic outside of ix interfaces was fine and local
> traffic to/from hosts on the ix networks was also fine. In the second case,
> all traffic appeared normal, but the ix interfaces suddenly changed to no
> carrier (at about the same frequency as the panics). The no carrier status
> appeared to be in error as the switches still showed the interfaces as up.
> The no carrier status also persisted through reboots but would clear on
> power cycle. The 2 behaviors were seen in the 2 different systems (system 1
> exhibited the first behavior and system 2 exhibited the second). This was
> especially confusing because the hardware is identical, and the
> configuration is as similar as you'd expect in a CARP failover situation.