Today we completed two weeks with no kernel panics since I applied that patch.
I think it is safe to consider this issue fixed.

Many thanks to all.

On Fri, Nov 25, 2022 at 9:41 AM Josmar Pierri <jcpie...@gmail.com> wrote:
>
> I think that some feedback from me on this issue is appropriate now.
>
> After I applied this patch suggested by Hrvoje, our firewall is
> enduring full traffic for 2 days without any crashing.
>
> FWIW:
> - With 7.2 snapshot #849 it was crashing twice a day.
> - With 7.2 release it was crashing randomly within a week.
> - With 7.1 release it used to crash randomly twice a month.
>
> So far so good!
>
> On Tue, Nov 22, 2022 at 2:53 PM Hrvoje Popovski <hrv...@srce.hr> wrote:
> >
> > On 22.11.2022. 18:48, Josmar Pierri wrote:
> > > I upgraded to 7.2 snapshot #849 early this morning, but it crashed
> > > twice in a few hours.
> > > This time, however, the panic message is different:
> > >
> >
> > Could you compile kernel with this diff
> > https://www.mail-archive.com/tech@openbsd.org/msg72582.html
> >
> > at least for me, that diff makes my firewall stable..
> >
> >
> >
> >
> > > uvm_fault(0xffffffff8236dcb8, 0x17, 0, 2) -> e
> > > kernel: page fault trap, code=0
> > > Stopped at pfsync_q_del+0x96: movq %rdx,0x8(%rax)
> > >     TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
> > >  436110  83038    0  0x14000   0x200    3  softnet
> > >  395295  39926    0  0x14000   0x200    0  softnet
> > >  189958   2208    0  0x14000   0x200    2  softnet
> > > * 65839   5423    0  0x14000   0x200    1  systqmp
> > > pfsync_q_del(fffffd8401d63890) at pfsync_q_del+0x96
> > > pfsync_delete_state(fffffd8401d63890) at pfsync_delete_state+0x118
> > > pf_remove_state(fffffd8401d63890) at pfsync_remove_state+0x14b
> > > pf_purge_expired_states(4031,40) at pf_purge_expired_states+0x242
> > > pf_purge_states(0) at pf_purge_states+0x1c
> > > taskq_thread(ffffffff822a1a10) at taskq_thread+0x100
> > > end trace frame: 0x0, count: 9
> > >
> > > This is all I could manage to get since the crash happened when I was
> > > away (and that stupid Dell console timeout when idle, removing the USB
> > > keyboard)
> > >
> > > I observed a thing that may or may not be related to this issue: The
> > > "output fail" counter keeps steadily increasing both on aggregate and
> > > the two member interfaces:
> > >
> > > :~# netstat -i -I aggr0
> > > Name   Mtu   Network  Address            Ipkts      Ifail  Opkts      Ofail  Colls
> > > aggr0  9200  <Link>   fe:e1:ba:d0:91:13  224426940  0      200785282  357    0
> > >
> > > At first I thought it could be something related to the switches but I
> > > still haven't found anything wrong with them.
> > >
> > >
> > >
> > > On Mon, Nov 21, 2022 at 1:22 PM Hrvoje Popovski <hrv...@srce.hr> wrote:
> > >>
> > >> On 21.11.2022. 16:04, Josmar Pierri wrote:
> > >>> Hi,
> > >>>
> > >>> I managed to get screenshots of a random kernel panic that we are
> > >>> having on a server here.
> > >>> They were taken using a console management tool embedded into the
> > >>> server (Dell IDRAC) and are PNG images of the panic itself, trace of
> > >>> all cpus and ps.
> > >>> I'm not attaching them here right now because I don't know how the
> > >>> list would react to them.
> > >>>
> > >>> I attached the output of:
> > >>> 1 - sendbug -P
> > >>> 2 - dmesg right after reboot
> > >>> 3 - dmesg-boot
> > >>>
> > >>> This server has an aggr0 grouping bnxt0 and bnxt1, both at 10 Gbps.
> > >>> Its task is to load-balance RDP traffic (TCP 3389) among 2 large pools
> > >>> (more than 50 servers on each one) and 3 small ones using pf (tables)
> > >>> for that.
> > >>>
> > >>> These panics happen at random times without an apparent cause.
> > >>>
> > >>> The panic message reads:
> > >>>
> > >>> ddb{3}> show panic
> > >>> *cpu3: kernel diagnostic assertion "st->snapped == 0" failed: file
> > >>> "/usr/src/sys/net/if_pfsync.c", line 1591
> > >>> cpu2: kernel diagnostic assertion "st->snapped == 0" failed: file
> > >>> "/usr/src/sys/net/if_pfsync.c", line 1591
> > >>> cpu1: kernel diagnostic assertion "st->snapped == 0" failed: file
> > >>> "/usr/src/sys/net/if_pfsync.c", line 1591
> > >>> ddb{3}>
> > >>>
> > >>> Please advise how I should proceed to submit the screenshots.
> > >>
> > >> Hi,
> > >>
> > >> I have similar setup with aggr grouping ix0 and ix1 and pfsync. If you
> > >> have two firewalls, can you sysupgrade this one to latest snapshot ?
> > >>
> > >> I'm running snapshot after last hackathon with this diff
> > >> https://www.mail-archive.com/tech@openbsd.org/msg72582.html
> > >>
> > >> and for now firewall seems to work just fine.
> > >>
> > >>
> > >>
> > >
> >
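
The "output fail" observation quoted above (the Ofail column of
netstat -i -I aggr0) could be tracked over time with a small script.
What follows is only a minimal sketch in Python, not something anyone in
the thread ran. It assumes the column layout shown in the quoted output,
with the five counters Ipkts, Ifail, Opkts, Ofail and Colls as the last
fields of each row. The names parse_netstat_i, ofail_delta and SAMPLE are
hypothetical, chosen for illustration.

```python
# Sketch only: assumes the counters are the last five whitespace-separated
# fields of each row, as in the netstat output quoted in this thread.
COUNTERS = ("Ipkts", "Ifail", "Opkts", "Ofail", "Colls")

# Sample text copied from the quoted message (wrapped lines rejoined).
SAMPLE = (
    "Name Mtu Network Address Ipkts Ifail Opkts Ofail Colls\n"
    "aggr0 9200 <Link> fe:e1:ba:d0:91:13 224426940 0 200785282 357 0\n"
)


def parse_netstat_i(text):
    """Return {interface: {counter: int}} for each interface in the text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < len(COUNTERS) + 1 or fields[0] == "Name":
            continue
        # Read counters from the right, so an empty Address column
        # does not shift them.
        tail = fields[-len(COUNTERS):]
        if not all(f.isdigit() for f in tail):
            continue
        # Keep only the first row seen for each interface name.
        stats.setdefault(fields[0], dict(zip(COUNTERS, map(int, tail))))
    return stats


def ofail_delta(before, after, ifname):
    """Return how much Ofail grew for ifname between two parsed snapshots."""
    return after[ifname]["Ofail"] - before[ifname]["Ofail"]


if __name__ == "__main__":
    print(parse_netstat_i(SAMPLE)["aggr0"]["Ofail"])
```

Feeding it two captures taken some minutes apart gives the number of new
output failures in that interval, which could then be compared against
the time of a crash.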