On Thu, May 06, 2021 at 06:10:39PM +0200, Alexandr Nedvedicky wrote: > Hello, > > </snip> > > > to be honest I have no idea what could be causing problems on those two > > > fairly > > > distinct machines. The strange thing is that pf_test() currently does not > > > run in > > > parallel. I don't quite understand why reverting my earlier change helps > > > here. > > > > it could be two differents ways to trigger a bug somewhere else that > > your commit expose. > > > > the panic doesn't trigger in the same way on both machines: > > - Olivier's machine seems to trigger it quickly (after some minutes) > > - mine relatively slowly (~ once a day) > > Olivier's machine acts as AP, so it forwards packets between interfaces. > > If I remember correctly your machine is laptop/workstation, which > does not forward traffic.
it doesn't forward, but it acts as a bridge: 2 physical networks cards grouped in a bridge(4) with only few traffic (a network printer is on other side, the bridge(4) is here because I had a sparse network card and no physical-switch to put here). > the function, which we change back and forth here is > pf_state_key_link_reverse(), which is being called from pf_find_state() > here: > > 1085 > 1086 if (sk == NULL) { > 1087 if ((sk = RB_FIND(pf_state_tree, &pf_statetbl, > 1088 (struct pf_state_key *)key)) == NULL) > 1089 return (PF_DROP); > 1090 if (pd->dir == PF_OUT && pkt_sk && > 1091 pf_compare_state_keys(pkt_sk, sk, pd->kif, pd->dir) > == 0) > 1092 pf_state_key_link_reverse(sk, pkt_sk); > 1093 else if (pd->dir == PF_OUT && pd->m->m_pkthdr.pf.inp && > 1094 !pd->m->m_pkthdr.pf.inp->inp_pf_sk && !sk->inp) > 1095 pf_state_key_link_inpcb(sk, > pd->m->m_pkthdr.pf.inp); > 1096 } > 1097 > > the story in human words goes as follows: > > sk == NULL -> no matching state key was attached to packet, Thus we > have to search state key in state tree using RB_FIND() > > if we could find state key for packet in table, then we will try > to set up a 'shortcut', which can save us RB_FIND() later. > > 1090 - 1092 > the shortcut can be set up for outbound packet only (pd->dir PF_OUT), > which is also being forwarded (pkt_sk != NULL, indicates we are seeing > the packet for the second time pkt_sk holds state key for inbound > direction). pf_compare_state_keys() is sanity check, it leaves a > debug message on system console on failure. > > So if it is outbound forwarded packet, we've seen earlier, we > set up a reverse link to save one RB_FIND() operation on next > forwarded packet, which matches the same state. > > 1093 - 1095 > creates similar shortcut for local bound packets. We put reference > to state key into PCB linked to socket. This will save us RB_FIND() > operation for next local outbound, which matches the same state. > > > given the bug seems to be triggered/uncovered by pf_state_key_link_reverse() > is there any chance your laptop/workstation occasionally forwards packets? > like doing NAT for vmd/qemu virtual machine? I don't know bridge(4) internals, but it could make sense that it is using such functions. > if it is not the case then the question is how does it come we run > pf_state_key_link_reverse()? which same as why pkt_sk is not NULL at line > 1090. > > > > > I could try to run with your commit and see if I could trigger it more > > easily or found some elements influencing it. I could try with GENERIC > > for example to see if I still trigger the same assert() or if it is > > more like Olivier. > > I need to think of how to further debug the thing. > > > > > my LAN was several hosts with the same kernel and only this machine > > trigger the panic, so it shouldn't be strictly linked to the > > environment. > > Thanks -- Sebastien Marie