unlock pf_purge
the main change here is to move pf_purge out from under the kernel lock. another part of the change is to limit the amount of work the state purging does to avoid hogging a cpu too much, and to also avoid holding NET_LOCK for too long.

the main part is achieved by using systqmp to run the purge tasks instead of systq. from what i can tell, all the data structures we're walking over are protected by pf locks and/or the net lock. the pf state list in particular was recently reworked so iteration over the list can be done without interfering with insertions. this means we can scan the state list to collect expired states without affecting packet processing.

however, if you have a lot of states (like me) you can still spend a lot of time scanning. i seem to sit at around 1 to 1.2 million states, and i run with the default scan interval of 10 seconds. that means i'm scanning about 100 to 120k states every second, which just takes time. doing the scan while holding the kernel lock means most other stuff cannot run at the same time. worse, the timeout that schedules the pf purge work also seems to schedule something in the softclock thread. if the same timeout also wakes up a process in an unlocked syscall, a significant amount of time in the system is spent spinning for no reason. significant meaning dozens of milliseconds, which is actually noticeable if you're working interactively on the box.

before this change pf purges often took 10 to 50ms. the softclock thread that runs next to it often took a similar amount of time, presumably because they ended up spinning waiting for each other. after this change the pf purges are more like 6 to 12ms, and don't block softclock. most of the variability in the runs now seems to come from contention on the net lock.

the state purge task now checks the SHOULDYIELD flag, so if the task is taking too long it will return early and requeue itself, allowing the taskq machinery to yield and let other threads run.
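the limit-and-requeue pattern described above can be sketched in userspace like this. all the names here (purge_set_states, purge_task_run, purge_pending) are made up for illustration; the real task checks SHOULDYIELD and calls task_add() on itself rather than flipping a boolean:

```c
#include <stdbool.h>

/*
 * userspace sketch of a purge task that limits its work per run and
 * requeues itself while there is still work left in the interval.
 * hypothetical names; the kernel uses the taskq SHOULDYIELD machinery.
 */

static unsigned int states_left;	/* states still to scan this interval */
static bool task_queued;		/* stands in for task_add() */

void
purge_set_states(unsigned int n)	/* the timeout tick sets the budget */
{
	states_left = n;
}

bool
purge_pending(void)			/* has the task requeued itself? */
{
	return task_queued;
}

unsigned int
purge_task_run(unsigned int collect)	/* scan at most "collect" states */
{
	unsigned int scanned = 0;

	while (states_left > 0 && scanned < collect) {
		states_left--;		/* stand-in for expiring one state */
		scanned++;
	}

	/* requeue ourselves if this interval's scan isn't finished yet */
	task_queued = (states_left > 0);
	return scanned;
}
```

with a budget of 150 states and a collect limit of 64, the task runs three times (64, 64, 22) before the interval's scan is done.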
state purging is now also limited to removing 64 states per task run to avoid holding locks that would block packet processing for too long. if there are still states to scan in the interval after these 64 states are collected, the task reschedules itself so it can keep scanning from where it left off. the other purge tasks for source nodes, rules, and fragments have been moved to their own timeout/task pair to simplify the time accounting.

the box feels a lot more responsive with this change. i'm still kicking it around, but i wanted to get it out early.

Index: pf.c
===================================================================
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1132
diff -u -p -r1.1132 pf.c
--- pf.c	23 May 2022 11:17:35 -0000	1.1132
+++ pf.c	7 Jun 2022 06:49:51 -0000
@@ -120,10 +120,6 @@ u_char pf_tcp_secret[16];
 int			 pf_tcp_secret_init;
 int			 pf_tcp_iss_off;
 
-int			 pf_npurge;
-struct task		 pf_purge_task = TASK_INITIALIZER(pf_purge, &pf_npurge);
-struct timeout		 pf_purge_to = TIMEOUT_INITIALIZER(pf_purge_timeout, NULL);
-
 enum pf_test_status {
 	PF_TEST_FAIL = -1,
 	PF_TEST_OK,
@@ -1276,49 +1272,111 @@ pf_purge_expired_rules(void)
 	}
 }
 
-void
-pf_purge_timeout(void *unused)
-{
-	/* XXX move to systqmp to avoid KERNEL_LOCK */
-	task_add(systq, &pf_purge_task);
-}
+void		 pf_purge_states(void *);
+struct task	 pf_purge_states_task =
+		     TASK_INITIALIZER(pf_purge_states, NULL);
+
+void		 pf_purge_states_tick(void *);
+struct timeout	 pf_purge_states_to =
+		     TIMEOUT_INITIALIZER(pf_purge_states_tick, NULL);
+
+unsigned int	 pf_purge_expired_states(unsigned int, unsigned int);
+
+/*
+ * how many states to scan this interval.
+ *
+ * this is set when the timeout fires, and reduced by the task. the
+ * task will reschedule itself until the limit is reduced to zero,
+ * and then it adds the timeout again.
+ */
+unsigned int	 pf_purge_states_limit;
+
+/*
+ * limit how many states are processed with locks held per run of
+ * the state purge task.
+ */
+unsigned int	 pf_purge_states_collect = 64;
 
 void
-pf_purge(void *xnloops)
+pf_purge_states_tick(void *null)
 {
-	int *nloops = xnloops;
+	unsigned int limit = pf_status.states;
+	unsigned int interval = pf_default_rule.timeout[PFTM_INTERVAL];
+
+	if (limit == 0) {
+		timeout_add_sec(&pf_purge_states_to, 1);
+		return;
+	}
 
 	/*
 	 * process a fraction of the state table every second
-	 * Note:
-	 *	we no longer need PF_LOCK() here, because
-	 *	pf_purge_expired_states() uses pf_state_lock to maintain
-	 *	consistency.
 	 */
-	if (pf_default_rule.timeout[PFTM_INTERVAL] > 0)
-		pf_purge_expired_states(1 + (pf_status.states
-		    / pf_default_rule.timeout[PFTM_INTERVAL]));
+	if (in
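for comparison, the per-second scan budget computed by the old pf_purge() being removed in the diff works out like this. scan_per_second is a made-up name for the expression `1 + states / interval`:

```c
/*
 * sketch of the old per-second scan budget:
 * 1 + (pf_status.states / pf_default_rule.timeout[PFTM_INTERVAL]).
 * with ~1 to 1.2 million states and the default 10s interval this is
 * ~100 to 120k states scanned per second, matching the numbers above.
 */
unsigned int
scan_per_second(unsigned int states, unsigned int interval)
{
	if (interval == 0)
		return 0;	/* the old code skipped purging entirely */
	return 1 + states / interval;
}
```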
Can libc funcs be optimized when no return value is needed?
I noticed something when I ran strlcpy in cc, both directly from libc and copied from the source, with and without using the return value: the libc strlcpy didn't change its runtime, but the one copied from source did, dramatically (around a 50% runtime difference over a loop of several runs with 15-20 or so characters), where the compiler obviously optimized stuff out.

Is it possible that there's a way you haven't considered that could permit the compiler to optimize these functions? (Or pre-optimize for the instances with no return value; when both are needed, it could choose between the two, and if there aren't dual needs, only write one into static programs. I don't know if this can be done trivially.) That could improve performance without breaking the security of functions which may or may not need a return value to function correctly. Or is all the extra runtime from returning from the libc wrapper?

--
-Luke
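For what it's worth, here is a self-contained OpenBSD-style strlcpy (renamed local_strlcpy so it doesn't clash with libc). When the definition is visible like this and the return value is unused, the compiler is free to drop the trailing scan of src that exists only to compute the return value; it cannot do that across an opaque libc call boundary, which is consistent with the measurement described above. This is a sketch of the mechanism, not a claim about any particular build:

```c
#include <stddef.h>

/*
 * OpenBSD-style strlcpy, copied in so the compiler can see the body.
 * returns strlen(src); truncation occurred if the result >= dsize.
 */
size_t
local_strlcpy(char *dst, const char *src, size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	/* copy as many bytes as will fit */
	if (nleft != 0) {
		while (--nleft != 0) {
			if ((*dst++ = *src++) == '\0')
				break;
		}
	}

	/* not enough room: NUL-terminate dst, keep scanning src */
	if (nleft == 0) {
		if (dsize != 0)
			*dst = '\0';
		while (*src++)
			;	/* this scan only feeds the return value */
	}

	return (src - osrc - 1);	/* length of src, excluding NUL */
}
```

If a caller ignores the return value, the second while loop is dead code a compiler can eliminate when it can see this definition.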
potential memory leak in if_vinput()
Hello,

I've spotted this glitch while hunting down a use-after-free in the 'veb' packet path. I believe the issue is rather hypothetical; there is no evidence the deemed memory leak ever occurred. Anyway, I believe if_vinput() should always consume the packet: either pass it further up the stack (when the IFXF_MONITOR flag is not set) or just release it.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/if.c b/sys/net/if.c
index f354c9d8a6c..db181586123 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -869,6 +869,8 @@ if_vinput(struct ifnet *ifp, struct mbuf *m)
 
 	if (__predict_true(!ISSET(ifp->if_xflags, IFXF_MONITOR)))
 		(*ifp->if_input)(ifp, m);
+	else
+		m_freem(m);
 }
 
 void
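the ownership rule being fixed here - an input function must consume the mbuf on every path, either by handing it on or by freeing it - can be sketched in userspace like this. struct pkt, monitor_mode and pkt_input() are made-up stand-ins for mbufs, IFXF_MONITOR and if_vinput():

```c
#include <stdbool.h>
#include <stdlib.h>

/* made-up stand-ins for an mbuf and the interface state */
struct pkt { int len; };

bool monitor_mode;		/* stands in for IFXF_MONITOR */
struct pkt *last_delivered;	/* stands in for (*ifp->if_input)() */
int pkts_freed;

/*
 * consume the packet on every path: either hand ownership up the
 * stack, or free it.  before the fix, the monitor case fell through
 * without freeing, leaking the packet.
 */
void
pkt_input(struct pkt *p)
{
	if (!monitor_mode) {
		last_delivered = p;	/* ownership passes up the stack */
	} else {
		free(p);		/* monitor interface: drop, don't leak */
		pkts_freed++;
	}
}
```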
Re: Fix clearing of sleep timeouts
On Mon, Jun 06, 2022 at 06:47:32AM +1000, David Gwynne wrote:
> On Sun, Jun 05, 2022 at 03:57:39PM +0000, Visa Hankala wrote:
> > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote:
> > > On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> > > > Encountered the following panic:
> > > >
> > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0"
> > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373
> > > > Stopped at	db_enter+0x10:	popq	%rbp
> > > >     TID    PID  UID  PRFLAGS     PFLAGS  CPU  COMMAND
> > > >  423109  57118   55      0x3          0    2  link
> > > >  330695  30276   55      0x3          0    3  link
> > > > * 46366  85501   55   0x1003  0x4080400    1  link
> > > >  188803  85501   55   0x1003  0x4082000   0K  link
> > > > db_enter() at db_enter+0x10
> > > > panic(81f25d2b) at panic+0xbf
> > > > __assert(81f9a186,81f372c8,175,81f87c6c) at __assert+0x25
> > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at sleep_setup+0x1d8
> > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at sys_nanosleep+0x12d
> > > > syscall(800022d64f60) at syscall+0x374
> > > >
> > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> > > > soft-interrupt-driven timeouts could be deleted synchronously without
> > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type
> > > > of the timeout.
> > > >
> > > > It looks that with small adjustments timeout_del_barrier() can sleep
> > > > in sleep_finish(). The management of run queues is not affected because
> > > > the timeout clearing happens after it.
> > > > As timeout_del_barrier() does not rely on a timeout or signal
> > > > catching, there should be no risk of unbounded recursion or unwanted
> > > > signal side effects within the sleep machinery. In a way, a sleep
> > > > with a timeout is higher-level than one without.
> > >
> > > I trust you on the analysis. However this looks very fragile to me.
> > >
> > > The use of timeout_del_barrier() which can sleep using the global sleep
> > > queue is worrying me.
> >
> > I think the queue handling ends in sleep_finish() when SCHED_LOCK()
> > is released. The timeout clearing is done outside of it.
>
> That's ok.
>
> > The extra sleeping point inside sleep_finish() is subtle. It should not
> > matter in typical use. But is it permissible with the API? Also, if
> > timeout_del_barrier() sleeps, the thread's priority can change.
>
> What other options do we have at this point? Spin? Allocate the timeout
> dynamically so sleep_finish doesn't have to wait for it and let the
> handler clean up? How would you stop the timeout handler waking up the
> thread if it's gone back to sleep again for some other reason?

In principle, each thread could have a sleep serial number. If the serial
number was somehow associated with the timeout, the handler could bail out
if the thread has moved on. However, implementing that association looks
tricky.

Dynamic allocation could provide a suitable context struct, but memory
allocation in this code would be awful. Maybe there should be an extended
version of timeout_add() that allows controlled updating of the timeout
argument. That does not sound appealing.

Spinning might be an option in the future. The kernel lock complicates
things, however. The spinning wait would have to release the kernel lock
to avoid possible deadlocking.

> Sleeping here is the least worst option.

So it seems.

> As for timeout_del_barrier, if prio is a worry we can provide an
> advanced version of it that lets you pass the prio in.
> I'd also like to change timeout_barrier so it queues the barrier task
> at the head of the pending lists rather than at the tail.

I think it is not important to preserve the priority here. Later
scheduling events will override it anyway.

> > Note that sleep_finish() already can take an additional nap when
> > signal catching is enabled.

> > > > Note that endtsleep() can run and set P_TIMEOUT during
> > > > timeout_del_barrier() when the thread is blocked in cond_wait().
> > > > To avoid unnecessary atomic read-modify-write operations, the clearing
> > > > of P_TIMEOUT could be conditional, but maybe that is an unnecessary
> > > > optimization at this point.
> > >
> > > I agree this optimization seems unnecessary at the moment.
> > >
> > > > While it should be possible to make the code use timeout_del() instead
> > > > of timeout_del_barrier(), the outcome might not be outright better. For
> > > > example, sleep_
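the sleep serial number idea floated above could look roughly like this in userspace. every name here is hypothetical, nothing like this exists in the tree, and the sketch glosses over all the locking that makes the real association tricky:

```c
#include <stdbool.h>

/*
 * sketch of the "sleep serial number" idea: the thread bumps a
 * generation counter every time it goes to sleep, and a late timeout
 * handler bails out if the generation it captured no longer matches.
 */

struct sleeper {
	unsigned int	gen;		/* bumped on every new sleep */
	bool		woken;
};

struct sleep_timeout {
	struct sleeper	*sl;
	unsigned int	gen;		/* generation when the timeout was armed */
};

void
sleep_arm(struct sleeper *sl, struct sleep_timeout *to)
{
	sl->gen++;			/* new sleep, new generation */
	sl->woken = false;
	to->sl = sl;
	to->gen = sl->gen;
}

/* the handler: wake the thread only if the sleep it was armed for is current */
bool
sleep_timeout_fire(struct sleep_timeout *to)
{
	if (to->gen != to->sl->gen)
		return false;		/* thread moved on; bail out */
	to->sl->woken = true;
	return true;
}
```

a stale handler (one armed for an earlier sleep) sees a mismatched generation and does nothing, which is exactly the "how would you stop the timeout handler waking up the thread" problem discussed above.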
Re: relayd panic
> On 6 Jun 2022, at 18:08, Claudio Jeker wrote:
>
> On Mon, Jun 06, 2022 at 12:03:06AM +0200, Alexandr Nedvedicky wrote:
>> Hello,
>>
>> I'll commit one-liner diff on Tuesday morning (Jun 6th) Prague time without
>> explicit OK, unless there will be no objection.
>
> The diff is OK claudio@.

ok by me too.

>
>> regards
>> sashan
>>
>>
>> On Sun, Jun 05, 2022 at 09:44:45AM +0100, Stuart Henderson wrote:
>>> I don't know this code well enough to give a meaningful OK, but this
>>> should probably get committed.
>>>
>>>
>>> On 2022/06/01 09:16, Alexandr Nedvedicky wrote:
>>>> Hello,
>>>>
>>>>> r420-1# rcctl -f start relayd
>>>>> relayd(ok)
>>>>> r420-1# uvm_fault(0xfd862f82f990, 0x0, 0, 1) -> e
>>>>> kernel: page fault trap, code=0
>>>>> Stopped at pf_find_or_create_ruleset+0x1c: movb 0(%rdi),%al
>>>>>     TID    PID  UID    PRFLAGS  PFLAGS  CPU  COMMAND
>>>>>  431388  19003    0        0x2       0    5  relayd
>>>>>  174608  32253   89      0x112       0    2  relayd
>>>>>  395415  12468    0        0x2       0    4  relayd
>>>>>  493579  11904    0        0x2       0    3  relayd
>>>>> *101082  14967   89  0x1100012       0   0K  relayd
>>>>> pf_find_or_create_ruleset(0) at pf_find_or_create_ruleset+0x1c
>>>>> pfr_add_tables(832d7cca800,1,80eaf43c,1000) at pfr_add_tables+0x6ae
>>>>>
>>>>> pfioctl(4900,c450443d,80eaf000,3,80002272e7f0) at pfioctl+0x1d9f
>>>>> VOP_IOCTL(fd8551f82dd0,c450443d,80eaf000,3,fd862f7d60c0,80002272e7f0) at VOP_IOCTL+0x5c
>>>>> vn_ioctl(fd855ecec1e8,c450443d,80eaf000,80002272e7f0) at vn_ioctl+0x75
>>>>> sys_ioctl(80002272e7f0,8000227d9980,8000227d99d0) at sys_ioctl+0x2c4
>>>>> syscall(8000227d9a40) at syscall+0x374
>>>>> Xsyscall() at Xsyscall+0x128
>>>>> end of kernel
>>>>
>>>> it looks like we are dying here at line 239 due to NULL pointer
>>>> dereference:
>>>>
>>>> 232 struct pf_ruleset *
>>>> 233 pf_find_or_create_ruleset(const char *path)
>>>> 234 {
>>>> 235         char *p, *aname, *r;
>>>> 236         struct pf_ruleset *ruleset;
>>>> 237         struct pf_anchor *anchor;
>>>> 238
>>>> 239         if (path[0] == 0)
>>>> 240                 return (&pf_main_ruleset);
>>>> 241
>>>> 242         while (*path == '/')
>>>> 243                 path++;
>>>> 244
>>>>
>>>> I've followed the same steps to reproduce the issue to check if
>>>> diff below resolves the issue.
>>>> The bug has been introduced by my recent change to pf_table.c [1]
>>>> from May 10th:
>>>>
>>>>     Modified files:
>>>>             sys/net: pf_ioctl.c pf_table.c
>>>>
>>>>     Log message:
>>>>     move memory allocations in pfr_add_tables() out of
>>>>     NET_LOCK()/PF_LOCK() scope. bluhm@ helped a lot
>>>>     to put this diff into shape.
>>>>
>>>> besides using a regression test I also did simple testing
>>>> using a 'load anchor':
>>>>
>>>> netlock# cat /tmp/anchor.conf
>>>> load anchor "test" from "/tmp/pf.conf"
>>>> netlock#
>>>> netlock# cat /tmp/pf.conf
>>>> table <try> { 192.168.1.1 }
>>>> pass from <try>
>>>> netlock#
>>>> netlock# pfctl -sA
>>>>   test
>>>> netlock# pfctl -a test -sT
>>>> try
>>>> netlock# pfctl -a test -t try -T show
>>>>    192.168.1.1
>>>>
>>>> OK to commit fix below?
>>>>
>>>> thanks and
>>>> regards
>>>> sashan
>>>>
>>>> [1] https://marc.info/?l=openbsd-cvs&m=16522243003&w=2
>>>>
>>>> 8<---8<---8<--8<
>>>> diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
>>>> index 8315ea5dd3a..dfc49de5efe 100644
>>>> --- a/sys/net/pf_table.c
>>>> +++ b/sys/net/pf_table.c
>>>> @@ -1628,8 +1628,7 @@ pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
>>>>  		if (r != NULL)
>>>>  			continue;
>>>>  
>>>> -		q->pfrkt_rs = pf_find_or_create_ruleset(
>>>> -		    q->pfrkt_root->pfrkt_anchor);
>>>> +		q->pfrkt_rs = pf_find_or_create_ruleset(q->pfrkt_anchor);
>>>>  		/*
>>>>  		 * root tables are attached to main ruleset,
>>>>  		 * because ->pfrkt_anchor[0] == '\0'
>>
>
> --
> :wq Claudio
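a cut-down sketch of why the one-liner matters: for a root table pfrkt_root is NULL, so reaching through it for the anchor path hands a bad pointer to pf_find_or_create_ruleset(), while the table's own anchor ("" for root tables) resolves to the main ruleset. struct ktable and find_ruleset() below are simplified stand-ins, not the real pf structures:

```c
#include <stddef.h>

/* simplified stand-ins for struct pfr_ktable and pf_find_or_create_ruleset() */
struct ktable {
	const char	*anchor;	/* "" for tables on the main ruleset */
	struct ktable	*root;		/* NULL for root tables */
};

const char *
find_ruleset(const char *path)
{
	if (path == NULL)
		return NULL;		/* the kernel page-faults here instead */
	if (path[0] == '\0')
		return "main";		/* root tables attach to the main ruleset */
	return "anchor";
}

/* the fixed lookup: use the table's own anchor, never root->anchor */
const char *
table_ruleset(struct ktable *kt)
{
	return find_ruleset(kt->anchor);
}
```

the buggy form, find_ruleset(kt->root->anchor), dereferences NULL for a root table, which is the uvm_fault in the report above.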
Re: Fix clearing of sleep timeouts
On 06/06/22(Mon) 06:47, David Gwynne wrote:
> On Sun, Jun 05, 2022 at 03:57:39PM +0000, Visa Hankala wrote:
> > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote:
> > > On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> > > > Encountered the following panic:
> > > >
> > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0"
> > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373
> > > > Stopped at	db_enter+0x10:	popq	%rbp
> > > >     TID    PID  UID  PRFLAGS     PFLAGS  CPU  COMMAND
> > > >  423109  57118   55      0x3          0    2  link
> > > >  330695  30276   55      0x3          0    3  link
> > > > * 46366  85501   55   0x1003  0x4080400    1  link
> > > >  188803  85501   55   0x1003  0x4082000   0K  link
> > > > db_enter() at db_enter+0x10
> > > > panic(81f25d2b) at panic+0xbf
> > > > __assert(81f9a186,81f372c8,175,81f87c6c) at __assert+0x25
> > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at sleep_setup+0x1d8
> > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at sys_nanosleep+0x12d
> > > > syscall(800022d64f60) at syscall+0x374
> > > >
> > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> > > > soft-interrupt-driven timeouts could be deleted synchronously without
> > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type
> > > > of the timeout.
> > > >
> > > > It looks that with small adjustments timeout_del_barrier() can sleep
> > > > in sleep_finish(). The management of run queues is not affected because
> > > > the timeout clearing happens after it.
> > > > As timeout_del_barrier() does not rely on a timeout or signal
> > > > catching, there should be no risk of unbounded recursion or unwanted
> > > > signal side effects within the sleep machinery. In a way, a sleep
> > > > with a timeout is higher-level than one without.
> > >
> > > I trust you on the analysis. However this looks very fragile to me.
> > >
> > > The use of timeout_del_barrier() which can sleep using the global sleep
> > > queue is worrying me.
> >
> > I think the queue handling ends in sleep_finish() when SCHED_LOCK()
> > is released. The timeout clearing is done outside of it.
>
> That's ok.
>
> > The extra sleeping point inside sleep_finish() is subtle. It should not
> > matter in typical use. But is it permissible with the API? Also, if
> > timeout_del_barrier() sleeps, the thread's priority can change.
>
> What other options do we have at this point? Spin? Allocate the timeout
> dynamically so sleep_finish doesn't have to wait for it and let the
> handler clean up? How would you stop the timeout handler waking up the
> thread if it's gone back to sleep again for some other reason?
>
> Sleeping here is the least worst option.

I agree. I don't think sleeping is bad here. My concern is about how
sleeping is implemented. There's a single API built on top of a single
global data structure which now calls itself recursively. I'm not sure
how much work it would be to make cond_wait(9) use its own sleep
queue... This is something independent from this fix though.

> As for timeout_del_barrier, if prio is a worry we can provide an
> advanced version of it that lets you pass the prio in. I'd also
> like to change timeout_barrier so it queues the barrier task at the
> head of the pending lists rather than at the tail.

I doubt prio matters.
Re: ix(4): Add support for TCP Large Receive Offloading
On 4.6.2022. 21:23, Hrvoje Popovski wrote:
> On 1.6.2022. 11:21, Jan Klemkow wrote:
>> I moved the switch to ifconfig(8) in the diff below.
>>
>> # ifconfig ix0 tso
>> # ifconfig ix0 -tso
>>
>> I named it tso (TCP segment offloading), so I can reuse this switch
>> also for the sending part. TSO is the combination of LRO and LSO.
>>
>> LRO: Large Receive Offloading
>> LSO: Large Send Offloading
>>
>> RSC (Receive Side Coalescing) is an alternative term for LRO, which is
>> used by the spec of ix(4) NICs.
>> Tests with other ix(4) NICs are welcome and needed!
>>> We'll try and kick it around at work in the next week or so.
>
> Hi all,
>
> I've put this diff in production on clean source from this morning and
> got a panic. I'm not 100% sure if it's because of TSO, because in the
> last months I had all kinds of diffs on production boxes.
> Now I will run a snapshot, maybe a clean snapshot will panic :))

And the box panicked with a snapshot, but with a different message. Will
send a report to bugs@...