Re: panic(9): set panicstr atomically
On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> Hi,
>
> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
>
> https://marc.info/?l=openbsd-tech&m=161908805925325&w=2
>
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
>
> If we set panicstr atomically only one CPU will write panicbuf.
>
> Thoughts?

I've seen panics caused by syzkaller where panicbuf looks scrambled by
more than one thread writing to the same static buffer. Assigning
panicstr before the vsnprintf() call therefore makes sense. This is
also what NetBSD does, although not as an atomic operation.

ok anton@
Re: iked(8): support for intermediate CAs and multiple CERT payloads
I can't test at the moment, but as you asked for comments too: this is
*very* welcome, it's an important missing feature. Thanks!

--
Sent from a phone, apologies for poor formatting.

On 13 May 2021 06:40:49 Katsuhiro Ueno wrote:
> Hi,
>
> I would be happy if iked(8) supported intermediate CAs and sent the
> entire certificate chain to the clients. The attached diff adds
> support for intermediate CAs and multiple CERT payloads to iked(8).
>
> What I would like to do is to use a LetsEncrypt certificate as a
> server certificate for IKEv2 EAP and establish VPN connections with
> Windows clients. However, I could not complete this for the following
> reasons.
>
> * The LetsEncrypt server certificate is issued by an intermediate CA,
>   so the intermediate CA's certificate is needed to check the
>   validity of the server certificate.
> * Windows expects the IKEv2 server to send the intermediate CA's
>   certificate in addition to the server certificate to check the
>   validity.
> * On the other hand, iked(8) is not capable of dealing with
>   certificate chains or sending multiple certificates (multiple CERT
>   payloads) to the clients.
>
> Consequently, Windows fails to verify the certificate and therefore
> the VPN connection cannot be established. To overcome this, I added
> (ad-hoc) support for certificate chains and multiple CERT payloads.
> The attached diff contains the changes that I made. It works fine for
> me but I am not sure whether it works for everyone and everywhere.
> Tests and comments are greatly appreciated.
>
> Many thanks,
> Katsuhiro Ueno
Re: acme-client: use field agnostic {get,set}_affine_coordinates()
I trust you know what you are doing. OK florian

fwiw

On 2021-05-13 07:46 +02, Theo Buehler wrote:
> The _GFp() variants provide no benefit and are just wrappers around the
> variants without _GFp(), so use the latter directly.
>
> Index: acctproc.c
> ===
> RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
> retrieving revision 1.20
> diff -u -p -r1.20 acctproc.c
> --- acctproc.c	17 Jun 2019 15:20:10 -	1.20
> +++ acctproc.c	13 May 2021 05:38:59 -
> @@ -109,9 +109,9 @@ op_thumb_ec(EVP_PKEY *pkey)
>  		warnx("BN_new");
>  	else if ((Y = BN_new()) == NULL)
>  		warnx("BN_new");
> -	else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
> +	else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
>  	    EC_KEY_get0_public_key(ec), X, Y, NULL))
> -		warnx("EC_POINT_get_affine_coordinates_GFp");
> +		warnx("EC_POINT_get_affine_coordinates");
>  	else if ((x = bn2string(X)) == NULL)
>  		warnx("bn2string");
>  	else if ((y = bn2string(Y)) == NULL)
> @@ -237,9 +237,9 @@ op_sign_ec(char **prot, EVP_PKEY *pkey,
>  		warnx("BN_new");
>  	else if ((Y = BN_new()) == NULL)
>  		warnx("BN_new");
> -	else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
> +	else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
>  	    EC_KEY_get0_public_key(ec), X, Y, NULL))
> -		warnx("EC_POINT_get_affine_coordinates_GFp");
> +		warnx("EC_POINT_get_affine_coordinates");
>  	else if ((x = bn2string(X)) == NULL)
>  		warnx("bn2string");
>  	else if ((y = bn2string(Y)) == NULL)

--
I'm not entirely sure you are real.
Re: macppc bsd.mp pmap's hash lock
My last diff (11 May 2021) still has a potential problem with memory
barriers. I will mail a new diff if I think of a fix.

On Tue, 11 May 2021 11:08:55 -0500 Dale Rahn wrote:
> This structure really should be cache-line aligned, which should
> prevent it from spilling across a page boundary.

There is both a struct __ppc_lock and a function __ppc_lock(); my G5
got stuck at boot when the function crossed a page boundary. It might
help to align parts of the function, but if I would do so, I might
need to write the entire function in asm and count instructions.

The potential problem in my last diff is here inside __ppc_lock():

    15eb8c:	7c c0 19 2d	stwcx.	r6,0,r3
    15eb90:	40 c2 ff f0	bne+	15eb80 <__ppc_lock+0x40>
    15eb94:	7c 08 38 40	cmplw	r8,r7
    15eb98:	41 82 00 28	beq-	15ebc0 <__ppc_lock+0x80>
    ...
    15ebc0:	4c 00 01 2c	isync

After "stwcx." grabs the lock, we need a "bc; isync" memory barrier,
which is the "bne+" and "isync" in the above disassembly. A page fault
might act like an "isync", so if we can do the "bne+" before the page
fault, we might be fine. The problem is that, if the "stwcx." and
"bne+" are in different pages (1 in 1024 chance?), then I would get a
page fault before the "bne+", so my code would skip the necessary
memory barrier. I guess that I can fix it by adding another
membar_enter() somewhere.

> The powerpc pmap was originally designed to have 8 'way' locks so
> that only a single way would get locked, thus (as long as one doesn't
> have way more than 8 cores) any core should be able to choose some
> way to replace and get the work done. This also would have allowed
> limited recursion. However at some point those were taken out. Note
> that before locking one of the ways of the hashtable, it would hold
> the lock on the pmap that it was inserting the mapping from (which
> could be the kernel). Not certain if the kernel pmap lock is
> recursive or not.

That's a good history lesson for me; I'm too new.

The pmap lock is "struct mutex pm_mtx" (powerpc/include/pmap.h), so it
isn't recursive. The code in pmap.c pte_spill_r() doesn't hold the
pmap lock if the page is in the direct map. The kernel text, including
the function __ppc_lock(), is in the direct map.

> The point was to allow multiple cores to access different regions of
> the hashtable at the same time (minimal contention). However it would
> also allow a core to reenter the lock_try on a different way where it
> should be able to succeed (as long as the recursion doesn't go 8
> deep?)

If I understand the code, the max recursion is 2 levels. The kernel
can grab the hash lock, get a page fault, then the page fault handler
can do a 2nd grab. The handler is in real mode (no address
translation) and would never cause another page fault and a 3rd grab.
If we would have the 8 locks, then it might deadlock if 8 cpus each
hold a lock and try to grab a 2nd lock (but a macppc has at most 4
cpus: PPC_MAXPROCS in powerpc/include/cpu.h).

--George

> On Tue, May 11, 2021 at 12:42 AM George Koehler wrote:
> > I made a mistake in my last diff by deleting the memory barriers.
> > Here is a diff that keeps membar_enter() and membar_exit().
> >
> > With my last diff, my G5 froze after 15 hours of uptime, and again
> > after 10 hours of uptime. I'm not sure whether the missing memory
> > barriers caused the freeze, but I will be running this diff and
> > hoping for fewer freezes.
> >
> > On Sat, 8 May 2021 18:59:35 +0200 (CEST)
> > Mark Kettenis wrote:
> >
> > > Good find! On powerpc64 I avoid the issue because I guarantee
> > > that the kernel mappings are never evicted from the hash. But
> > > doing so on powerpc would require more serious development. I'm
> > > not sure we really need a ticket lock for this, but since you
> > > already did the work, let's stick with it for now.
> >
> > __ppc_lock (before and after this diff) doesn't give tickets like
> > __mp_lock does, but __ppc_lock and __mp_lock are both recursive
> > locks. I don't know an easy way to avoid the recursion, if some of
> > the kernel mappings might not be in the hash. My __ppc_lock diff
> > should be easy if I don't make mistakes like wrong memory barriers.
> >
> > --George
> >
> > Index: arch/powerpc/include/mplock.h
> > ===
> > RCS file: /cvs/src/sys/arch/powerpc/include/mplock.h,v
> > retrieving revision 1.4
> > diff -u -p -r1.4 mplock.h
> > --- arch/powerpc/include/mplock.h	15 Apr 2020 08:09:00 -	1.4
> > +++ arch/powerpc/include/mplock.h	10 May 2021 23:33:00 -
> > @@ -30,13 +30,13 @@
> >  #define __USE_MI_MPLOCK
> >
> >  /*
> > + * __ppc_lock exists because pte_spill_r() can't use __mp_lock.
> >   * Really simple spinlock implementation with recursive
> >   * capabilities. Correctness is paramount, no fancyness allowed.
> >   */
> >
> >  struct __ppc_lock {
> > -	volatile struct cpu_info *mpl_cpu
acme-client: use field agnostic {get,set}_affine_coordinates()
The _GFp() variants provide no benefit and are just wrappers around the
variants without _GFp(), so use the latter directly.

Index: acctproc.c
===
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.20
diff -u -p -r1.20 acctproc.c
--- acctproc.c	17 Jun 2019 15:20:10 -	1.20
+++ acctproc.c	13 May 2021 05:38:59 -
@@ -109,9 +109,9 @@ op_thumb_ec(EVP_PKEY *pkey)
 		warnx("BN_new");
 	else if ((Y = BN_new()) == NULL)
 		warnx("BN_new");
-	else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
+	else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
 	    EC_KEY_get0_public_key(ec), X, Y, NULL))
-		warnx("EC_POINT_get_affine_coordinates_GFp");
+		warnx("EC_POINT_get_affine_coordinates");
 	else if ((x = bn2string(X)) == NULL)
 		warnx("bn2string");
 	else if ((y = bn2string(Y)) == NULL)
@@ -237,9 +237,9 @@ op_sign_ec(char **prot, EVP_PKEY *pkey,
 		warnx("BN_new");
 	else if ((Y = BN_new()) == NULL)
 		warnx("BN_new");
-	else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
+	else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
 	    EC_KEY_get0_public_key(ec), X, Y, NULL))
-		warnx("EC_POINT_get_affine_coordinates_GFp");
+		warnx("EC_POINT_get_affine_coordinates");
 	else if ((x = bn2string(X)) == NULL)
 		warnx("bn2string");
 	else if ((y = bn2string(Y)) == NULL)
iked(8): support for intermediate CAs and multiple CERT payloads
Hi,

I would be happy if iked(8) supported intermediate CAs and sent the
entire certificate chain to the clients. The attached diff adds
support for intermediate CAs and multiple CERT payloads to iked(8).

What I would like to do is to use a LetsEncrypt certificate as a
server certificate for IKEv2 EAP and establish VPN connections with
Windows clients. However, I could not complete this for the following
reasons.

* The LetsEncrypt server certificate is issued by an intermediate CA,
  so the intermediate CA's certificate is needed to check the validity
  of the server certificate.
* Windows expects the IKEv2 server to send the intermediate CA's
  certificate in addition to the server certificate to check the
  validity.
* On the other hand, iked(8) is not capable of dealing with
  certificate chains or sending multiple certificates (multiple CERT
  payloads) to the clients.

Consequently, Windows fails to verify the certificate and therefore
the VPN connection cannot be established. To overcome this, I added
(ad-hoc) support for certificate chains and multiple CERT payloads.
The attached diff contains the changes that I made. It works fine for
me but I am not sure whether it works for everyone and everywhere.
Tests and comments are greatly appreciated.

Many thanks,
Katsuhiro Ueno

iked.diff
Description: Binary data
isakmpd: simplify calls to {get,set}_affine_coordinates
There never was a real need for a distinction between the GFp and GF2m
variants of EC_POINT_{get,set}_affine_coordinates_{GFp,GF2m}() in
libcrypto. The EC_GROUP method has the correct function pointer set.
Now that we have EC_POINT_{get,set}_affine_coordinates(), let's use
them and avoid the complexity of using the GFp and GF2m variants.

Index: dh.c
===
RCS file: /cvs/src/sbin/isakmpd/dh.c,v
retrieving revision 1.21
diff -u -p -r1.21 dh.c
--- dh.c	8 Nov 2017 13:33:49 -	1.21
+++ dh.c	13 May 2021 05:25:32 -
@@ -535,16 +535,8 @@ ec_point2raw(struct group *group, const
 	if ((ecgroup = EC_KEY_get0_group(group->ec)) == NULL)
 		goto done;
 
-	if (EC_METHOD_get_field_type(EC_GROUP_method_of(ecgroup)) ==
-	    NID_X9_62_prime_field) {
-		if (!EC_POINT_get_affine_coordinates_GFp(ecgroup,
-		    point, x, y, bnctx))
-			goto done;
-	} else {
-		if (!EC_POINT_get_affine_coordinates_GF2m(ecgroup,
-		    point, x, y, bnctx))
-			goto done;
-	}
+	if (!EC_POINT_get_affine_coordinates(ecgroup, point, x, y, bnctx))
+		goto done;
 
 	xoff = xlen - BN_num_bytes(x);
 	bzero(buf, xoff);
@@ -603,16 +595,8 @@ ec_raw2point(struct group *group, u_int8
 	if ((point = EC_POINT_new(ecgroup)) == NULL)
 		goto done;
 
-	if (EC_METHOD_get_field_type(EC_GROUP_method_of(ecgroup)) ==
-	    NID_X9_62_prime_field) {
-		if (!EC_POINT_set_affine_coordinates_GFp(ecgroup,
-		    point, x, y, bnctx))
-			goto done;
-	} else {
-		if (!EC_POINT_set_affine_coordinates_GF2m(ecgroup,
-		    point, x, y, bnctx))
-			goto done;
-	}
+	if (!EC_POINT_set_affine_coordinates(ecgroup, point, x, y, bnctx))
+		goto done;
 
 	ret = 0;
  done:
timeout(9): add TIMEOUT_MPSAFE flag
Hi,

With the removal of the kernel lock from timeout_barrier(9),
softclock() and the timeout thread do not need the kernel lock.
However, many timeouts implicitly rely on the kernel lock.

So to unlock softclock() and the timeout thread I propose adding a new
flag, TIMEOUT_MPSAFE. The flag signifies that a given timeout doesn't
need the kernel lock at all. We'll run all such timeouts without the
kernel lock.

Taking/releasing the kernel lock during timeout_run() on a per-timeout
basis is way too slow. Instead, during softclock() we can aggregate
TIMEOUT_MPSAFE timeouts onto secondary queues and process them all at
once after dropping the kernel lock. This will minimize locking
overhead.

Normal timeouts are put onto timeout_todo_mpsafe and are run during
softclock(). Process timeouts are put onto timeout_proc_mpsafe and run
from the timeout thread.

The funky locking dance in softclock() and softclock_thread() is
needed to keep from violating the locking hierarchy. The timeout_mutex
is above the kernel lock, so we need to leave the timeout_mutex, then
drop the kernel lock, and then reenter the timeout_mutex to start
running TIMEOUT_MPSAFE timeouts.

Thoughts?
Index: kern/kern_timeout.c
===
RCS file: /cvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.84
diff -u -p -r1.84 kern_timeout.c
--- kern/kern_timeout.c	11 May 2021 13:29:25 -	1.84
+++ kern/kern_timeout.c	13 May 2021 05:01:53 -
@@ -74,7 +74,9 @@ struct circq timeout_wheel[BUCKETS];	/*
 struct circq timeout_wheel_kc[BUCKETS];	/* [T] Clock-based timeouts */
 struct circq timeout_new;		/* [T] New, unscheduled timeouts */
 struct circq timeout_todo;		/* [T] Due or needs rescheduling */
+struct circq timeout_todo_mpsafe;	/* [T] Due + no kernel lock */
 struct circq timeout_proc;		/* [T] Due + needs process context */
+struct circq timeout_proc_mpsafe;	/* [T] Due, proc ctx, no kernel lock */
 time_t timeout_level_width[WHEELCOUNT];	/* [I] Wheel level width (seconds) */
 struct timespec tick_ts;		/* [I] Length of a tick (1/hz secs) */
 
@@ -228,7 +230,9 @@ timeout_startup(void)
 
 	CIRCQ_INIT(&timeout_new);
 	CIRCQ_INIT(&timeout_todo);
+	CIRCQ_INIT(&timeout_todo_mpsafe);
 	CIRCQ_INIT(&timeout_proc);
+	CIRCQ_INIT(&timeout_proc_mpsafe);
 	for (b = 0; b < nitems(timeout_wheel); b++)
 		CIRCQ_INIT(&timeout_wheel[b]);
 	for (b = 0; b < nitems(timeout_wheel_kc); b++)
@@ -697,7 +701,13 @@ softclock_process_kclock_timeout(struct
 	if (!new && timespeccmp(&to->to_abstime, &kc->kc_late, <=))
 		tostat.tos_late++;
 	if (ISSET(to->to_flags, TIMEOUT_PROC)) {
-		CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+		if (ISSET(to->to_flags, TIMEOUT_MPSAFE))
+			CIRCQ_INSERT_TAIL(&timeout_proc_mpsafe, &to->to_list);
+		else
+			CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+		return;
+	} else if (ISSET(to->to_flags, TIMEOUT_MPSAFE)) {
+		CIRCQ_INSERT_TAIL(&timeout_todo_mpsafe, &to->to_list);
 		return;
 	}
 	timeout_run(to);
@@ -719,7 +729,13 @@ softclock_process_tick_timeout(struct ti
 	if (!new && delta < 0)
 		tostat.tos_late++;
 	if (ISSET(to->to_flags, TIMEOUT_PROC)) {
-		CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+		if (ISSET(to->to_flags, TIMEOUT_MPSAFE))
+			CIRCQ_INSERT_TAIL(&timeout_proc_mpsafe, &to->to_list);
+		else
+			CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+		return;
+	} else if (ISSET(to->to_flags, TIMEOUT_MPSAFE)) {
+		CIRCQ_INSERT_TAIL(&timeout_todo_mpsafe, &to->to_list);
 		return;
 	}
 	timeout_run(to);
@@ -735,11 +751,14 @@ softclock_process_tick_timeout(struct ti
 void
 softclock(void *arg)
 {
+	struct circq *cq;
 	struct timeout *first_new, *to;
-	int needsproc, new;
+	int needsproc, new, unlocked;
 
 	first_new = NULL;
+	needsproc = 1;
 	new = 0;
+	unlocked = 0;
 
 	mtx_enter(&timeout_mutex);
 	if (!CIRCQ_EMPTY(&timeout_new))
@@ -755,10 +774,27 @@ softclock(void *arg)
 		else
 			softclock_process_tick_timeout(to, new);
 	}
+	if (!CIRCQ_EMPTY(&timeout_todo_mpsafe)) {
+		mtx_leave(&timeout_mutex);
+		KERNEL_UNLOCK();
+		unlocked = 1;
+		mtx_enter(&timeout_mutex);
+		while (!CIRCQ_EMPTY(&timeout_todo_mpsafe)) {
+			cq = CIRCQ_FIRST(&timeout_todo_mpsafe);
+			to = timeout_from_circq(cq);
+			CIRCQ_REMOVE(&to->to_list);
+			timeout_run(to);
+			t
Whitespace fix in sys/scsi/scsi_base.c
Found this while poking around.

diff --git a/sys/scsi/scsi_base.c b/sys/scsi/scsi_base.c
index 2ba6a702fbc..6769449e2fa 100644
--- a/sys/scsi/scsi_base.c
+++ b/sys/scsi/scsi_base.c
@@ -478,7 +478,7 @@ scsi_io_get(struct scsi_iopool *iopl, int flags)
 		return NULL;
 
 	/* otherwise sleep until we get one */
-scsi_ioh_set(&ioh, iopl, scsi_io_get_done, &m);
+	scsi_ioh_set(&ioh, iopl, scsi_io_get_done, &m);
 	scsi_ioh_add(&ioh);
 	scsi_move(&m);
Re: Fix IPsec NAT-T for L2TP/IPsec
On Wed, 12 May 2021 19:11:09 +0900 (JST) YASUOKA Masahiko wrote:
> Radek reported a problem to misc@ that multiple Windows clients
> behind a NAT cannot use a L2TP/IPsec server simultaneously.
>
> https://marc.info/?t=16099681611&r=1&w=2
>
> There are two problems. First, pipex(4) doesn't pass the proper
> ipsecflowinfo to ip_output(). Second, the IPsec policy check which
> is done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is
> not cached. This happens when its flow is shared by another tdb (for
> another client of the same NAT).
>
> The following 2 diffs fix these problems.
>
> comment?
> ok?
>
> diff #1
>
> Fix IPsec NAT-T work with pipex.

The original diff #1 used an m_tag to specify the ipsecflowinfo. I
noticed "ph_cookie" is usable instead of the m_tag. It seems simpler.
Is it better?

Index: sys/net/if_etherip.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_etherip.c,v
retrieving revision 1.48
diff -u -p -r1.48 if_etherip.c
--- sys/net/if_etherip.c	9 Jan 2021 21:00:58 -	1.48
+++ sys/net/if_etherip.c	12 May 2021 23:29:41 -
@@ -547,7 +547,7 @@ ip_etherip_output(struct ifnet *ifp, str
 	etheripstat_pkt(etherips_opackets, etherips_obytes,
 	    m->m_pkthdr.len - (sizeof(struct ip) +
 	    sizeof(struct etherip_header)));
 
-	ip_send(m);
+	ip_send(m, 0);
 
 	return (0);
 }
Index: sys/net/if_gif.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_gif.c,v
retrieving revision 1.132
diff -u -p -r1.132 if_gif.c
--- sys/net/if_gif.c	20 Feb 2021 04:58:29 -	1.132
+++ sys/net/if_gif.c	12 May 2021 23:29:45 -
@@ -340,7 +340,7 @@ gif_send(struct gif_softc *sc, struct mb
 		ip->ip_src = sc->sc_tunnel.t_src4;
 		ip->ip_dst = sc->sc_tunnel.t_dst4;
 
-		ip_send(m);
+		ip_send(m, 0);
 		break;
 	}
 #ifdef INET6
Index: sys/net/if_gre.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_gre.c,v
retrieving revision 1.171
diff -u -p -r1.171 if_gre.c
--- sys/net/if_gre.c	10 Mar 2021 10:21:47 -	1.171
+++ sys/net/if_gre.c	12 May 2021 23:29:52 -
@@ -1999,7 +1999,7 @@ gre_ip_output(const struct gre_tunnel *t
 	switch (tunnel->t_af) {
 	case AF_INET:
-		ip_send(m);
+		ip_send(m, 0);
 		break;
 #ifdef INET6
 	case AF_INET6:
Index: sys/net/pf.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pf.c,v
retrieving revision 1.1116
diff -u -p -r1.1116 pf.c
--- sys/net/pf.c	27 Apr 2021 09:38:29 -	1.1116
+++ sys/net/pf.c	12 May 2021 23:29:56 -
@@ -2896,7 +2896,7 @@ pf_send_tcp(const struct pf_rule *r, sa_
 	switch (af) {
 	case AF_INET:
-		ip_send(m);
+		ip_send(m, 0);
 		break;
 #ifdef INET6
 	case AF_INET6:
Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c	10 Mar 2021 10:21:48 -	1.132
+++ sys/net/pipex.c	12 May 2021 23:31:24 -
@@ -1258,7 +1258,7 @@ pipex_pptp_output(struct mbuf *m0, struc
 	gre->flags = htons(gre->flags);
 
 	m0->m_pkthdr.ph_ifidx = session->ifindex;
-	ip_send(m0);
+	ip_send(m0, 0);
 	if (len > 0) {	/* network layer only */
 		/* countup statistics */
 		session->stat.opackets++;
@@ -1704,7 +1704,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 		ip->ip_tos = 0;
 		ip->ip_off = 0;
 
-		ip_send(m0);
+		ip_send(m0, session->proto.l2tp.ipsecflowinfo);
 		break;
 #ifdef INET6
 	case AF_INET6:
Index: sys/netinet/ip_icmp.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_icmp.c,v
retrieving revision 1.186
diff -u -p -r1.186 ip_icmp.c
--- sys/netinet/ip_icmp.c	30 Mar 2021 08:37:10 -	1.186
+++ sys/netinet/ip_icmp.c	12 May 2021 23:31:57 -
@@ -860,7 +860,7 @@ icmp_send(struct mbuf *m, struct mbuf *o
 		ipstat_inc(ips_localout);
 		ip_send_raw(m);
 	} else
-		ip_send(m);
+		ip_send(m, 0);
 }
 
 u_int32_t
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c	30 Apr 2021 13:52:48 -	1.359
+++ sys/netinet/ip_input.c	12 May 2021 23:29:01 -
@@ -1790,6 +1790,7 @@ ip_send_do_dispatch(void *xmq, int flags
 st
Re: panic(9): set panicstr atomically
Nicer than the garble... It would be nice if we could see all the
panics. Could we also have a per-cpu panic buffer, and then adapt ddb
to show them all?

Scott Cheloha wrote:
> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
>
> https://marc.info/?l=openbsd-tech&m=161908805925325&w=2
>
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
>
> If we set panicstr atomically only one CPU will write panicbuf.
>
> Thoughts?
>
> Index: kern/subr_prf.c
> ===
> RCS file: /cvs/src/sys/kern/subr_prf.c,v
> retrieving revision 1.102
> diff -u -p -r1.102 subr_prf.c
> --- kern/subr_prf.c	28 Nov 2020 17:53:05 -	1.102
> +++ kern/subr_prf.c	13 May 2021 00:04:28 -
> @@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
>   */
>
>  extern	int log_open;	/* subr_log: is /dev/klog open? */
> -const	char *panicstr;	/* arg to first call to panic (used as a flag
> +volatile const char *panicstr;	/* arg to first call to panic (used as a flag
>  			   to indicate that panic has already been called). */
>  const	char *faultstr;	/* page fault string */
>  #ifdef DDB
> @@ -195,12 +195,10 @@ panic(const char *fmt, ...)
>
>  	bootopt = RB_AUTOBOOT | RB_DUMP;
>  	va_start(ap, fmt);
> -	if (panicstr)
> +	if (atomic_cas_ptr(&panicstr, NULL, panicbuf) != NULL)
>  		bootopt |= RB_NOSYNC;
> -	else {
> +	else
>  		vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
> -		panicstr = panicbuf;
> -	}
>  	va_end(ap);
>
>  	printf("panic: ");
> Index: sys/systm.h
> ===
> RCS file: /cvs/src/sys/sys/systm.h,v
> retrieving revision 1.153
> diff -u -p -r1.153 systm.h
> --- sys/systm.h	28 Apr 2021 09:42:04 -	1.153
> +++ sys/systm.h	13 May 2021 00:04:28 -
> @@ -71,7 +71,7 @@
>   * patched by a stalking hacker.
>   */
>  extern int securelevel;		/* system security level */
> -extern const char *panicstr;	/* panic message */
> +extern volatile const char *panicstr;	/* panic message */
>  extern const char *faultstr;	/* fault message */
>  extern const char version[];	/* system version */
>  extern const char copyright[];	/* system copyright */
panic(9): set panicstr atomically
Hi,

In a separate mail thread, bluhm@ mentioned that panic(9) does not
cleanly handle multiple CPUs entering it simultaneously:

https://marc.info/?l=openbsd-tech&m=161908805925325&w=2

I'm unsure which part of panic(9) is causing the problem he mentions,
but one obvious issue I see is that panicstr is not set atomically,
so two CPUs entering panic(9) simultaneously may clobber panicbuf.

If we set panicstr atomically only one CPU will write panicbuf.

Thoughts?

Index: kern/subr_prf.c
===
RCS file: /cvs/src/sys/kern/subr_prf.c,v
retrieving revision 1.102
diff -u -p -r1.102 subr_prf.c
--- kern/subr_prf.c	28 Nov 2020 17:53:05 -	1.102
+++ kern/subr_prf.c	13 May 2021 00:04:28 -
@@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
  */
 
 extern	int log_open;	/* subr_log: is /dev/klog open? */
-const	char *panicstr;	/* arg to first call to panic (used as a flag
+volatile const char *panicstr;	/* arg to first call to panic (used as a flag
 			   to indicate that panic has already been called). */
 const	char *faultstr;	/* page fault string */
 #ifdef DDB
@@ -195,12 +195,10 @@ panic(const char *fmt, ...)
 
 	bootopt = RB_AUTOBOOT | RB_DUMP;
 	va_start(ap, fmt);
-	if (panicstr)
+	if (atomic_cas_ptr(&panicstr, NULL, panicbuf) != NULL)
 		bootopt |= RB_NOSYNC;
-	else {
+	else
 		vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
-		panicstr = panicbuf;
-	}
 	va_end(ap);
 
 	printf("panic: ");
Index: sys/systm.h
===
RCS file: /cvs/src/sys/sys/systm.h,v
retrieving revision 1.153
diff -u -p -r1.153 systm.h
--- sys/systm.h	28 Apr 2021 09:42:04 -	1.153
+++ sys/systm.h	13 May 2021 00:04:28 -
@@ -71,7 +71,7 @@
  * patched by a stalking hacker.
  */
 extern int securelevel;		/* system security level */
-extern const char *panicstr;	/* panic message */
+extern volatile const char *panicstr;	/* panic message */
 extern const char *faultstr;	/* fault message */
 extern const char version[];	/* system version */
 extern const char copyright[];	/* system copyright */
Re: Fix IPsec NAT-T for L2TP/IPsec
ok mvs@

> On 13 May 2021, at 02:43, YASUOKA Masahiko wrote:
>
> On Wed, 12 May 2021 19:15:29 +0300
> Vitaliy Makkoveev wrote:
>>> On 12 May 2021, at 18:42, YASUOKA Masahiko wrote:
>>> On Wed, 12 May 2021 17:26:51 +0300
>>> Vitaliy Makkoveev wrote:
>>>> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
>>>>> Radek reported a problem to misc@ that multiple Windows clients
>>>>> behind a NAT cannot use a L2TP/IPsec server simultaneously.
>>>>>
>>>>> https://marc.info/?t=16099681611&r=1&w=2
>>>>>
>>>>> There are two problems. First, pipex(4) doesn't pass the proper
>>>>> ipsecflowinfo to ip_output(). Second, the IPsec policy check
>>>>> which is done by ipsp_spd_lookup() returns -1 (EINVAL) if the
>>>>> given tdb is not cached. This happens when its flow is shared by
>>>>> another tdb (for another client of the same NAT).
>>>>>
>>>>> The following 2 diffs fix these problems.
>>>>>
>>>>> comment?
>>>>> ok?
>>>>
>>>> Hi. I have two comments for the diff 1:
>>>> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
>>>>    m_tag_get(9).
>>>> 2. You introduced an mbuf(9) leak in pipex_l2tp_output()'s error
>>>>    path. I pointed out the place in your diff.
>>>
>>> Good catch. Thanks.
>>
>> m_freem(9) accepts NULL so this check before is redundant.
>
> Yes,
>
>> It seems to me that "Used by the IPv4 stack to specify the IPsec
>> flow of an output IP packet. The tag contains a u_int32_t
>> identifying the IPsec flow." is enough. Anyway it's better to ask
>> jmc@.
>
> Ok,
>
>> Also I'd like to remove PACKET_TAG_PIPEX with a separate diff.
>
> I removed PACKET_TAG_PIPEX separately.
>
> Let me update the diff.
> Index: sys/net/pipex.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
> retrieving revision 1.132
> diff -u -p -r1.132 pipex.c
> --- sys/net/pipex.c	10 Mar 2021 10:21:48 -	1.132
> +++ sys/net/pipex.c	12 May 2021 23:18:52 -
> @@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  #ifdef INET6
>  	struct ip6_hdr *ip6;
>  #endif
> +	struct m_tag *mtag;
>
>  	hlen = sizeof(struct pipex_l2tp_header) +
>  	    ((pipex_session_is_l2tp_data_sequencing_on(session))
> @@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  		ip->ip_tos = 0;
>  		ip->ip_off = 0;
>
> +		if (session->proto.l2tp.ipsecflowinfo > 0) {
> +			if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
> +			    sizeof(u_int32_t), M_NOWAIT)) == NULL)
> +				goto drop;
> +			*(u_int32_t *)(mtag + 1) =
> +			    session->proto.l2tp.ipsecflowinfo;
> +			m_tag_prepend(m0, mtag);
> +		}
> +
>  		ip_send(m0);
>  		break;
>  #ifdef INET6
> @@ -1733,6 +1743,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
>
>  	return;
>  drop:
> +	m_freem(m0);
>  	session->stat.oerrors++;
>  }
>
> Index: sys/netinet/ip_input.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
> retrieving revision 1.359
> diff -u -p -r1.359 ip_input.c
> --- sys/netinet/ip_input.c	30 Apr 2021 13:52:48 -	1.359
> +++ sys/netinet/ip_input.c	12 May 2021 23:18:52 -
> @@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
>  	struct mbuf_queue *mq = xmq;
>  	struct mbuf *m;
>  	struct mbuf_list ml;
> +	struct m_tag *mtag;
> +	u_int32_t ipsecflowinfo = 0;
>
>  	mq_delist(mq, &ml);
>  	if (ml_empty(&ml))
> @@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
>
>  	NET_LOCK();
>  	while ((m = ml_dequeue(&ml)) != NULL) {
> -		ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
> +		if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
> +		    != NULL) {
> +			ipsecflowinfo = *(u_int32_t *)(mtag + 1);
> +			m_tag_delete(m, mtag);
> +		}
> +		ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
>  	}
>  	NET_UNLOCK();
>  }
> Index: sys/sys/mbuf.h
> ===
> RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
> retrieving revision 1.252
> diff -u -p -r1.252 mbuf.h
> --- sys/sys/mbuf.h	25 Feb 2021 02:43:31 -	1.252
> +++ sys/sys/mbuf.h	12 May 2021 23:18:52 -
> @@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *,
>  /* Packet tag types */
>  #define PACKET_TAG_IPSEC_IN_DONE	0x0001	/* IPsec applied, in */
>  #define PACKET_TAG_IPSEC_OUT_DONE	0x0002	/* IPsec applied, out */
> +#define PACKET_TAG_IPSEC_FLOWINFO	0x0004	/* IPsec flowinfo */
>  #define PACKET_TAG_WIREGUARD		0x0040	/* WireGuard data */
>  #define PACKET_TAG_GRE			0x0080	/* GRE processing
Re: Fix IPsec NAT-T for L2TP/IPsec
On Wed, 12 May 2021 19:15:29 +0300
Vitaliy Makkoveev wrote:
>> On 12 May 2021, at 18:42, YASUOKA Masahiko wrote:
>> On Wed, 12 May 2021 17:26:51 +0300
>> Vitaliy Makkoveev wrote:
>>> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
>>>> Radek reported a problem to misc@ that multiple Windows clients
>>>> behind a NAT cannot use a L2TP/IPsec server simultaneously.
>>>>
>>>> https://marc.info/?t=16099681611&r=1&w=2
>>>>
>>>> There are two problems. First, pipex(4) doesn't pass the proper
>>>> ipsecflowinfo to ip_output(). Second, the IPsec policy check
>>>> which is done by ipsp_spd_lookup() returns -1 (EINVAL) if the
>>>> given tdb is not cached. This happens when its flow is shared by
>>>> another tdb (for another client of the same NAT).
>>>>
>>>> The following 2 diffs fix these problems.
>>>>
>>>> comment?
>>>> ok?
>>>
>>> Hi.
>>>
>>> I have two comments for the diff 1:
>>>
>>> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
>>>    m_tag_get(9).
>>> 2. You introduced an mbuf(9) leak in pipex_l2tp_output()'s error
>>>    path. I pointed out the place in your diff.
>>
>> Good catch. Thanks.
>
> m_freem(9) accepts NULL so this check before is redundant.

Yes,

> It seems to me that "Used by the IPv4 stack to specify the IPsec
> flow of an output IP packet. The tag contains a u_int32_t
> identifying the IPsec flow." is enough. Anyway it's better to ask
> jmc@.

Ok,

> Also I'd like to remove PACKET_TAG_PIPEX with a separate diff.

I removed PACKET_TAG_PIPEX separately.

Let me update the diff.
Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c	10 Mar 2021 10:21:48 -	1.132
+++ sys/net/pipex.c	12 May 2021 23:18:52 -
@@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 #ifdef INET6
 	struct ip6_hdr *ip6;
 #endif
+	struct m_tag *mtag;
 
 	hlen = sizeof(struct pipex_l2tp_header) +
 	    ((pipex_session_is_l2tp_data_sequencing_on(session))
@@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
 	ip->ip_tos = 0;
 	ip->ip_off = 0;
 
+	if (session->proto.l2tp.ipsecflowinfo > 0) {
+		if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
+		    sizeof(u_int32_t), M_NOWAIT)) == NULL)
+			goto drop;
+		*(u_int32_t *)(mtag + 1) =
+		    session->proto.l2tp.ipsecflowinfo;
+		m_tag_prepend(m0, mtag);
+	}
+
 	ip_send(m0);
 	break;
 #ifdef INET6
@@ -1733,6 +1743,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 
 	return;
 drop:
+	m_freem(m0);
 	session->stat.oerrors++;
 }
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c	30 Apr 2021 13:52:48 -	1.359
+++ sys/netinet/ip_input.c	12 May 2021 23:18:52 -
@@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
 	struct mbuf_queue *mq = xmq;
 	struct mbuf *m;
 	struct mbuf_list ml;
+	struct m_tag *mtag;
+	u_int32_t ipsecflowinfo = 0;
 
 	mq_delist(mq, &ml);
 	if (ml_empty(&ml))
@@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
 
 	NET_LOCK();
 	while ((m = ml_dequeue(&ml)) != NULL) {
-		ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
+		if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
+		    != NULL) {
+			ipsecflowinfo = *(u_int32_t *)(mtag + 1);
+			m_tag_delete(m, mtag);
+		}
+		ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
 	}
 	NET_UNLOCK();
 }
Index: sys/sys/mbuf.h
===
RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
retrieving revision 1.252
diff -u -p -r1.252 mbuf.h
--- sys/sys/mbuf.h	25 Feb 2021 02:43:31 -	1.252
+++ sys/sys/mbuf.h	12 May 2021 23:18:52 -
@@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 /* Packet tag types */
 #define PACKET_TAG_IPSEC_IN_DONE	0x0001  /* IPsec applied, in */
 #define PACKET_TAG_IPSEC_OUT_DONE	0x0002  /* IPsec applied, out */
+#define PACKET_TAG_IPSEC_FLOWINFO	0x0004  /* IPsec flowinfo */
 #define PACKET_TAG_WIREGUARD		0x0040  /* WireGuard data */
 #define PACKET_TAG_GRE			0x0080  /* GRE processing done */
 #define PACKET_TAG_DLT			0x0100  /* data link layer type */
@@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 #define PACKET_TAG_CARP_BAL_IP		0x4000  /*
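For readers unfamiliar with the mbuf-tag idiom used in the diff above: m_tag_get(9) allocates a tag whose payload is stored directly after the struct m_tag header, which is why the patch writes and later reads the flowinfo through `*(u_int32_t *)(mtag + 1)`. Below is a minimal userspace sketch of that memory layout; `struct tag` and `tag_get()` are stand-ins invented for illustration, not the kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace stand-in for the kernel's struct m_tag: the payload
 * is allocated immediately after the header, so "tag + 1" points
 * at it.  Hypothetical type, for illustration only.
 */
struct tag {
	uint16_t	type;
	uint16_t	len;
};

/* Allocate a tag with "len" bytes of payload following the header. */
static struct tag *
tag_get(uint16_t type, uint16_t len)
{
	struct tag *t = malloc(sizeof(*t) + len);

	if (t == NULL)
		return NULL;
	t->type = type;
	t->len = len;
	return t;
}
```

The producer stores a value with `*(uint32_t *)(t + 1) = v;` and a consumer that finds the tag reads it back the same way, exactly as pipex_l2tp_output() and ip_send_do_dispatch() do in the patch.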
Re: running network stack forwarding in parallel
It seems this lock order issue is not parallel diff specific.

> On 12 May 2021, at 12:58, Hrvoje Popovski wrote:
>
> On 21.4.2021. 21:36, Alexander Bluhm wrote:
>> We need more MP pressure to find such bugs and races.  I think now
>> is a good time to give this diff broader testing and commit it.
>> You need interfaces with multiple queues to see a difference.
>
> Hi,
>
> while forwarding ip4 traffic over box with parallel diff and aggr
> interfaces, and then aggr is destroyed, i'm getting witness log below...
>
> i can't reproduce this log without parallel diff and i'm getting this
> log only first time destroying aggr interface ..
>
>
> witness: lock order reversal:
> 1st 0x82139de0 netlock (netlock)
> 2nd 0x82120bb8 timeout (timeout)
> lock order "timeout"(rwlock) -> "netlock"(rwlock) first seen at:
> #0 rw_enter_write+0x43
> #1 mld6_fasttimeo+0x14
> #2 pffasttimo+0x67
> #3 timeout_run+0x93
> #4 softclock_thread+0x11d
> #5 proc_trampoline+0x1c
> lock order "netlock"(rwlock) -> "timeout"(rwlock) first seen at:
> #0 timeout_del_barrier+0x41
> #1 aggr_p_dtor+0x17b
> #2 aggr_clone_destroy+0x91
> #3 if_clone_destroy+0xd8
> #4 ifioctl+0x1d2
> #5 soo_ioctl+0x171
> #6 sys_ioctl+0x2c4
> #7 syscall+0x3b9
> #8 Xsyscall+0x128
>
>
> r620-1# cat /etc/hostname.aggr0
> trunkport ix0
> lladdr ec:f4:bb:da:f7:f8
> inet 192.168.42.1 255.255.255.0
> !route add 16/8 192.168.42.11
> up
>
> r620-1# cat /etc/hostname.aggr1
> trunkport ix1
> inet 192.168.43.1 255.255.255.0
> !route add 48/8 192.168.43.11
> up
>
>
> OpenBSD 6.9-current (GENERIC.MP) #166: Wed May 12 10:18:11 CEST 2021
>    hrv...@r620-1.srce.hr:/sys/arch/amd64/compile/GENERIC.MP
> real mem = 17115840512 (16322MB)
> avail mem = 16450007040 (15687MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xcf42c000 (99 entries)
> bios0: vendor Dell Inc. version "2.9.0" date 12/06/2019
> bios0: Dell Inc.
PowerEdge R620 > acpi0 at bios0: ACPI 3.0 > acpi0: sleep states S0 S4 S5 > acpi0: tables DSDT FACP APIC SPCR HPET DMAR MCFG WDAT SLIC ERST HEST > BERT EINJ TCPA PC__ SRAT SSDT > acpi0: wakeup devices PCI0(S5) > acpitimer0 at acpi0: 3579545 Hz, 24 bits > acpimadt0 at acpi0 addr 0xfee0: PC-AT compat > cpu0 at mainbus0: apid 4 (boot processor) > cpu0: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.44 MHz, 06-3e-04 > cpu0: > FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN > cpu0: 256KB 64b/line 8-way L2 cache > cpu0: smt 0, core 2, package 0 > mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges > cpu0: apic clock running at 99MHz > cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE > cpu1 at mainbus0: apid 6 (application processor) > cpu1: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.04 MHz, 06-3e-04 > cpu1: > FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN > cpu1: 256KB 64b/line 8-way L2 cache > cpu1: smt 0, core 3, package 0 > cpu2 at mainbus0: apid 8 (application processor) > cpu2: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04 > cpu2: > 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN > cpu2: 256KB 64b/line 8-way L2 cache > cpu2: smt 0, core 4, package 0 > cpu3 at mainbus0: apid 16 (application processor) > cpu3: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04 > cpu3: > FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN > cpu3: 256KB 64b/line 8-way L2 cache > cpu3: smt 0, core 8, package 0 > cpu4 at mainbus0: apid 18 (application processor) > cpu4: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04 > cpu4: > FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SE
Re: patch: new fix for vmctl create
James Cook writes:
> Hi tech@,
>
> The below patch removes calls to realpath(3) when looking up a qcow2
> base image.  Previous thread:
> https://marc.info/?t=16156249642&r=1&w=2
>
> In short, the calls were failing inside vmctl, because of unveil.  The
> other thread has alternative solutions but I think this is simplest.
>
> I included a regression test demonstrating the vmctl bug, in case
> there's interest.  I tested vmd manually as described in the other
> thread.
>
> I also added a check in case dirname(3) fails --- I don't think it
> currently can, but better safe than sorry, I figure.  (Noticed by Dave
> in the other thread.)
>
> - James
>

mlarkin@ and I got around to reviewing this today.  After some
discussion, the decision was to simply remove the usage of unveil in
the vmctl disk creation/conversion commands.  I just committed that
change, so it should be in cvs mirrors and snaps soon.

Thanks again for raising the issue and working with me through ideas.
In the end deletion looked like a better option than modification :-)
(We were unveiling user supplied inputs...and then unveiling more user
supplied inputs in the form of the base image(s)...not the best usage
of unveil.)

-dv
Re: patch: new fix for vmctl create
On Mon, Mar 15, 2021 at 08:21:56AM +, James Cook wrote:
> Hi tech@,
>
> The below patch removes calls to realpath(3) when looking up a qcow2
> base image.  Previous thread:
> https://marc.info/?t=16156249642&r=1&w=2
>
> In short, the calls were failing inside vmctl, because of unveil.  The
> other thread has alternative solutions but I think this is simplest.
>
> I included a regression test demonstrating the vmctl bug, in case
> there's interest.  I tested vmd manually as described in the other
> thread.
>
> I also added a check in case dirname(3) fails --- I don't think it
> currently can, but better safe than sorry, I figure.  (Noticed by Dave
> in the other thread.)
>
> - James
>

After looking at this a bit, we decided to remove the unveil parts
around the base images, since the realpath removal below would also
affect vmd.  dv@ just committed that.

Thanks for the diff and research!

> diff --git a/regress/usr.sbin/Makefile b/regress/usr.sbin/Makefile
> index 60e2178d3c7..146f9c9f322 100644
> --- a/regress/usr.sbin/Makefile
> +++ b/regress/usr.sbin/Makefile
> @@ -15,6 +15,7 @@ SUBDIR += rpki-client
>  SUBDIR += snmpd
>  SUBDIR += switchd
>  SUBDIR += syslogd
> +SUBDIR += vmctl
>
>  .if ${MACHINE} == "amd64" || ${MACHINE} == "i386"
>  SUBDIR += vmd
> diff --git a/regress/usr.sbin/vmctl/Makefile b/regress/usr.sbin/vmctl/Makefile
> new file mode 100644
> index 000..8fa87f0f6f0
> --- /dev/null
> +++ b/regress/usr.sbin/vmctl/Makefile
> @@ -0,0 +1,34 @@
> +# $OpenBSD$
> +
> +REGRESS_TARGETS = run-regress-convert-with-base-path
> +
> +run-regress-convert-with-base-path:
> +	# non-relative base path
> +	rm -f *.qcow2
> +	vmctl create -s 1m base.qcow2
> +	vmctl create -b ${PWD}/base.qcow2 source.qcow2
> +	vmctl create -i source.qcow2 dest.qcow2
> +
> +	# relative base path; two base images
> +	rm -f *.qcow2
> +	vmctl create -s 1m base0.qcow2
> +	vmctl create -b base0.qcow2 base1.qcow2
> +	vmctl create -b base1.qcow2 source.qcow2
> +	vmctl create -i source.qcow2 dest.qcow2
> +
> +	# copy from a different directory
> +	rm -rf dir *.qcow2
> +	vmctl create -s 1m base.qcow2
> +	vmctl create -b base.qcow2 source.qcow2
> +	mkdir dir
> +	cd dir; vmctl create -i ../source.qcow2 dest.qcow2
> +
> +	# base accessed through symlink
> +	rm -rf dir sym *.qcow2
> +	mkdir dir
> +	cd dir; vmctl create -s 1m base.qcow2
> +	cd dir; vmctl create -b base.qcow2 source.qcow2
> +	ln -s dir sym
> +	vmctl create -i sym/source.qcow2 dest.qcow2
> +
> +.include
> diff --git a/usr.sbin/vmd/vioqcow2.c b/usr.sbin/vmd/vioqcow2.c
> index 34d0f116cc4..be8609f1644 100644
> --- a/usr.sbin/vmd/vioqcow2.c
> +++ b/usr.sbin/vmd/vioqcow2.c
> @@ -145,8 +145,8 @@ virtio_qcow2_init(struct virtio_backing *file, off_t *szp, int *fd, size_t nfd)
>  ssize_t
>  virtio_qcow2_get_base(int fd, char *path, size_t npath, const char *dpath)
>  {
> +	char pathbuf[PATH_MAX];
>  	char dpathbuf[PATH_MAX];
> -	char expanded[PATH_MAX];
>  	struct qcheader header;
>  	uint64_t backingoff;
>  	uint32_t backingsz;
> @@ -180,27 +180,23 @@ virtio_qcow2_get_base(int fd, char *path, size_t npath, const char *dpath)
>  	 * rather than relative to the directory vmd happens to be running in,
>  	 * since this is the only userful interpretation.
>  	 */
> -	if (path[0] == '/') {
> -		if (realpath(path, expanded) == NULL ||
> -		    strlcpy(path, expanded, npath) >= npath) {
> -			log_warnx("unable to resolve %s", path);
> +	if (path[0] != '/') {
> +		if (strlcpy(pathbuf, path, sizeof(pathbuf)) >=
> +		    sizeof(pathbuf)) {
> +			log_warnx("path too long: %s", path);
>  			return -1;
>  		}
> -	} else {
>  		if (strlcpy(dpathbuf, dpath, sizeof(dpathbuf)) >=
>  		    sizeof(dpathbuf)) {
>  			log_warnx("path too long: %s", dpath);
>  			return -1;
>  		}
> -		s = dirname(dpathbuf);
> -		if (snprintf(expanded, sizeof(expanded),
> -		    "%s/%s", s, path) >= (int)sizeof(expanded)) {
> -			log_warnx("path too long: %s/%s", s, path);
> +		if ((s = dirname(dpathbuf)) == NULL) {
> +			log_warn("dirname");
>  			return -1;
>  		}
> -		if (npath < PATH_MAX ||
> -		    realpath(expanded, path) == NULL) {
> -			log_warnx("unable to resolve %s", path);
> +		if (snprintf(path, npath, "%s/%s", s, pathbuf) >= (int)npath) {
> +			log_warnx("path too long: %s/%s", s, path);
>  			return -1;
>  		}
>  	}
>
Re: setitimer(2): don't round up it_value
Hi Scott,

Thanks for this work!  I see significant improvement with my test code
(see below; obviously focussing quite specifically on one thing).

Before your diff (snapshot from a few days ago; OpenBSD 6.9-current
(GENERIC.MP) #1: Mon May 3 11:04:25 MDT 2021) I got:

[weerd@pom] $ ./measure
Running for 100 loops

timeout 800000us min/avg/max/std-dev: 800054/809727.250/810115/1609.611 us
timeout 400000us min/avg/max/std-dev: 409668/41.406/410362/121.568 us
timeout 200000us min/avg/max/std-dev: 209421/209997.953/210553/97.505 us
timeout 100000us min/avg/max/std-dev: 109818/109997.078/110167/39.038 us
timeout 50000us  min/avg/max/std-dev: 59887/59995.539/60108/27.651 us
timeout 25000us  min/avg/max/std-dev: 29648/29993.961/30316/50.629 us
timeout 12500us  min/avg/max/std-dev: 19910/19994.100/20098/24.265 us
timeout 6250us   min/avg/max/std-dev: 19932/19994.020/20067/20.760 us
timeout 3125us   min/avg/max/std-dev: 19806/19993.980/20221/33.121 us
timeout 1562us   min/avg/max/std-dev: 19519/19993.971/20470/71.215 us
timeout 781us    min/avg/max/std-dev: 19778/19993.980/20214/35.454 us
timeout 390us    min/avg/max/std-dev: 19784/19994.029/20188/32.626 us
timeout 195us    min/avg/max/std-dev: 19891/19994.230/20096/23.237 us
timeout 97us     min/avg/max/std-dev: 19938/19994.170/20044/17.468 us
timeout 48us     min/avg/max/std-dev: 19876/19994.010/20115/26.334 us
timeout 24us     min/avg/max/std-dev: 19535/19994.199/20458/67.628 us
timeout 12us     min/avg/max/std-dev: 19941/19994.240/20079/18.478 us
timeout 6us      min/avg/max/std-dev: 19949/19994.150/20051/16.916 us
timeout 3us      min/avg/max/std-dev: 19880/19994.221/20109/23.377 us
timeout 1us      min/avg/max/std-dev: 19940/19994.289/20053/17.575 us

[I should add that these numbers are from an otherwise mostly idle
machine; they improved when my machine was busy building a new kernel]

After upgrading my local snap, cvs-up'ing, patching in your diff and
building + rebooting into a new kernel:

[weerd@pom] $ ./measure
Running for 100 loops

timeout 800000us min/avg/max/std-dev:
800041/808010.312/810021/2992.553 us
timeout 400000us min/avg/max/std-dev: 401639/402001.281/402373/-nan us
timeout 200000us min/avg/max/std-dev: 200959/200997.922/201041/-nan us
timeout 100000us min/avg/max/std-dev: 100456/100495.977/100533/37.185 us
timeout 50000us  min/avg/max/std-dev: 50141/50244.809/50351/30.089 us
timeout 25000us  min/avg/max/std-dev: 28917/30144.590/31396/175.702 us
timeout 12500us  min/avg/max/std-dev: 20048/20094.420/20143/10.583 us
timeout 6250us   min/avg/max/std-dev: 10016/10044.260/10084/9.683 us
timeout 3125us   min/avg/max/std-dev: 9974/10044.060/10140/14.810 us
timeout 1562us   min/avg/max/std-dev: 10016/10044.190/10078/6.827 us
timeout 781us    min/avg/max/std-dev: 9964/10044.140/10143/14.516 us
timeout 390us    min/avg/max/std-dev: 9578/10044.150/10507/66.234 us
timeout 195us    min/avg/max/std-dev: 9981/10044.170/10122/16.570 us
timeout 97us     min/avg/max/std-dev: 10010/10044.150/10117/10.721 us
timeout 48us     min/avg/max/std-dev: 10004/10044.140/10083/10.331 us
timeout 24us     min/avg/max/std-dev: 10016/10044.220/10078/6.487 us
timeout 12us     min/avg/max/std-dev: 10014/10044.210/10079/6.802 us
timeout 6us      min/avg/max/std-dev: 10013/10044.300/10085/9.871 us
timeout 3us      min/avg/max/std-dev: 10007/10044.030/10090/8.477 us
timeout 1us      min/avg/max/std-dev: 9980/10044.350/10135/13.301 us

(obviously some bug in the std-dev calculation there with the -nan)

Thanks again,

Paul

--- measure.c
#include <sys/time.h>
#include <math.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LOOPCOUNT 100

volatile sig_atomic_t trigger;

void
sighdlr(int signum)
{
	trigger = signum;
}

int
main()
{
	int long long	count, d, sum, min, max, tsumsq;
	long		timeout;
	struct timeval	str, end;
	struct itimerval	new_timer;
	double		avg, dev;

	new_timer.it_interval.tv_sec = 0;
	new_timer.it_interval.tv_usec = 0;
	new_timer.it_value.tv_sec = 0;

	signal(SIGALRM, sighdlr);

	printf("Running for %d loops\n\n", LOOPCOUNT);

	for (timeout = 800000; timeout != 0; timeout /= 2) {
		new_timer.it_value.tv_usec = timeout;
		min = 10000000;
		max = 0;
		sum = 0;
		tsumsq = 0;
		for (count = 0; count != LOOPCOUNT; count++) {
			trigger = 0;
			setitimer(ITIMER_REAL, &new_timer, NULL);
			gettimeofday(&str, NULL);
			while (trigger == 0)
				pause();
			gettimeofday(&end, NULL);
			d = (end.tv_sec - str.tv_sec) * 1000000 + \
			    end.tv_usec - str.tv_usec;
setitimer(2): don't round up it_value
Hi,

Paul de Weerd mentioned off-list that the initial expiration for an
ITIMER_REAL timer is always at least one tick.  I looked into it and
yes, this is the case, because the kernel rounds it_value up to one
tick if it is non-zero.

After thinking about it a bit I don't think we should do this
rounding.  At least, not for the initial expiration.

Rounding the it_interval member of an itimerval value up to one tick
makes sense: if we don't round it up we might spin for a long time in
realitexpire() and itimerdecr() when we reload the timer.  However
there is no such risk with the it_value member.  We only use it once
and then it gets clobbered.

Currently the rounding is done in itimerfix(), which takes a timeval
pointer as argument.  Given that itimerfix() is used nowhere else in
the kernel I think the easiest thing to do here is to rewrite
itimerfix() to take an itimerval pointer as argument and have it do
all input validation and normalization for setitimer(2) in one go:

- Validate it_value; return EINVAL if it is invalid.
- Validate it_interval; return EINVAL if it is invalid.
- Clear it_interval if it_value is unset.
- Round it_interval up if necessary.

The 100 million second upper bound for it_value and it_interval is
arbitrary and will probably change in the future, so I have isolated
that check from the others.

While we're changing the itimerfix() prototype we may as well pull it
out of sys/time.h.  As I said before, it isn't used anywhere else.

OK?
Index: sys/time.h
===
RCS file: /cvs/src/sys/sys/time.h,v
retrieving revision 1.58
diff -u -p -r1.58 time.h
--- sys/time.h	13 Jan 2021 16:28:50 -	1.58
+++ sys/time.h	12 May 2021 17:06:30 -
@@ -307,7 +307,6 @@ struct proc;
 int	clock_gettime(struct proc *, clockid_t, struct timespec *);
 
 void	cancel_all_itimers(void);
-int	itimerfix(struct timeval *);
 int	itimerdecr(struct itimerspec *, long);
 int	settime(const struct timespec *);
 int	ratecheck(struct timeval *, const struct timeval *);
Index: kern/kern_time.c
===
RCS file: /cvs/src/sys/kern/kern_time.c,v
retrieving revision 1.151
diff -u -p -r1.151 kern_time.c
--- kern/kern_time.c	23 Dec 2020 20:45:02 -	1.151
+++ kern/kern_time.c	12 May 2021 17:06:30 -
@@ -52,6 +52,8 @@
 
 #include
 
+int itimerfix(struct itimerval *);
+
 /*
  * Time of day and interval timer support.
  *
@@ -628,10 +630,9 @@ sys_setitimer(struct proc *p, void *v, r
 		error = copyin(SCARG(uap, itv), &aitv, sizeof(aitv));
 		if (error)
 			return error;
-		if (itimerfix(&aitv.it_value) || itimerfix(&aitv.it_interval))
-			return EINVAL;
-		if (!timerisset(&aitv.it_value))
-			timerclear(&aitv.it_interval);
+		error = itimerfix(&aitv);
+		if (error)
+			return error;
 		newitvp = &aitv;
 	}
 	if (SCARG(uap, oitv) != NULL) {
@@ -701,21 +702,34 @@ out:
 }
 
 /*
- * Check that a proposed value to load into the .it_value or
- * .it_interval part of an interval timer is acceptable.
+ * Check if the given setitimer(2) timer is valid.  Clear it_interval
+ * if it_value is unset.  Round it_interval up to the minimum interval
+ * if necessary.
 */
 int
-itimerfix(struct timeval *tv)
+itimerfix(struct itimerval *itv)
 {
+	struct timeval min_interval = { .tv_sec = 0, .tv_usec = tick };
 
-	if (tv->tv_sec < 0 || tv->tv_sec > 100000000 ||
-	    tv->tv_usec < 0 || tv->tv_usec >= 1000000)
-		return (EINVAL);
+	if (itv->it_value.tv_sec < 0 || !timerisvalid(&itv->it_value))
+		return EINVAL;
+	if (itv->it_value.tv_sec > 100000000)
+		return EINVAL;
 
-	if (tv->tv_sec == 0 && tv->tv_usec != 0 && tv->tv_usec < tick)
-		tv->tv_usec = tick;
+	if (itv->it_interval.tv_sec < 0 || !timerisvalid(&itv->it_interval))
+		return EINVAL;
+	if (itv->it_interval.tv_sec > 100000000)
+		return EINVAL;
 
-	return (0);
+	if (!timerisset(&itv->it_value))
+		timerclear(&itv->it_interval);
+
+	if (timerisset(&itv->it_interval)) {
+		if (timercmp(&itv->it_interval, &min_interval, <))
+			itv->it_interval = min_interval;
+	}
+
+	return 0;
 }
 
 /*
bgpd strict community negotiation
RFC 5492 is fairly explicit about when a capability should be enabled
on a session:

   A BGP speaker that supports a particular capability may use this
   capability with its peer after the speaker determines (as described
   above) that the peer supports this capability.  Simply put, a given
   capability can be used on a peering if that capability has been
   advertised by both peers.  If either peer has not advertised it, the
   capability cannot be used.

Adjust capa_neg_calc() to follow this strict model.  This affects
route refresh and graceful restart.  For graceful restart this
requires flushing the RIBs immediately.  Also ignore and warn about
RREFRESH messages that are received on a session where route refresh
is disabled.

-- 
:wq Claudio

Index: session.c
===
RCS file: /cvs/src/usr.sbin/bgpd/session.c,v
retrieving revision 1.414
diff -u -p -r1.414 session.c
--- session.c	6 May 2021 09:18:54 -	1.414
+++ session.c	12 May 2021 16:33:42 -
@@ -1636,7 +1636,7 @@ session_neighbor_rrefresh(struct peer *p
 {
 	u_int8_t	i;
 
-	if (!p->capa.peer.refresh)
+	if (!p->capa.neg.refresh)
 		return (-1);
 
 	for (i = 0; i < AID_MAX; i++) {
@@ -2257,6 +2257,11 @@ parse_refresh(struct peer *peer)
 		return (0);
 	}
 
+	if (!peer->capa.neg.refresh) {
+		log_peer_warnx(&peer->conf, "peer sent unexpected refresh");
+		return (0);
+	}
+
 	if (imsg_rde(IMSG_REFRESH, peer->conf.id, &aid, sizeof(aid)) == -1)
 		return (-1);
 
@@ -2546,16 +2551,15 @@ capa_neg_calc(struct peer *p)
 {
 	u_int8_t	i, hasmp = 0;
 
-	/* refresh: does not realy matter here, use peer setting */
-	p->capa.neg.refresh = p->capa.peer.refresh;
+	/* a capability is accepted only if both sides announced it */
 
-	/* as4byte: both side must announce capability */
-	if (p->capa.ann.as4byte && p->capa.peer.as4byte)
-		p->capa.neg.as4byte = 1;
-	else
-		p->capa.neg.as4byte = 0;
+	p->capa.neg.refresh =
+	    (p->capa.ann.refresh && p->capa.peer.refresh) != 0;
+
+	p->capa.neg.as4byte =
+	    (p->capa.ann.as4byte && p->capa.peer.as4byte) != 0;
 
-	/* MP: both side must announce capability */
+	/* MP: both side must agree on the AFI,SAFI pair */
 	for (i = 0; i < AID_MAX; i++) {
 		if (p->capa.ann.mp[i] && p->capa.peer.mp[i])
 			p->capa.neg.mp[i] = 1;
@@ -2569,18 +2573,21 @@ capa_neg_calc(struct peer *p)
 		p->capa.neg.mp[AID_INET] = 1;
 
 	/*
-	 * graceful restart: only the peer capabilities are of interest here.
+	 * graceful restart: the peer capabilities are of interest here.
 	 * It is necessary to compare the new values with the previous ones
 	 * and act acordingly. AFI/SAFI that are not part in the MP capability
 	 * are treated as not being present.
+	 * Also make sure that a flush happens if the session stopped
+	 * supporting graceful restart.
 	 */
 	for (i = 0; i < AID_MAX; i++) {
 		int8_t	negflags;
 
 		/* disable GR if the AFI/SAFI is not present */
-		if (p->capa.peer.grestart.flags[i] & CAPA_GR_PRESENT &&
-		    p->capa.neg.mp[i] == 0)
+		if (p->capa.ann.grestart.restart == 0 ||
+		    (p->capa.peer.grestart.flags[i] & CAPA_GR_PRESENT &&
+		    p->capa.neg.mp[i] == 0))
 			p->capa.peer.grestart.flags[i] = 0;	/* disable */
 		/* look at current GR state and decide what to do */
 		negflags = p->capa.neg.grestart.flags[i];
@@ -2600,6 +2607,8 @@ capa_neg_calc(struct peer *p)
 	}
 	p->capa.neg.grestart.timeout = p->capa.peer.grestart.timeout;
 	p->capa.neg.grestart.restart = p->capa.peer.grestart.restart;
+	if (p->capa.ann.grestart.restart == 0)
+		p->capa.neg.grestart.restart = 0;
 
 	return (0);
 }
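The rule the diff enforces, that a capability is negotiated only when both we announced it and the peer announced it, reduces to a logical AND per capability. A toy sketch of that semantics follows; the `struct capa` fields are hypothetical simplifications, not bgpd's actual structures.

```c
#include <assert.h>

/* Hypothetical per-capability flags, not bgpd's real structs. */
struct capa {
	int	refresh;
	int	as4byte;
};

/*
 * RFC 5492: a capability may be used only if both peers advertised
 * it.  The negotiated state is therefore the AND of our announced
 * state and the peer's announced state; neither side's setting
 * alone is enough.
 */
static struct capa
capa_negotiate(const struct capa *ann, const struct capa *peer)
{
	struct capa neg;

	neg.refresh = (ann->refresh && peer->refresh) != 0;
	neg.as4byte = (ann->as4byte && peer->as4byte) != 0;
	return neg;
}
```

The old bgpd code effectively used only `peer->refresh`, so a locally disabled capability could still be acted on; the AND makes the strict interpretation explicit.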
Re: Fix IPsec NAT-T for L2TP/IPsec
> On 12 May 2021, at 18:42, YASUOKA Masahiko wrote:
>
> On Wed, 12 May 2021 17:26:51 +0300
> Vitaliy Makkoveev wrote:
>> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
>>> Radek reported a problem to misc@ that multiple Windows clients
>>> behind a NAT cannot use a L2TP/IPsec server simultaneously.
>>>
>>> https://marc.info/?t=16099681611&r=1&w=2
>>>
>>> There are two problems.  First is pipex(4) doesn't pass the proper
>>> ipsecflowinfo to ip_output().  Second is the IPsec policy check
>>> which is done by ipsp_spd_lookup() returns -1 (EINVAL) if the given
>>> tdb is not cached.  This happens when its flow is shared by another
>>> tdb (for another client of the same NAT).
>>>
>>> The following 2 diffs fix these problems.
>>>
>>> comment?
>>> ok?
>>>
>>
>> Hi.
>>
>> I have two comments for the diff 1:
>>
>> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
>>    m_tag_get(9).
>> 2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
>>    pointed the place in your diff.
>
> Good catch. Thanks.
>

m_freem(9) accepts NULL so this check before is redundant.

It seems to me that "Used by the IPv4 stack to specify the IPsec flow
of an output IP packet. The tag contains a u_int32_t identifying the
IPsec flow.” is enough.  Anyway it’s better to ask jmc@.

Also I like to remove PACKET_TAG_PIPEX with separate diff.

The rest of this diff looks ok by me.

>
> Let me update the diff.
> Index: sys/net/pipex.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
> retrieving revision 1.132
> diff -u -p -r1.132 pipex.c
> --- sys/net/pipex.c	10 Mar 2021 10:21:48 -	1.132
> +++ sys/net/pipex.c	12 May 2021 15:33:33 -
> @@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  #ifdef INET6
>  	struct ip6_hdr *ip6;
>  #endif
> +	struct m_tag *mtag;
>
>  	hlen = sizeof(struct pipex_l2tp_header) +
>  	    ((pipex_session_is_l2tp_data_sequencing_on(session))
> @@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  	ip->ip_tos = 0;
>  	ip->ip_off = 0;
>
> +	if (session->proto.l2tp.ipsecflowinfo > 0) {
> +		if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
> +		    sizeof(u_int32_t), M_NOWAIT)) == NULL)
> +			goto drop;
> +		*(u_int32_t *)(mtag + 1) =
> +		    session->proto.l2tp.ipsecflowinfo;
> +		m_tag_prepend(m0, mtag);
> +	}
> +
>  	ip_send(m0);
>  	break;
>  #ifdef INET6
> @@ -1733,6 +1743,8 @@ pipex_l2tp_output(struct mbuf *m0, struc
>
>  	return;
>  drop:
> +	if (m0 != NULL)
> +		m_freem(m0);
>  	session->stat.oerrors++;
>  }
>
> Index: sys/netinet/ip_input.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
> retrieving revision 1.359
> diff -u -p -r1.359 ip_input.c
> --- sys/netinet/ip_input.c	30 Apr 2021 13:52:48 -	1.359
> +++ sys/netinet/ip_input.c	12 May 2021 15:31:52 -
> @@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
>  	struct mbuf_queue *mq = xmq;
>  	struct mbuf *m;
>  	struct mbuf_list ml;
> +	struct m_tag *mtag;
> +	u_int32_t ipsecflowinfo = 0;
>
>  	mq_delist(mq, &ml);
>  	if (ml_empty(&ml))
> @@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
>
>  	NET_LOCK();
>  	while ((m = ml_dequeue(&ml)) != NULL) {
> -		ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
> +		if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
> +		    != NULL) {
> +			ipsecflowinfo = *(u_int32_t *)(mtag + 1);
> +			m_tag_delete(m, mtag);
> +		}
> +		ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
>  	}
>  	NET_UNLOCK();
>  }
> Index: sys/sys/mbuf.h
> ===
> RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
> retrieving revision 1.252
> diff -u -p -r1.252 mbuf.h
> --- sys/sys/mbuf.h	25 Feb 2021 02:43:31 -	1.252
> +++ sys/sys/mbuf.h	12 May 2021 15:31:52 -
> @@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *,
>  /* Packet tag types */
>  #define PACKET_TAG_IPSEC_IN_DONE	0x0001 /* IPsec applied, in */
>  #define PACKET_TAG_IPSEC_OUT_DONE	0x0002 /* IPsec applied, out */
> +#define PACKET_TAG_IPSEC_FLOWINFO	0x0004 /* IPsec flowinfo */
>  #define PACKET_TAG_WIREGUARD		0x0040 /* WireGuard data */
>  #define PACKET_TAG_GRE		0x0080 /* GRE processing done */
>  #define PACKET_TAG_DLT		0x0100 /* data link layer type */
> @@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *,
>  #define PACKET_TAG_CARP_BAL_IP	0
Re: Fix IPsec NAT-T for L2TP/IPsec
On Wed, 12 May 2021 17:26:51 +0300
Vitaliy Makkoveev wrote:
> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
>> Radek reported a problem to misc@ that multiple Windows clients
>> behind a NAT cannot use a L2TP/IPsec server simultaneously.
>>
>> https://marc.info/?t=16099681611&r=1&w=2
>>
>> There are two problems.  First is pipex(4) doesn't pass the proper
>> ipsecflowinfo to ip_output().  Second is the IPsec policy check
>> which is done by ipsp_spd_lookup() returns -1 (EINVAL) if the given
>> tdb is not cached.  This happens when its flow is shared by another
>> tdb (for another client of the same NAT).
>>
>> The following 2 diffs fix these problems.
>>
>> comment?
>> ok?
>>
>
> Hi.
>
> I have two comments for the diff 1:
>
> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
>    m_tag_get(9).
> 2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
>    pointed the place in your diff.

Good catch.  Thanks.

Let me update the diff.

Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c	10 Mar 2021 10:21:48 -	1.132
+++ sys/net/pipex.c	12 May 2021 15:33:33 -
@@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 #ifdef INET6
 	struct ip6_hdr *ip6;
 #endif
+	struct m_tag *mtag;
 
 	hlen = sizeof(struct pipex_l2tp_header) +
 	    ((pipex_session_is_l2tp_data_sequencing_on(session))
@@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
 	ip->ip_tos = 0;
 	ip->ip_off = 0;
 
+	if (session->proto.l2tp.ipsecflowinfo > 0) {
+		if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
+		    sizeof(u_int32_t), M_NOWAIT)) == NULL)
+			goto drop;
+		*(u_int32_t *)(mtag + 1) =
+		    session->proto.l2tp.ipsecflowinfo;
+		m_tag_prepend(m0, mtag);
+	}
+
 	ip_send(m0);
 	break;
 #ifdef INET6
@@ -1733,6 +1743,8 @@ pipex_l2tp_output(struct mbuf *m0, struc
 
 	return;
 drop:
+	if (m0 != NULL)
+		m_freem(m0);
 	session->stat.oerrors++;
 }
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c	30 Apr 2021 13:52:48 -	1.359
+++ sys/netinet/ip_input.c	12 May 2021 15:31:52 -
@@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
 	struct mbuf_queue *mq = xmq;
 	struct mbuf *m;
 	struct mbuf_list ml;
+	struct m_tag *mtag;
+	u_int32_t ipsecflowinfo = 0;
 
 	mq_delist(mq, &ml);
 	if (ml_empty(&ml))
@@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
 
 	NET_LOCK();
 	while ((m = ml_dequeue(&ml)) != NULL) {
-		ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
+		if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
+		    != NULL) {
+			ipsecflowinfo = *(u_int32_t *)(mtag + 1);
+			m_tag_delete(m, mtag);
+		}
+		ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
 	}
 	NET_UNLOCK();
 }
Index: sys/sys/mbuf.h
===
RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
retrieving revision 1.252
diff -u -p -r1.252 mbuf.h
--- sys/sys/mbuf.h	25 Feb 2021 02:43:31 -	1.252
+++ sys/sys/mbuf.h	12 May 2021 15:31:52 -
@@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 /* Packet tag types */
 #define PACKET_TAG_IPSEC_IN_DONE	0x0001 /* IPsec applied, in */
 #define PACKET_TAG_IPSEC_OUT_DONE	0x0002 /* IPsec applied, out */
+#define PACKET_TAG_IPSEC_FLOWINFO	0x0004 /* IPsec flowinfo */
 #define PACKET_TAG_WIREGUARD		0x0040 /* WireGuard data */
 #define PACKET_TAG_GRE		0x0080 /* GRE processing done */
 #define PACKET_TAG_DLT		0x0100 /* data link layer type */
@@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 #define PACKET_TAG_CARP_BAL_IP	0x4000 /* carp(4) ip balanced marker */
 
 #define MTAG_BITS \
-("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_IN_CRYPTO_DONE" \
+("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_FLOWINFO" \
 "\4IPSEC_OUT_CRYPTO_NEEDED\5IPSEC_PENDING_TDB\6BRIDGE\7WG\10GRE\11DLT" \
 "\12PF_DIVERT\14PF_REASSEMBLED\15SRCROUTE\16TUNNEL\17CARP_BAL_IP")
Index: share/man/man9/mbuf_tags.9
===
RCS file: /disk/cvs/openbsd/src/share/man/man9/mbuf_tags.9,v
retrieving revision 1.41
diff -u -p -r
Re: Fix IPsec NAT-T for L2TP/IPsec
On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
> Hi,
>
> Radek reported a problem to misc@ that multiple Windows clients
> behind a NAT cannot use a L2TP/IPsec server simultaneously.
>
> https://marc.info/?t=16099681611&r=1&w=2
>
> There are two problems.  First is pipex(4) doesn't pass the proper
> ipsecflowinfo to ip_output().  Second is the IPsec policy check
> which is done by ipsp_spd_lookup() returns -1 (EINVAL) if the given
> tdb is not cached.  This happens when its flow is shared by another
> tdb (for another client of the same NAT).
>
> The following 2 diffs fix these problems.
>
> comment?
> ok?
>

Hi.

I have two comments for the diff 1:

1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
   m_tag_get(9).
2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
   pointed the place in your diff.

I'll see diff 2 later.

> diff #1
>
> Fix IPsec NAT-T work with pipex.
>
> Index: sys/net/pipex.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
> retrieving revision 1.132
> diff -u -p -r1.132 pipex.c
> --- sys/net/pipex.c	10 Mar 2021 10:21:48 -	1.132
> +++ sys/net/pipex.c	12 May 2021 09:38:32 -
> @@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  #ifdef INET6
>  	struct ip6_hdr *ip6;
>  #endif
> +	struct m_tag *mtag;
>
>  	hlen = sizeof(struct pipex_l2tp_header) +
>  	    ((pipex_session_is_l2tp_data_sequencing_on(session))
> @@ -1703,6 +1704,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  	ip->ip_ttl = MAXTTL;
>  	ip->ip_tos = 0;
>  	ip->ip_off = 0;
> +
> +	if (session->proto.l2tp.ipsecflowinfo > 0) {
> +		if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
> +		    sizeof(u_int32_t), M_NOWAIT)) == NULL)
> +			goto drop;

mbuf(9) will leak here.

> +		*(u_int32_t *)(mtag + 1) =
> +		    session->proto.l2tp.ipsecflowinfo;
> +		m_tag_prepend(m0, mtag);
> +	}
>
>  	ip_send(m0);
>  	break;
Re: [Diff] Implement multiple device cloning for hotplug
On Wed, May 12, 2021 at 10:03:13AM -0400, Ashton Fagg wrote: > joshua stein writes: > > > I'm glad I could inspire you to repost the work I already did years > > ago. > > I'm not sure if you're being sarcastic. > > > But either way, if a driver is causing a panic because it responds > > before it is ready, the same thing would happen without hotplug if > > it was communicated with at the wrong time (like some script running > > in a loop). Those drivers should just not respond to ioctls or > > whatever they are doing before they are ready rather than just > > hoping that no one sees their attachment too early. > > So the issue I have with iscsid is actually different to this because > it's entirely in userland. > > To fix it in a robust way, you need to know when a device has attached > and if it's ready or not, but there's extra requirements for iscsid that > would require you to match those /dev/sd* devices back to the configured > targets. > > At a minimum, we would need to figure out (after device attachment): > > a) which scsibus it has shown up on > > b) if the scsibus it has appeared on is indeed the one you'd expect it > to be on (/dev/vscsi0 will have the same scsibus as devices initiated > from iscsid, as far as I can tell) > > c) does the target/lun info match an initiated target with a > successful session established. > > d) then indicate whether the disk device is actually ready > > There's no panic in this case, the "iscsictl reload" call in the init > script returns before the devices are up and ready since we're only > looking at connection completion right now. Only returning once we know > there's a ready device corresponding to every connected target we expect > to have one avoids the problem where fsck tries to do it's thing on > devices that aren't ready. > > Matching back to the iscsi targets avoids some brittleness and edge > cases that could be encountered if we were just looking for generic scsi > disk attachments/readiness. 
> > c) is fairly easy with the diff that I submitted (and was merged) a > while ago, since there is book-keeping code inside iscsid to work > this out that would be very easy to extend. > > So the hotplug idea would solve part of this, but not all of it. There's > still parts of this that are not easy and aren't clear to me how to > fully resolve. For iscsid I think the better approach would be to extend vscsi(4) to allow the system to know when a device was attached. Now this does not solve all problems with iscsid since iscsid has no knowledge about which disks are presented by the initiator and I don't think iscsid should know about this either. -- :wq Claudio
Re: [Diff] Implement multiple device cloning for hotplug
joshua stein writes: > I'm glad I could inspire you to repost the work I already did years > ago. I'm not sure if you're being sarcastic. > But either way, if a driver is causing a panic because it responds > before it is ready, the same thing would happen without hotplug if > it was communicated with at the wrong time (like some script running > in a loop). Those drivers should just not respond to ioctls or > whatever they are doing before they are ready rather than just > hoping that no one sees their attachment too early. So the issue I have with iscsid is actually different to this because it's entirely in userland. To fix it in a robust way, you need to know when a device has attached and if it's ready or not, but there's extra requirements for iscsid that would require you to match those /dev/sd* devices back to the configured targets. At a minimum, we would need to figure out (after device attachment): a) which scsibus it has shown up on b) if the scsibus it has appeared on is indeed the one you'd expect it to be on (/dev/vscsi0 will have the same scsibus as devices initiated from iscsid, as far as I can tell) c) does the target/lun info match an initiated target with a successful session established. d) then indicate whether the disk device is actually ready There's no panic in this case, the "iscsictl reload" call in the init script returns before the devices are up and ready since we're only looking at connection completion right now. Only returning once we know there's a ready device corresponding to every connected target we expect to have one avoids the problem where fsck tries to do it's thing on devices that aren't ready. Matching back to the iscsi targets avoids some brittleness and edge cases that could be encountered if we were just looking for generic scsi disk attachments/readiness. 
c) is fairly easy with the diff that I submitted (and was merged) a while ago, since there is book-keeping code inside iscsid to work this out that would be very easy to extend. So the hotplug idea would solve part of this, but not all of it. There's still parts of this that are not easy and aren't clear to me how to fully resolve.
hvn(4): don't input mbufs if interface is not running
Hi,

when hvn(4) attaches it sends commands and waits for replies to come back in, hence the interrupt function is being polled. Unfortunately it seems that the 'receive pipe' has both command completion and data packets. As it turns out, while hvn(4) is just setting up the pipes, it can already receive packets, which I have seen happening on Hyper-V. This essentially means that if_input() is being called *before* the card is set up (or UP). This seems wrong. Apparently on drivers like em(4) we only read packets if IFF_RUNNING is set. I think in the case of hvn(4), we should drop packets unless IFF_RUNNING is set.

Opinions?

Patrick

diff --git a/sys/dev/pv/if_hvn.c b/sys/dev/pv/if_hvn.c
index f12e2f935ca..4306f717baf 100644
--- a/sys/dev/pv/if_hvn.c
+++ b/sys/dev/pv/if_hvn.c
@@ -1470,7 +1470,10 @@
 	}
 
 	hvn_nvs_ack(sc, tid);
-	if_input(ifp, &ml);
+	if (ifp->if_flags & IFF_RUNNING)
+		if_input(ifp, &ml);
+	else
+		ml_purge(&ml);
 }
 
 static inline struct mbuf *
Re: [Diff] Implement multiple device cloning for hotplug
On Tue, 11 May 2021 at 22:37:55 -0400, Ashton Fagg wrote: > Inspiration for this diff was drawn from Joshua Stein [1], seeminly he > had a use-case that was somewhat similar to mine at one point but it > doesn't look like this was ever committed. Nor can I find any discussion > on it. I'm glad I could inspire you to repost the work I already did years ago. On Tue, 11 May 2021 at 20:53:17 -0600, Theo de Raadt wrote: > Unfortunately there is no instrumentation in drivers or subsystems to > capture "ready to do work", and create the events at that point instead. > hotplug is hooked into subr_autoconf.c, this code is for handling the > device heirarchy, but it is not involved in operation, and thus unaware > of "readiness". Fixing this would require a subsystem by subsystem study, > figuring out where the glue should exist, and creating the events at > the CORRECT time. When this same objection was raised in 2018 when I proposed this, I said that I looked through the bugs archives and could only find references to hotplug-related issues 8 years prior, and they were considered solved: https://marc.info/?l=openbsd-bugs&m=127990199813627&w=2 I also proposed this diff, which individual drivers could use to defer hotplug notification from their attach routine if they needed to do extra work before becoming "ready". No one spoke up of any drivers needing such functionality, and all of this work was ignored since you and Mark wanted the Xorg input device notifications to come through wscons instead of hotplug. But either way, if a driver is causing a panic because it responds before it is ready, the same thing would happen without hotplug if it was communicated with at the wrong time (like some script running in a loop). Those drivers should just not respond to ioctls or whatever they are doing before they are ready rather than just hoping that no one sees their attachment too early. 
diff --git sys/dev/hotplug.c sys/dev/hotplug.c index e1e7bad95c9..434c43f4135 100644 --- sys/dev/hotplug.c +++ sys/dev/hotplug.c @@ -73,6 +73,12 @@ hotplug_device_attach(enum devclass class, char *name) hotplug_put_event(&he); } +void +hotplug_defer_attach(struct device *dev) +{ + dev->dv_flags |= DVF_HOTPLUG_DEFER; +} + void hotplug_device_detach(enum devclass class, char *name) { diff --git sys/kern/subr_autoconf.c sys/kern/subr_autoconf.c index 13fafc6994e..e09f11486c0 100644 --- sys/kern/subr_autoconf.c +++ sys/kern/subr_autoconf.c @@ -403,7 +403,7 @@ config_attach(struct device *parent, void *match, void *aux, cfprint_t print) (*ca->ca_attach)(parent, dev, aux); config_process_deferred_children(dev); #if NHOTPLUG > 0 - if (!cold) + if (!cold && !(dev->dv_flags & DVF_HOTPLUG_DEFER)) hotplug_device_attach(cd->cd_class, dev->dv_xname); #endif diff --git sys/sys/device.h sys/sys/device.h index 90827f53503..b73b248d040 100644 --- sys/sys/device.h +++ sys/sys/device.h @@ -81,7 +81,8 @@ struct device { }; /* dv_flags */ -#defineDVF_ACTIVE 0x0001 /* device is activated */ +#defineDVF_ACTIVE 0x0001 /* device is activated */ +#defineDVF_HOTPLUG_DEFER 0x0002 /* defer hotplug announce */ TAILQ_HEAD(devicelist, device); diff --git sys/sys/hotplug.h sys/sys/hotplug.h index f2da3f6cfd8..7e68482a428 100644 --- sys/sys/hotplug.h +++ sys/sys/hotplug.h @@ -35,6 +35,7 @@ struct hotplug_event { #ifdef _KERNEL void hotplug_device_attach(enum devclass, char *); void hotplug_device_detach(enum devclass, char *); +void hotplug_defer_attach(struct device *); #endif #endif /* _SYS_HOTPLUG_H_ */
Re: nvme: add timeout to nvme_poll() loop
Ping? Ashton Fagg writes: > I noticed this when looking through the nvme.c code the other day. > > Currently, nvme_poll() has a loop like this: > > while (!ISSET(state.c.flags, htole16(NVME_CQE_PHASE))) { > if (nvme_q_complete(sc, q) == 0) > delay(10); > > /* XXX no timeout? */ > } > > The comment is indicative - there probably should be a timeout here in > case things are terribly wrong and the queue never gets fully processed. > > NetBSD appears to have a similar construct in their version of the > driver: > > https://github.com/NetBSD/src/blob/trunk/sys/dev/ic/nvme.c > > The following diff adds a counter to fail the poll after 10 seconds of > spinning. I've been torturing my test machine (which has an nvme boot > drive) with this applied - no issues so far. > > AFAICT, returning anything non-zero should be ok as it looks like > nothing specifically checks the return code for anything other being > zero. But, I may be wrong and perhaps one of the various NVME_CQE_* > masks is more suitable. > > Thanks. diff --git a/sys/dev/ic/nvme.c b/sys/dev/ic/nvme.c index 9a79c8b1bfe..4e112f482ee 100644 --- a/sys/dev/ic/nvme.c +++ b/sys/dev/ic/nvme.c @@ -117,6 +117,11 @@ void nvme_scsi_inquiry(struct scsi_xfer *); void nvme_scsi_capacity16(struct scsi_xfer *); void nvme_scsi_capacity(struct scsi_xfer *); +/* Here, we define a delay of 10 microseconds between poll attempts. + * And, we allow 1e6 attempts. This corresponds to a ~10 second wait. 
*/ +#define NVME_POLL_DELAY_USECS 10 +#define NVME_POLL_MAX_ATTEMPTS 1E6 + #define nvme_read4(_s, _r) \ bus_space_read_4((_s)->sc_iot, (_s)->sc_ioh, (_r)) #define nvme_write4(_s, _r, _v) \ @@ -918,6 +923,7 @@ nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb, void (*done)(struct nvme_softc *, struct nvme_ccb *, struct nvme_cqe *); void *cookie; u_int16_t flags; + int tries = 0; memset(&state, 0, sizeof(state)); (*fill)(sc, ccb, &state.s); @@ -931,9 +937,13 @@ nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb, nvme_q_submit(sc, q, ccb, nvme_poll_fill); while (!ISSET(state.c.flags, htole16(NVME_CQE_PHASE))) { if (nvme_q_complete(sc, q) == 0) - delay(10); + delay(NVME_POLL_DELAY_USECS); - /* XXX no timeout? */ + if (++tries >= NVME_POLL_MAX_ATTEMPTS) { + /* Something appears to be quite wrong... */ + printf("%s: poll timeout.\n", DEVNAME(sc)); + return (1); + } } ccb->ccb_cookie = cookie;
Re: [Diff] Implement multiple device cloning for hotplug
"Theo de Raadt" writes: > We don't need a new subsystem. > > We just the operation changed, so it sends the event when the device is ready, > rather than before it is ready. Ok fair enough. More curiosity than anything on my part. I need to find a way to make this information available for iscsid irrespective of the broader issue. I think it might be not "as simple" as just waiting for an attach and subsequent ready event to show up for the iscsi use case. I have some ideas, I'll come back with a diff rather than waste everyone's time on this. But happy to discuss privately if anyone is interested.
Fix IPsec NAT-T for L2TP/IPsec
Hi,

Radek reported a problem to misc@ that multiple Windows clients behind a NAT cannot use a L2TP/IPsec server simultaneously.

https://marc.info/?t=16099681611&r=1&w=2

There are two problems. The first is that pipex(4) doesn't pass the proper ipsecflowinfo to ip_output(). The second is that the IPsec policy check done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is not cached. This happens when its flow is shared by another tdb (for another client of the same NAT).

The following 2 diffs fix these problems.

comment? ok?

diff #1

Fix IPsec NAT-T work with pipex.

Index: sys/net/pipex.c
===================================================================
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c	10 Mar 2021 10:21:48 -0000	1.132
+++ sys/net/pipex.c	12 May 2021 09:38:32 -0000
@@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 #ifdef INET6
 	struct ip6_hdr *ip6;
 #endif
+	struct m_tag *mtag;
 
 	hlen = sizeof(struct pipex_l2tp_header) +
 	    ((pipex_session_is_l2tp_data_sequencing_on(session))
@@ -1703,6 +1704,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
 	ip->ip_ttl = MAXTTL;
 	ip->ip_tos = 0;
 	ip->ip_off = 0;
+
+	if (session->proto.l2tp.ipsecflowinfo > 0) {
+		if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
+		    sizeof(u_int32_t), M_NOWAIT)) == NULL)
+			goto drop;
+		*(u_int32_t *)(mtag + 1) =
+		    session->proto.l2tp.ipsecflowinfo;
+		m_tag_prepend(m0, mtag);
+	}
 
 	ip_send(m0);
 	break;
Index: sys/netinet/ip_input.c
===================================================================
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c	30 Apr 2021 13:52:48 -0000	1.359
+++ sys/netinet/ip_input.c	12 May 2021 09:38:32 -0000
@@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
 	struct mbuf_queue *mq = xmq;
 	struct mbuf *m;
 	struct mbuf_list ml;
+	struct m_tag *mtag;
+	u_int32_t ipsecflowinfo = 0;
 
 	mq_delist(mq, &ml);
 	if (ml_empty(&ml))
@@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
 
 	NET_LOCK();
 	while ((m = ml_dequeue(&ml)) != NULL) {
-		ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
+		if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
+		    != NULL) {
+			ipsecflowinfo = *(u_int32_t *)(mtag + 1);
+			m_tag_delete(m, mtag);
+		}
+		ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
 	}
 	NET_UNLOCK();
 }
Index: sys/sys/mbuf.h
===================================================================
RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
retrieving revision 1.252
diff -u -p -r1.252 mbuf.h
--- sys/sys/mbuf.h	25 Feb 2021 02:43:31 -0000	1.252
+++ sys/sys/mbuf.h	12 May 2021 09:38:32 -0000
@@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 /* Packet tag types */
 #define PACKET_TAG_IPSEC_IN_DONE	0x0001	/* IPsec applied, in */
 #define PACKET_TAG_IPSEC_OUT_DONE	0x0002	/* IPsec applied, out */
+#define PACKET_TAG_IPSEC_FLOWINFO	0x0004	/* IPsec flowinfo */
 #define PACKET_TAG_WIREGUARD		0x0040	/* WireGuard data */
 #define PACKET_TAG_GRE			0x0080	/* GRE processing done */
 #define PACKET_TAG_DLT			0x0100	/* data link layer type */
@@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 #define PACKET_TAG_CARP_BAL_IP	0x4000	/* carp(4) ip balanced marker */
 
 #define MTAG_BITS \
-    ("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_IN_CRYPTO_DONE" \
+    ("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_FLOWINFO" \
     "\4IPSEC_OUT_CRYPTO_NEEDED\5IPSEC_PENDING_TDB\6BRIDGE\7WG\10GRE\11DLT" \
     "\12PF_DIVERT\14PF_REASSEMBLED\15SRCROUTE\16TUNNEL\17CARP_BAL_IP")

diff #2

Make the IPsec flow able to have multiple `ipsec_ids' so that ipsp_spd_lookup() can check whether the `ipsec_ids' of the given tdb belongs to a flow shared by multiple clients behind a NAT.

Index: sys/net/pfkeyv2.c
===================================================================
RCS file: /disk/cvs/openbsd/src/sys/net/pfkeyv2.c,v
retrieving revision 1.211
diff -u -p -r1.211 pfkeyv2.c
--- sys/net/pfkeyv2.c	4 May 2021 09:28:04 -0000	1.211
+++ sys/net/pfkeyv2.c	12 May 2021 10:07:11 -0000
@@ -1106,6 +1106,7 @@ pfkeyv2_send(struct socket *so, void *me
 	int i, j, rval = 0, mode = PFKEYV2_SENDMESSAGE_BROADCAST;
 	int delflag = 0;
 	struct sockaddr_encap encapdst, encapnetmask;
+	struct ipsec_ids *ids, *ids0;
 	struct ipsec_policy *ipo;
 	struct ipsec_acquire *ipa;
 	stru
Re: running network stack forwarding in parallel
On 21.4.2021. 21:36, Alexander Bluhm wrote: > We need more MP preassure to find such bugs and races. I think now > is a good time to give this diff broader testing and commit it. > You need interfaces with multiple queues to see a difference. Hi, while forwarding ip4 traffic over box with parallel diff and aggr interfaces, and then aggr is destroyed, i'm getting witness log below... i can't reproduce this log without parallel diff and i'm getting this log only first time destroying aggr interface .. witness: lock order reversal: 1st 0x82139de0 netlock (netlock) 2nd 0x82120bb8 timeout (timeout) lock order "timeout"(rwlock) -> "netlock"(rwlock) first seen at: #0 rw_enter_write+0x43 #1 mld6_fasttimeo+0x14 #2 pffasttimo+0x67 #3 timeout_run+0x93 #4 softclock_thread+0x11d #5 proc_trampoline+0x1c lock order "netlock"(rwlock) -> "timeout"(rwlock) first seen at: #0 timeout_del_barrier+0x41 #1 aggr_p_dtor+0x17b #2 aggr_clone_destroy+0x91 #3 if_clone_destroy+0xd8 #4 ifioctl+0x1d2 #5 soo_ioctl+0x171 #6 sys_ioctl+0x2c4 #7 syscall+0x3b9 #8 Xsyscall+0x128 r620-1# cat /etc/hostname.aggr0 trunkport ix0 lladdr ec:f4:bb:da:f7:f8 inet 192.168.42.1 255.255.255.0 !route add 16/8 192.168.42.11 up r620-1# cat /etc/hostname.aggr1 trunkport ix1 inet 192.168.43.1 255.255.255.0 !route add 48/8 192.168.43.11 up OpenBSD 6.9-current (GENERIC.MP) #166: Wed May 12 10:18:11 CEST 2021 hrv...@r620-1.srce.hr:/sys/arch/amd64/compile/GENERIC.MP real mem = 17115840512 (16322MB) avail mem = 16450007040 (15687MB) random: good seed from bootblocks mpath0 at root scsibus0 at mpath0: 256 targets mainbus0 at root bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xcf42c000 (99 entries) bios0: vendor Dell Inc. version "2.9.0" date 12/06/2019 bios0: Dell Inc. 
PowerEdge R620 acpi0 at bios0: ACPI 3.0 acpi0: sleep states S0 S4 S5 acpi0: tables DSDT FACP APIC SPCR HPET DMAR MCFG WDAT SLIC ERST HEST BERT EINJ TCPA PC__ SRAT SSDT acpi0: wakeup devices PCI0(S5) acpitimer0 at acpi0: 3579545 Hz, 24 bits acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 4 (boot processor) cpu0: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.44 MHz, 06-3e-04 cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 2, package 0 mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges cpu0: apic clock running at 99MHz cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE cpu1 at mainbus0: apid 6 (application processor) cpu1: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.04 MHz, 06-3e-04 cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu1: 256KB 64b/line 8-way L2 cache cpu1: smt 0, core 3, package 0 cpu2 at mainbus0: apid 8 (application processor) cpu2: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04 cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu2: 256KB 64b/line 8-way L2 cache cpu2: smt 0, core 4, package 0 cpu3 at mainbus0: apid 16 (application processor) cpu3: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04 cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu3: 256KB 64b/line 8-way L2 cache cpu3: smt 0, core 8, package 0 cpu4 at mainbus0: apid 18 (application processor) cpu4: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04 cpu4: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SEN