Re: panic(9): set panicstr atomically

2021-05-12 Thread Anton Lindqvist
On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> Hi,
> 
> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
> 
> https://marc.info/?l=openbsd-tech&m=161908805925325&w=2
> 
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> 
> If we set panicstr atomically only one CPU will write panicbuf.
> 
> Thoughts?

I've seen panics caused by syzkaller where panicbuf looks scrambled by
more than one thread writing to the same static buffer. Assigning
panicstr before the vsnprintf() call therefore makes sense. This is also
what NetBSD does, although not as an atomic operation.

ok anton@



Re: iked(8): support for intermediate CAs and multiple CERT payloads

2021-05-12 Thread Stuart Henderson
I can't test at the moment, but as you asked for comments too: this is 
*very* welcome, it's an important missing feature. Thanks!


--
 Sent from a phone, apologies for poor formatting.
On 13 May 2021 06:40:49 Katsuhiro Ueno  wrote:


Hi,

I would be happy if iked(8) supports intermediate CAs and sends the
entire certificate chain to the clients. The diff attached adds
supports for intermediate CAs and multiple CERT payloads to iked(8).

What I would like to do is to use a LetsEncrypt certificate as a
server certificate of IKEv2 EAP and establish VPN connections with
Windows clients. However, I could not complete it because of the
following reasons.
* LetsEncrypt server certificate is issued by an intermediate CA
 and therefore the certificate of the intermediate CA is needed to
 check the validity of the server certificate.
* Windows expects the IKEv2 server to send the intermediate CA's
 certificate in addition to the server certificate to check the
 validity.
* On the other hand, iked(8) is not capable of dealing with
 certificate chains and sending multiple certificates (multiple
 CERT payloads) to the clients.
Consequently, Windows fails to verify the certificate and therefore
VPN connection cannot be established.

To overcome this, I added an (ad-hoc) support for certificate chain
and multiple CERT payloads. The diff attached is the changes that I
made. It works fine for me but I am not sure whether or not it works
for everyone and everywhere. Tests and comments are greatly
appreciated.

Many thanks,
Katsuhiro Ueno




Re: acme-client: use field agnostic {get,set}_affine_coordinates()

2021-05-12 Thread Florian Obser


I trust you know what you are doing. OK florian fwiw

On 2021-05-13 07:46 +02, Theo Buehler  wrote:
> The _GFp() variants provide no benefit and are just wrappers around the
> variants without _GFp(), so use the latter directly.
>
> Index: acctproc.c
> ===
> RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
> retrieving revision 1.20
> diff -u -p -r1.20 acctproc.c
> --- acctproc.c17 Jun 2019 15:20:10 -  1.20
> +++ acctproc.c13 May 2021 05:38:59 -
> @@ -109,9 +109,9 @@ op_thumb_ec(EVP_PKEY *pkey)
>   warnx("BN_new");
>   else if ((Y = BN_new()) == NULL)
>   warnx("BN_new");
> - else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
> + else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
>   EC_KEY_get0_public_key(ec), X, Y, NULL))
> - warnx("EC_POINT_get_affine_coordinates_GFp");
> + warnx("EC_POINT_get_affine_coordinates");
>   else if ((x = bn2string(X)) == NULL)
>   warnx("bn2string");
>   else if ((y = bn2string(Y)) == NULL)
> @@ -237,9 +237,9 @@ op_sign_ec(char **prot, EVP_PKEY *pkey, 
>   warnx("BN_new");
>   else if ((Y = BN_new()) == NULL)
>   warnx("BN_new");
> - else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
> + else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
>   EC_KEY_get0_public_key(ec), X, Y, NULL))
> - warnx("EC_POINT_get_affine_coordinates_GFp");
> + warnx("EC_POINT_get_affine_coordinates");
>   else if ((x = bn2string(X)) == NULL)
>   warnx("bn2string");
>   else if ((y = bn2string(Y)) == NULL)
>

-- 
I'm not entirely sure you are real.



Re: macppc bsd.mp pmap's hash lock

2021-05-12 Thread George Koehler
My last diff (11 May 2021) still has a potential problem with memory
barriers.  I will mail a new diff if I think of a fix.

On Tue, 11 May 2021 11:08:55 -0500
Dale Rahn  wrote:

> This structure really should be cache-line aligned, which should prevent it
> from spilling across a page boundary.

There is both a struct __ppc_lock and a function __ppc_lock(); my G5
got stuck at boot when the function crossed a page boundary.  It might
help to align parts of the function, but if I would do so, I might
need to write the entire function in asm and count instructions.

The potential problem in my last diff is here inside __ppc_lock():

  15eb8c:   7c c0 19 2d stwcx.  r6,0,r3
  15eb90:   40 c2 ff f0 bne+15eb80 <__ppc_lock+0x40>
  15eb94:   7c 08 38 40 cmplw   r8,r7
  15eb98:   41 82 00 28 beq-15ebc0 <__ppc_lock+0x80>
  ...
  15ebc0:   4c 00 01 2c isync

After "stwcx." grabs the lock, we need a "bc; isync" memory barrier,
which is the "bne+" and "isync" in the above disassembly.  A page
fault might act like an "isync", so if we can do the "bne+" before
the page fault, we might be fine.

The problem is that, if the "stwcx." and "bne+" are in different pages
(1 in 1024 chance?), then I would get a page fault before the "bne+",
so my code would skip the necessary memory barrier.  I guess that I
can fix it by adding another membar_enter() somewhere.

> The powerpc pmap was originally designed to have  8 'way' locks so that
> only a single way would get locked, thus (as long as one doesn't have way
> more than 8 cores) any core should be able to choose some way to replace
> and get the work done. This also would have allowed limited recursion.
> However at some point those were taken out. Note that before locking one of
> the ways of the hashtable, it would hold the lock on the pmap that it was
> inserting the mapping from (which could be the kernel). Not certain if the
> kernel pmap lock is recursive or not.

That's a good history lesson for me; I'm too new.

The pmap lock is "struct mutex pm_mtx" (powerpc/include/pmap.h), so
it isn't recursive.  The code in pmap.c pte_spill_r() doesn't hold the
pmap lock if the page is in the direct map.  The kernel text, including
the function __ppc_lock(), is in the direct map.

> The point was to allow multiple cores to access different regions of the
> hashtable at the same time (minimal contention). However it would also
> allow a core to reenter the lock_try on a different way where it should be
> able to succeed (as long as the recursion doesn't go 8 deep?)

If I understand the code, the max recursion is 2 levels.  The kernel
can grab the hash lock, get a page fault, then the page fault handler
can do 2nd grab.  The handler is in real mode (no address translation)
and would never cause another page fault and a 3rd grab.

If we would have the 8 locks, then it might deadlock if 8 cpus each
hold a lock and try to grab a 2nd lock (but a macppc has at most
4 cpus: PPC_MAXPROCS in powerpc/include/cpu.h).

--George

> On Tue, May 11, 2021 at 12:42 AM George Koehler  wrote:
> 
> > I made a mistake in my last diff by deleting the memory barriers.
> > Here is a diff that keeps membar_enter() and membar_exit().
> >
> > With my last diff, my G5 froze after 15 hours of uptime, and again
> > after 10 hours of uptime.  I'm not sure whether the missing memory
> > barriers caused the freeze, but I will be running this diff and hoping
> > for fewer freezes.
> >
> > On Sat, 8 May 2021 18:59:35 +0200 (CEST)
> > Mark Kettenis  wrote:
> >
> > > Good find!  On powerpc64 I avoid the issue because I guarantee that
> > > the kernel mappings are never evicted from the hash.  But doing so on
> > > powerpc would require more serious development.  I'm not sure we
> > > really need a ticket lock for this, but since you already did the
> > > work, let's stick with it for now.
> >
> > __ppc_lock (before and after this diff) doesn't give tickets like
> > __mp_lock does, but __ppc_lock and __mp_lock are both recursive locks.
> > I don't know an easy way to avoid the recursion, if some of the kernel
> > mappings might not be in the hash.  My __ppc_lock diff should be easy
> > if I don't make mistakes like wrong memory barriers.
> >
> > --George
> >
> > Index: arch/powerpc/include/mplock.h
> > ===
> > RCS file: /cvs/src/sys/arch/powerpc/include/mplock.h,v
> > retrieving revision 1.4
> > diff -u -p -r1.4 mplock.h
> > --- arch/powerpc/include/mplock.h   15 Apr 2020 08:09:00 -  1.4
> > +++ arch/powerpc/include/mplock.h   10 May 2021 23:33:00 -
> > @@ -30,13 +30,13 @@
> >  #define __USE_MI_MPLOCK
> >
> >  /*
> > + * __ppc_lock exists because pte_spill_r() can't use __mp_lock.
> >   * Really simple spinlock implementation with recursive capabilities.
> >   * Correctness is paramount, no fancyness allowed.
> >   */
> >
> >  struct __ppc_lock {
> > -   volatile struct cpu_info *mpl_cpu

acme-client: use field agnostic {get,set}_affine_coordinates()

2021-05-12 Thread Theo Buehler
The _GFp() variants provide no benefit and are just wrappers around the
variants without _GFp(), so use the latter directly.

Index: acctproc.c
===
RCS file: /cvs/src/usr.sbin/acme-client/acctproc.c,v
retrieving revision 1.20
diff -u -p -r1.20 acctproc.c
--- acctproc.c  17 Jun 2019 15:20:10 -  1.20
+++ acctproc.c  13 May 2021 05:38:59 -
@@ -109,9 +109,9 @@ op_thumb_ec(EVP_PKEY *pkey)
warnx("BN_new");
else if ((Y = BN_new()) == NULL)
warnx("BN_new");
-   else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
+   else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
EC_KEY_get0_public_key(ec), X, Y, NULL))
-   warnx("EC_POINT_get_affine_coordinates_GFp");
+   warnx("EC_POINT_get_affine_coordinates");
else if ((x = bn2string(X)) == NULL)
warnx("bn2string");
else if ((y = bn2string(Y)) == NULL)
@@ -237,9 +237,9 @@ op_sign_ec(char **prot, EVP_PKEY *pkey, 
warnx("BN_new");
else if ((Y = BN_new()) == NULL)
warnx("BN_new");
-   else if (!EC_POINT_get_affine_coordinates_GFp(EC_KEY_get0_group(ec),
+   else if (!EC_POINT_get_affine_coordinates(EC_KEY_get0_group(ec),
EC_KEY_get0_public_key(ec), X, Y, NULL))
-   warnx("EC_POINT_get_affine_coordinates_GFp");
+   warnx("EC_POINT_get_affine_coordinates");
else if ((x = bn2string(X)) == NULL)
warnx("bn2string");
else if ((y = bn2string(Y)) == NULL)



iked(8): support for intermediate CAs and multiple CERT payloads

2021-05-12 Thread Katsuhiro Ueno
Hi,

I would be happy if iked(8) supports intermediate CAs and sends the
entire certificate chain to the clients. The diff attached adds
supports for intermediate CAs and multiple CERT payloads to iked(8).

What I would like to do is to use a LetsEncrypt certificate as a
server certificate of IKEv2 EAP and establish VPN connections with
Windows clients. However, I could not complete it because of the
following reasons.
* LetsEncrypt server certificate is issued by an intermediate CA
  and therefore the certificate of the intermediate CA is needed to
  check the validity of the server certificate.
* Windows expects the IKEv2 server to send the intermediate CA's
  certificate in addition to the server certificate to check the
  validity.
* On the other hand, iked(8) is not capable of dealing with
  certificate chains and sending multiple certificates (multiple
  CERT payloads) to the clients.
Consequently, Windows fails to verify the certificate and therefore
VPN connection cannot be established.

To overcome this, I added an (ad-hoc) support for certificate chain
and multiple CERT payloads. The diff attached is the changes that I
made. It works fine for me but I am not sure whether or not it works
for everyone and everywhere. Tests and comments are greatly
appreciated.

Many thanks,
Katsuhiro Ueno


iked.diff
Description: Binary data


isakmpd: simplify calls to {get,set}_affine_coordinates

2021-05-12 Thread Theo Buehler
There never was a real need for a distinction between the GFp and GF2m
variants of EC_POINT_{get,set}_affine_coordinates_{GFp,GF2m}() in
libcrypto. The EC_GROUP method has the correct function pointer set. Now
that we have EC_POINT_{get,set}_affine_coordinates(), let's use them and
avoid the complexity of using the GFp and GF2m variants.

Index: dh.c
===
RCS file: /cvs/src/sbin/isakmpd/dh.c,v
retrieving revision 1.21
diff -u -p -r1.21 dh.c
--- dh.c8 Nov 2017 13:33:49 -   1.21
+++ dh.c13 May 2021 05:25:32 -
@@ -535,16 +535,8 @@ ec_point2raw(struct group *group, const 
if ((ecgroup = EC_KEY_get0_group(group->ec)) == NULL)
goto done;
 
-   if (EC_METHOD_get_field_type(EC_GROUP_method_of(ecgroup)) ==
-   NID_X9_62_prime_field) {
-   if (!EC_POINT_get_affine_coordinates_GFp(ecgroup,
-   point, x, y, bnctx))
-   goto done;
-   } else {
-   if (!EC_POINT_get_affine_coordinates_GF2m(ecgroup,
-   point, x, y, bnctx))
-   goto done;
-   }
+   if (!EC_POINT_get_affine_coordinates(ecgroup, point, x, y, bnctx))
+   goto done;
 
xoff = xlen - BN_num_bytes(x);
bzero(buf, xoff);
@@ -603,16 +595,8 @@ ec_raw2point(struct group *group, u_int8
if ((point = EC_POINT_new(ecgroup)) == NULL)
goto done;
 
-   if (EC_METHOD_get_field_type(EC_GROUP_method_of(ecgroup)) ==
-   NID_X9_62_prime_field) {
-   if (!EC_POINT_set_affine_coordinates_GFp(ecgroup,
-   point, x, y, bnctx))
-   goto done;
-   } else {
-   if (!EC_POINT_set_affine_coordinates_GF2m(ecgroup,
-   point, x, y, bnctx))
-   goto done;
-   }
+   if (!EC_POINT_set_affine_coordinates(ecgroup, point, x, y, bnctx))
+   goto done;
 
ret = 0;
  done:



timeout(9): add TIMEOUT_MPSAFE flag

2021-05-12 Thread Scott Cheloha
Hi,

With the removal of the kernel lock from timeout_barrier(9),
softclock() and the timeout thread do not need the kernel lock.

However, many timeouts implicitly rely on the kernel lock.

So to unlock softclock() and the timeout thread I propose adding a new
flag, TIMEOUT_MPSAFE.  The flag signifies that a given timeout doesn't
need the kernel lock at all.  We'll run all such timeouts without the
kernel lock.

Taking/releasing the kernel lock during timeout_run() on a per-timeout
basis is way too slow.  Instead, during softclock() we can aggregate
TIMEOUT_MPSAFE timeouts onto secondary queues and process them all at
once after dropping the kernel lock.  This will minimize locking
overhead.

Normal timeouts are put onto timeout_todo_mpsafe and are run during
softclock().  Process timeouts are put onto timeout_proc_mpsafe and
run from the timeout thread.

The funky locking dance in softclock() and softclock_thread() is
needed to keep from violating the locking hierarchy.  The
timeout_mutex is above the kernel lock, so we need to leave the
timeout_mutex, then drop the kernel lock, and then reenter the
timeout_mutex to start running TIMEOUT_MPSAFE timeouts.

Thoughts?

Index: kern/kern_timeout.c
===
RCS file: /cvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.84
diff -u -p -r1.84 kern_timeout.c
--- kern/kern_timeout.c 11 May 2021 13:29:25 -  1.84
+++ kern/kern_timeout.c 13 May 2021 05:01:53 -
@@ -74,7 +74,9 @@ struct circq timeout_wheel[BUCKETS];  /* 
 struct circq timeout_wheel_kc[BUCKETS];/* [T] Clock-based timeouts */
 struct circq timeout_new;  /* [T] New, unscheduled timeouts */
 struct circq timeout_todo; /* [T] Due or needs rescheduling */
+struct circq timeout_todo_mpsafe;  /* [T] Due + no kernel lock */
 struct circq timeout_proc; /* [T] Due + needs process context */
+struct circq timeout_proc_mpsafe;  /* [T] Due, proc ctx, no kernel lock */
 
 time_t timeout_level_width[WHEELCOUNT];/* [I] Wheel level width 
(seconds) */
 struct timespec tick_ts;   /* [I] Length of a tick (1/hz secs) */
@@ -228,7 +230,9 @@ timeout_startup(void)
 
CIRCQ_INIT(&timeout_new);
CIRCQ_INIT(&timeout_todo);
+   CIRCQ_INIT(&timeout_todo_mpsafe);
CIRCQ_INIT(&timeout_proc);
+   CIRCQ_INIT(&timeout_proc_mpsafe);
for (b = 0; b < nitems(timeout_wheel); b++)
CIRCQ_INIT(&timeout_wheel[b]);
for (b = 0; b < nitems(timeout_wheel_kc); b++)
@@ -697,7 +701,13 @@ softclock_process_kclock_timeout(struct 
if (!new && timespeccmp(&to->to_abstime, &kc->kc_late, <=))
tostat.tos_late++;
if (ISSET(to->to_flags, TIMEOUT_PROC)) {
-   CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+   if (ISSET(to->to_flags, TIMEOUT_MPSAFE))
+   CIRCQ_INSERT_TAIL(&timeout_proc_mpsafe, &to->to_list);
+   else
+   CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+   return;
+   } else if (ISSET(to->to_flags, TIMEOUT_MPSAFE)) {
+   CIRCQ_INSERT_TAIL(&timeout_todo_mpsafe, &to->to_list);
return;
}
timeout_run(to);
@@ -719,7 +729,13 @@ softclock_process_tick_timeout(struct ti
if (!new && delta < 0)
tostat.tos_late++;
if (ISSET(to->to_flags, TIMEOUT_PROC)) {
-   CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+   if (ISSET(to->to_flags, TIMEOUT_MPSAFE))
+   CIRCQ_INSERT_TAIL(&timeout_proc_mpsafe, &to->to_list);
+   else
+   CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+   return;
+   } else if (ISSET(to->to_flags, TIMEOUT_MPSAFE)) {
+   CIRCQ_INSERT_TAIL(&timeout_todo_mpsafe, &to->to_list);
return;
}
timeout_run(to);
@@ -735,11 +751,14 @@ softclock_process_tick_timeout(struct ti
 void
 softclock(void *arg)
 {
+   struct circq *cq;
struct timeout *first_new, *to;
-   int needsproc, new;
+   int needsproc, new, unlocked;
 
first_new = NULL;
+   needsproc = 1;
new = 0;
+   unlocked = 0;
 
mtx_enter(&timeout_mutex);
if (!CIRCQ_EMPTY(&timeout_new))
@@ -755,10 +774,27 @@ softclock(void *arg)
else
softclock_process_tick_timeout(to, new);
}
+   if (!CIRCQ_EMPTY(&timeout_todo_mpsafe)) {
+   mtx_leave(&timeout_mutex);
+   KERNEL_UNLOCK();
+   unlocked = 1;
+   mtx_enter(&timeout_mutex);
+   while (!CIRCQ_EMPTY(&timeout_todo_mpsafe)) {
+   cq = CIRCQ_FIRST(&timeout_todo_mpsafe);
+   to = timeout_from_circq(cq);
+   CIRCQ_REMOVE(&to->to_list);
+   timeout_run(to);
+   t

Whitespace fix in sys/scsi/scsi_base.c

2021-05-12 Thread Ashton Fagg
Found this while poking around.

diff --git a/sys/scsi/scsi_base.c b/sys/scsi/scsi_base.c
index 2ba6a702fbc..6769449e2fa 100644
--- a/sys/scsi/scsi_base.c
+++ b/sys/scsi/scsi_base.c
@@ -478,7 +478,7 @@ scsi_io_get(struct scsi_iopool *iopl, int flags)
 		return NULL;
 
 	/* otherwise sleep until we get one */
-scsi_ioh_set(&ioh, iopl, scsi_io_get_done, &m);
+	scsi_ioh_set(&ioh, iopl, scsi_io_get_done, &m);
 	scsi_ioh_add(&ioh);
 	scsi_move(&m);
 


Re: Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread YASUOKA Masahiko
On Wed, 12 May 2021 19:11:09 +0900 (JST)
YASUOKA Masahiko  wrote:
> Radek reported a problem to misc@ that multiple Windows clients behind
> a NAT cannot use a L2TP/IPsec server simultaneously.
> 
> https://marc.info/?t=16099681611&r=1&w=2
> 
> There is two problems.  First is pipex(4) doesn't pass the proper
> ipsecflowinfo to ip_output().  Second is the IPsec policy check which
> is done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is
> not cached.  This happens when its flow is shared by another tdb (for
> another client of the same NAT).
> 
> The following 2 diffs fix these problem.
> 
> comment?
> ok?
> 
> diff #1
> 
> Fix IPsec NAT-T work with pipex.

The original diff #1 used m_tag to specify the ipsecflowinfo.

I noticed "ph_cookie" is usable instead of the m_tag.  It seems simpler.

Is it better?

Index: sys/net/if_etherip.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_etherip.c,v
retrieving revision 1.48
diff -u -p -r1.48 if_etherip.c
--- sys/net/if_etherip.c9 Jan 2021 21:00:58 -   1.48
+++ sys/net/if_etherip.c12 May 2021 23:29:41 -
@@ -547,7 +547,7 @@ ip_etherip_output(struct ifnet *ifp, str
etheripstat_pkt(etherips_opackets, etherips_obytes, m->m_pkthdr.len -
(sizeof(struct ip) + sizeof(struct etherip_header)));
 
-   ip_send(m);
+   ip_send(m, 0);
 
return (0);
 }
Index: sys/net/if_gif.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_gif.c,v
retrieving revision 1.132
diff -u -p -r1.132 if_gif.c
--- sys/net/if_gif.c20 Feb 2021 04:58:29 -  1.132
+++ sys/net/if_gif.c12 May 2021 23:29:45 -
@@ -340,7 +340,7 @@ gif_send(struct gif_softc *sc, struct mb
ip->ip_src = sc->sc_tunnel.t_src4;
ip->ip_dst = sc->sc_tunnel.t_dst4;
 
-   ip_send(m);
+   ip_send(m, 0);
break;
}
 #ifdef INET6
Index: sys/net/if_gre.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_gre.c,v
retrieving revision 1.171
diff -u -p -r1.171 if_gre.c
--- sys/net/if_gre.c10 Mar 2021 10:21:47 -  1.171
+++ sys/net/if_gre.c12 May 2021 23:29:52 -
@@ -1999,7 +1999,7 @@ gre_ip_output(const struct gre_tunnel *t
 
switch (tunnel->t_af) {
case AF_INET:
-   ip_send(m);
+   ip_send(m, 0);
break;
 #ifdef INET6
case AF_INET6:
Index: sys/net/pf.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pf.c,v
retrieving revision 1.1116
diff -u -p -r1.1116 pf.c
--- sys/net/pf.c27 Apr 2021 09:38:29 -  1.1116
+++ sys/net/pf.c12 May 2021 23:29:56 -
@@ -2896,7 +2896,7 @@ pf_send_tcp(const struct pf_rule *r, sa_
 
switch (af) {
case AF_INET:
-   ip_send(m);
+   ip_send(m, 0);
break;
 #ifdef INET6
case AF_INET6:
Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c 10 Mar 2021 10:21:48 -  1.132
+++ sys/net/pipex.c 12 May 2021 23:31:24 -
@@ -1258,7 +1258,7 @@ pipex_pptp_output(struct mbuf *m0, struc
gre->flags = htons(gre->flags);
 
m0->m_pkthdr.ph_ifidx = session->ifindex;
-   ip_send(m0);
+   ip_send(m0, 0);
if (len > 0) {  /* network layer only */
/* countup statistics */
session->stat.opackets++;
@@ -1704,7 +1704,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
ip->ip_tos = 0;
ip->ip_off = 0;
 
-   ip_send(m0);
+   ip_send(m0, session->proto.l2tp.ipsecflowinfo);
break;
 #ifdef INET6
case AF_INET6:
Index: sys/netinet/ip_icmp.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_icmp.c,v
retrieving revision 1.186
diff -u -p -r1.186 ip_icmp.c
--- sys/netinet/ip_icmp.c   30 Mar 2021 08:37:10 -  1.186
+++ sys/netinet/ip_icmp.c   12 May 2021 23:31:57 -
@@ -860,7 +860,7 @@ icmp_send(struct mbuf *m, struct mbuf *o
ipstat_inc(ips_localout);
ip_send_raw(m);
} else
-   ip_send(m);
+   ip_send(m, 0);
 }
 
 u_int32_t
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c  30 Apr 2021 13:52:48 -  1.359
+++ sys/netinet/ip_input.c  12 May 2021 23:29:01 -
@@ -1790,6 +1790,7 @@ ip_send_do_dispatch(void *xmq, int flags
st

Re: panic(9): set panicstr atomically

2021-05-12 Thread Theo de Raadt
Nicer than the garble...

It would be nice if we could see all the panics.  Could we also have a
per-cpu panic buffer, and then adapt ddb to show them all?

Scott Cheloha  wrote:

> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
> 
> https://marc.info/?l=openbsd-tech&m=161908805925325&w=2
> 
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> 
> If we set panicstr atomically only one CPU will write panicbuf.
> 
> Thoughts?
> 
> Index: kern/subr_prf.c
> ===
> RCS file: /cvs/src/sys/kern/subr_prf.c,v
> retrieving revision 1.102
> diff -u -p -r1.102 subr_prf.c
> --- kern/subr_prf.c   28 Nov 2020 17:53:05 -  1.102
> +++ kern/subr_prf.c   13 May 2021 00:04:28 -
> @@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
>   */
>  
>  extern   int log_open;   /* subr_log: is /dev/klog open? */
> -constchar *panicstr; /* arg to first call to panic (used as a flag
> +volatile const char *panicstr; /* arg to first call to panic (used as a flag
>  to indicate that panic has already been called). */
>  constchar *faultstr; /* page fault string */
>  #ifdef DDB
> @@ -195,12 +195,10 @@ panic(const char *fmt, ...)
>  
>   bootopt = RB_AUTOBOOT | RB_DUMP;
>   va_start(ap, fmt);
> - if (panicstr)
> + if (atomic_cas_ptr(&panicstr, NULL, panicbuf) != NULL)
>   bootopt |= RB_NOSYNC;
> - else {
> + else
>   vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
> - panicstr = panicbuf;
> - }
>   va_end(ap);
>  
>   printf("panic: ");
> Index: sys/systm.h
> ===
> RCS file: /cvs/src/sys/sys/systm.h,v
> retrieving revision 1.153
> diff -u -p -r1.153 systm.h
> --- sys/systm.h   28 Apr 2021 09:42:04 -  1.153
> +++ sys/systm.h   13 May 2021 00:04:28 -
> @@ -71,7 +71,7 @@
>   * patched by a stalking hacker.
>   */
>  extern int securelevel;  /* system security level */
> -extern const char *panicstr; /* panic message */
> +extern volatile const char *panicstr;/* panic message */
>  extern const char *faultstr; /* fault message */
>  extern const char version[]; /* system version */
>  extern const char copyright[];   /* system copyright */
> 



panic(9): set panicstr atomically

2021-05-12 Thread Scott Cheloha
Hi,

In a separate mail thread, bluhm@ mentioned that panic(9) does not
cleanly handle multiple CPUs entering it simultaneously:

https://marc.info/?l=openbsd-tech&m=161908805925325&w=2

I'm unsure which part of panic(9) is causing the problem he mentions,
but one obvious issue I see is that panicstr is not set atomically,
so two CPUs entering panic(9) simultaneously may clobber panicbuf.

If we set panicstr atomically only one CPU will write panicbuf.

Thoughts?

Index: kern/subr_prf.c
===
RCS file: /cvs/src/sys/kern/subr_prf.c,v
retrieving revision 1.102
diff -u -p -r1.102 subr_prf.c
--- kern/subr_prf.c 28 Nov 2020 17:53:05 -  1.102
+++ kern/subr_prf.c 13 May 2021 00:04:28 -
@@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
  */
 
 extern int log_open;   /* subr_log: is /dev/klog open? */
-const  char *panicstr; /* arg to first call to panic (used as a flag
+volatile const char *panicstr; /* arg to first call to panic (used as a flag
   to indicate that panic has already been called). */
 const  char *faultstr; /* page fault string */
 #ifdef DDB
@@ -195,12 +195,10 @@ panic(const char *fmt, ...)
 
bootopt = RB_AUTOBOOT | RB_DUMP;
va_start(ap, fmt);
-   if (panicstr)
+   if (atomic_cas_ptr(&panicstr, NULL, panicbuf) != NULL)
bootopt |= RB_NOSYNC;
-   else {
+   else
vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
-   panicstr = panicbuf;
-   }
va_end(ap);
 
printf("panic: ");
Index: sys/systm.h
===
RCS file: /cvs/src/sys/sys/systm.h,v
retrieving revision 1.153
diff -u -p -r1.153 systm.h
--- sys/systm.h 28 Apr 2021 09:42:04 -  1.153
+++ sys/systm.h 13 May 2021 00:04:28 -
@@ -71,7 +71,7 @@
  * patched by a stalking hacker.
  */
 extern int securelevel;/* system security level */
-extern const char *panicstr;   /* panic message */
+extern volatile const char *panicstr;  /* panic message */
 extern const char *faultstr;   /* fault message */
 extern const char version[];   /* system version */
 extern const char copyright[]; /* system copyright */



Re: Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread Vitaliy Makkoveev
ok mvs@


> On 13 May 2021, at 02:43, YASUOKA Masahiko  wrote:
> 
> On Wed, 12 May 2021 19:15:29 +0300
> Vitaliy Makkoveev  wrote:
>>> On 12 May 2021, at 18:42, YASUOKA Masahiko  wrote:
>>> On Wed, 12 May 2021 17:26:51 +0300
>>> Vitaliy Makkoveev  wrote:
 On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
> Radek reported a problem to misc@ that multiple Windows clients behind a 
> NAT
> cannot use a L2TP/IPsec server simultaneously.
> 
> https://marc.info/?t=16099681611&r=1&w=2
> 
> There is two problems.  First is pipex(4) doesn't pass the proper
> ipsecflowinfo to ip_output().  Second is the IPsec policy check which is
> done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is not
> cached.  This happens when its flow is shared by another tdb (for another
> client of the same NAT).
> 
> The following 2 diffs fix these problem.
> 
> comment?
> ok?
> 
 
 Hi.
 
 I have two comments for the diff 1:
 
 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
   m_tag_get(9).
 2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
  pointed the place in your diff.
>>> 
>>> Good catch.  Thanks.
>>> 
>> 
>> m_freem(9) accepts NULL so this check before is redundant.
> 
> Yes,
> 
>> It seems to me that "Used by the IPv4 stack to specify the IPsec flow
>> of an output IP packet. The tag contains a u_int32_t identifying the
>> IPsec flow.” is enough. Anyway it’s better to ask jmc@.
> 
> Ok,
> 
>> Also I like to remove PACKET_TAG_PIPEX with separate diff.
> 
> I removed PACKET_TAG_PIPEX separetely.  
> 
> Let me update the diff.
> 
> Index: sys/net/pipex.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
> retrieving revision 1.132
> diff -u -p -r1.132 pipex.c
> --- sys/net/pipex.c   10 Mar 2021 10:21:48 -  1.132
> +++ sys/net/pipex.c   12 May 2021 23:18:52 -
> @@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
> #ifdef INET6
>   struct ip6_hdr *ip6;
> #endif
> + struct m_tag *mtag;
> 
>   hlen = sizeof(struct pipex_l2tp_header) +
>   ((pipex_session_is_l2tp_data_sequencing_on(session))
> @@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
>   ip->ip_tos = 0;
>   ip->ip_off = 0;
> 
> + if (session->proto.l2tp.ipsecflowinfo > 0) {
> + if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
> + sizeof(u_int32_t), M_NOWAIT)) == NULL)
> + goto drop;
> + *(u_int32_t *)(mtag + 1) =
> + session->proto.l2tp.ipsecflowinfo;
> + m_tag_prepend(m0, mtag);
> + }
> +
>   ip_send(m0);
>   break;
> #ifdef INET6
> @@ -1733,6 +1743,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
> 
>   return;
> drop:
> + m_freem(m0);
>   session->stat.oerrors++;
> }
> 
> Index: sys/netinet/ip_input.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
> retrieving revision 1.359
> diff -u -p -r1.359 ip_input.c
> --- sys/netinet/ip_input.c30 Apr 2021 13:52:48 -  1.359
> +++ sys/netinet/ip_input.c12 May 2021 23:18:52 -
> @@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
>   struct mbuf_queue *mq = xmq;
>   struct mbuf *m;
>   struct mbuf_list ml;
> + struct m_tag *mtag;
> + u_int32_t ipsecflowinfo = 0;
> 
>   mq_delist(mq, &ml);
>   if (ml_empty(&ml))
> @@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
> 
>   NET_LOCK();
>   while ((m = ml_dequeue(&ml)) != NULL) {
> - ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
> + if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
> + != NULL) {
> + ipsecflowinfo = *(u_int32_t *)(mtag + 1);
> + m_tag_delete(m, mtag);
> + }
> + ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
>   }
>   NET_UNLOCK();
> }
> Index: sys/sys/mbuf.h
> ===
> RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
> retrieving revision 1.252
> diff -u -p -r1.252 mbuf.h
> --- sys/sys/mbuf.h25 Feb 2021 02:43:31 -  1.252
> +++ sys/sys/mbuf.h12 May 2021 23:18:52 -
> @@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
> /* Packet tag types */
> #define PACKET_TAG_IPSEC_IN_DONE  0x0001  /* IPsec applied, in */
> #define PACKET_TAG_IPSEC_OUT_DONE 0x0002  /* IPsec applied, out */
> +#define PACKET_TAG_IPSEC_FLOWINFO0x0004  /* IPsec flowinfo */
> #define PACKET_TAG_WIREGUARD  0x0040  /* WireGuard data */
> #define PACKET_TAG_GRE0x0080  /* GRE processing

Re: Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread YASUOKA Masahiko
On Wed, 12 May 2021 19:15:29 +0300
Vitaliy Makkoveev  wrote:
>> On 12 May 2021, at 18:42, YASUOKA Masahiko  wrote:
>> On Wed, 12 May 2021 17:26:51 +0300
>> Vitaliy Makkoveev  wrote:
>>> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
 Radek reported a problem to misc@ that multiple Windows clients behind a 
 NAT
 cannot use a L2TP/IPsec server simultaneously.
 
 https://marc.info/?t=16099681611&r=1&w=2
 
 There is two problems.  First is pipex(4) doesn't pass the proper
 ipsecflowinfo to ip_output().  Second is the IPsec policy check which is
 done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is not
 cached.  This happens when its flow is shared by another tdb (for another
 client of the same NAT).
 
 The following 2 diffs fix these problem.
 
 comment?
 ok?
 
>>> 
>>> Hi.
>>> 
>>> I have two comments for the diff 1:
>>> 
>>> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
>>>m_tag_get(9).
>>> 2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
>>>   pointed the place in your diff.
>> 
>> Good catch.  Thanks.
>> 
> 
> m_freem(9) accepts NULL so this check before is redundant.

Yes,

> It seems to me that "Used by the IPv4 stack to specify the IPsec flow
> of an output IP packet. The tag contains a u_int32_t identifying the
> IPsec flow.” is enough. Anyway it’s better to ask jmc@.

Ok,

> Also I like to remove PACKET_TAG_PIPEX with separate diff.

I removed PACKET_TAG_PIPEX separetely.  

Let me update the diff.

Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c 10 Mar 2021 10:21:48 -  1.132
+++ sys/net/pipex.c 12 May 2021 23:18:52 -
@@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 #ifdef INET6
struct ip6_hdr *ip6;
 #endif
+   struct m_tag *mtag;
 
hlen = sizeof(struct pipex_l2tp_header) +
((pipex_session_is_l2tp_data_sequencing_on(session))
@@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
ip->ip_tos = 0;
ip->ip_off = 0;
 
+   if (session->proto.l2tp.ipsecflowinfo > 0) {
+   if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
+   sizeof(u_int32_t), M_NOWAIT)) == NULL)
+   goto drop;
+   *(u_int32_t *)(mtag + 1) =
+   session->proto.l2tp.ipsecflowinfo;
+   m_tag_prepend(m0, mtag);
+   }
+
ip_send(m0);
break;
 #ifdef INET6
@@ -1733,6 +1743,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 
return;
 drop:
+   m_freem(m0);
session->stat.oerrors++;
 }
 
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c  30 Apr 2021 13:52:48 -  1.359
+++ sys/netinet/ip_input.c  12 May 2021 23:18:52 -
@@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
struct mbuf_queue *mq = xmq;
struct mbuf *m;
struct mbuf_list ml;
+   struct m_tag *mtag;
+   u_int32_t ipsecflowinfo = 0;
 
mq_delist(mq, &ml);
if (ml_empty(&ml))
@@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
 
NET_LOCK();
while ((m = ml_dequeue(&ml)) != NULL) {
-   ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
+   if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
+   != NULL) {
+   ipsecflowinfo = *(u_int32_t *)(mtag + 1);
+   m_tag_delete(m, mtag);
+   }
+   ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
}
NET_UNLOCK();
 }
Index: sys/sys/mbuf.h
===
RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
retrieving revision 1.252
diff -u -p -r1.252 mbuf.h
--- sys/sys/mbuf.h  25 Feb 2021 02:43:31 -  1.252
+++ sys/sys/mbuf.h  12 May 2021 23:18:52 -
@@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
 /* Packet tag types */
 #define PACKET_TAG_IPSEC_IN_DONE   0x0001  /* IPsec applied, in */
 #define PACKET_TAG_IPSEC_OUT_DONE  0x0002  /* IPsec applied, out */
+#define PACKET_TAG_IPSEC_FLOWINFO  0x0004  /* IPsec flowinfo */
 #define PACKET_TAG_WIREGUARD   0x0040  /* WireGuard data */
 #define PACKET_TAG_GRE 0x0080  /* GRE processing done */
 #define PACKET_TAG_DLT 0x0100 /* data link layer type */
@@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
 #define PACKET_TAG_CARP_BAL_IP 0x4000  /*

Re: running network stack forwarding in parallel

2021-05-12 Thread Vitaliy Makkoveev
It seems this lock order issue is not parallel diff specific.

> On 12 May 2021, at 12:58, Hrvoje Popovski  wrote:
> 
> On 21.4.2021. 21:36, Alexander Bluhm wrote:
>> We need more MP preassure to find such bugs and races.  I think now
>> is a good time to give this diff broader testing and commit it.
>> You need interfaces with multiple queues to see a difference.
> 
> Hi,
> 
> while forwarding ip4 traffic over box with parallel diff and aggr
> interfaces, and then aggr is destroyed, i'm getting witness log below...
> 
> i can't reproduce this log without parallel diff and i'm getting this
> log only first time destroying aggr interface ..
> 
> 
> 
> witness: lock order reversal:
> 1st 0x82139de0 netlock (netlock)
> 2nd 0x82120bb8 timeout (timeout)
> lock order "timeout"(rwlock) -> "netlock"(rwlock) first seen at:
> #0  rw_enter_write+0x43
> #1  mld6_fasttimeo+0x14
> #2  pffasttimo+0x67
> #3  timeout_run+0x93
> #4  softclock_thread+0x11d
> #5  proc_trampoline+0x1c
> lock order "netlock"(rwlock) -> "timeout"(rwlock) first seen at:
> #0  timeout_del_barrier+0x41
> #1  aggr_p_dtor+0x17b
> #2  aggr_clone_destroy+0x91
> #3  if_clone_destroy+0xd8
> #4  ifioctl+0x1d2
> #5  soo_ioctl+0x171
> #6  sys_ioctl+0x2c4
> #7  syscall+0x3b9
> #8  Xsyscall+0x128
> 
> 
> r620-1# cat /etc/hostname.aggr0
> trunkport ix0
> lladdr ec:f4:bb:da:f7:f8
> inet 192.168.42.1 255.255.255.0
> !route add 16/8 192.168.42.11
> up
> 
> r620-1# cat /etc/hostname.aggr1
> trunkport ix1
> inet 192.168.43.1 255.255.255.0
> !route add 48/8 192.168.43.11
> up
> 
> 
> 
> OpenBSD 6.9-current (GENERIC.MP) #166: Wed May 12 10:18:11 CEST 2021
>hrv...@r620-1.srce.hr:/sys/arch/amd64/compile/GENERIC.MP
> real mem = 17115840512 (16322MB)
> avail mem = 16450007040 (15687MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xcf42c000 (99 entries)
> bios0: vendor Dell Inc. version "2.9.0" date 12/06/2019
> bios0: Dell Inc. PowerEdge R620
> acpi0 at bios0: ACPI 3.0
> acpi0: sleep states S0 S4 S5
> acpi0: tables DSDT FACP APIC SPCR HPET DMAR MCFG WDAT SLIC ERST HEST
> BERT EINJ TCPA PC__ SRAT SSDT
> acpi0: wakeup devices PCI0(S5)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 4 (boot processor)
> cpu0: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.44 MHz, 06-3e-04
> cpu0:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu0: 256KB 64b/line 8-way L2 cache
> cpu0: smt 0, core 2, package 0
> mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
> cpu0: apic clock running at 99MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE
> cpu1 at mainbus0: apid 6 (application processor)
> cpu1: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.04 MHz, 06-3e-04
> cpu1:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu1: 256KB 64b/line 8-way L2 cache
> cpu1: smt 0, core 3, package 0
> cpu2 at mainbus0: apid 8 (application processor)
> cpu2: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04
> cpu2:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu2: 256KB 64b/line 8-way L2 cache
> cpu2: smt 0, core 4, package 0
> cpu3 at mainbus0: apid 16 (application processor)
> cpu3: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04
> cpu3:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu3: 256KB 64b/line 8-way L2 cache
> cpu3: smt 0, core 8, package 0
> cpu4 at mainbus0: apid 18 (application processor)
> cpu4: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04
> cpu4:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SE

Re: patch: new fix for vmctl create

2021-05-12 Thread Dave Voutila


James Cook writes:

> Hi tech@,
>
> The below patch removes calls to realpath(3) when looking up a qcow2
> base image. Previous thread:
> https://marc.info/?t=16156249642&r=1&w=2
>
> In short, the calls were failing inside vmctl, because of unveil. The
> other thread has alternative solutions but I think this is simplest.
>
> I included a regression test demonstrating the vmctl bug, in case
> there's interest. I tested vmd manually as described in the other
> thread.
>
> I also added a check in case dirname(3) fails --- I don't think it
> currently can, but better safe than sorry, I figure. (Noticed by Dave
> in the other thread.)
>
> - James
>
>

mlarkin@ and I got around to reviewing this today. After some
discussion, the decision was to simply remove the usage of unveil in the
vmctl disk creation/conversion commands. I just committed that change,
so it should be in cvs mirrors and snaps soon.

Thanks again for raising the issue and working with me through ideas. In
the end deletion looked like a better than modification :-) (We were
unveiling user supplied inputs...and then unveiling more user supplied
inputs in the form of the base image(s)...not the best usage of unveil.)

-dv



Re: patch: new fix for vmctl create

2021-05-12 Thread Mike Larkin
On Mon, Mar 15, 2021 at 08:21:56AM +, James Cook wrote:
> Hi tech@,
>
> The below patch removes calls to realpath(3) when looking up a qcow2
> base image. Previous thread:
> https://marc.info/?t=16156249642&r=1&w=2
>
> In short, the calls were failing inside vmctl, because of unveil. The
> other thread has alternative solutions but I think this is simplest.
>
> I included a regression test demonstrating the vmctl bug, in case
> there's interest. I tested vmd manually as described in the other
> thread.
>
> I also added a check in case dirname(3) fails --- I don't think it
> currently can, but better safe than sorry, I figure. (Noticed by Dave
> in the other thread.)
>
> - James
>

After looking at this a bit, we decided to remove the unveil parts around
the base images, since the realpath removal below would also affect vmd.

dv@ just committed that. Thanks for the diff and research!

>
> diff --git a/regress/usr.sbin/Makefile b/regress/usr.sbin/Makefile
> index 60e2178d3c7..146f9c9f322 100644
> --- a/regress/usr.sbin/Makefile
> +++ b/regress/usr.sbin/Makefile
> @@ -15,6 +15,7 @@ SUBDIR += rpki-client
>  SUBDIR += snmpd
>  SUBDIR += switchd
>  SUBDIR += syslogd
> +SUBDIR += vmctl
>
>  .if ${MACHINE} == "amd64" || ${MACHINE} == "i386"
>  SUBDIR += vmd
> diff --git a/regress/usr.sbin/vmctl/Makefile b/regress/usr.sbin/vmctl/Makefile
> new file mode 100644
> index 000..8fa87f0f6f0
> --- /dev/null
> +++ b/regress/usr.sbin/vmctl/Makefile
> @@ -0,0 +1,34 @@
> +# $OpenBSD$
> +
> +REGRESS_TARGETS = run-regress-convert-with-base-path
> +
> +run-regress-convert-with-base-path:
> + # non-relative base path
> + rm -f *.qcow2
> + vmctl create -s 1m base.qcow2
> + vmctl create -b ${PWD}/base.qcow2 source.qcow2
> + vmctl create -i source.qcow2 dest.qcow2
> +
> + # relative base path; two base images
> + rm -f *.qcow2
> + vmctl create -s 1m base0.qcow2
> + vmctl create -b base0.qcow2 base1.qcow2
> + vmctl create -b base1.qcow2 source.qcow2
> + vmctl create -i source.qcow2 dest.qcow2
> +
> + # copy from a different directory
> + rm -rf dir *.qcow2
> + vmctl create -s 1m base.qcow2
> + vmctl create -b base.qcow2 source.qcow2
> + mkdir dir
> + cd dir; vmctl create -i ../source.qcow2 dest.qcow2
> +
> + # base accessed through symlink
> + rm -rf dir sym *.qcow2
> + mkdir dir
> + cd dir; vmctl create -s 1m base.qcow2
> + cd dir; vmctl create -b base.qcow2 source.qcow2
> + ln -s dir sym
> + vmctl create -i sym/source.qcow2 dest.qcow2
> +
> +.include 
> diff --git a/usr.sbin/vmd/vioqcow2.c b/usr.sbin/vmd/vioqcow2.c
> index 34d0f116cc4..be8609f1644 100644
> --- a/usr.sbin/vmd/vioqcow2.c
> +++ b/usr.sbin/vmd/vioqcow2.c
> @@ -145,8 +145,8 @@ virtio_qcow2_init(struct virtio_backing *file, off_t 
> *szp, int *fd, size_t nfd)
>  ssize_t
>  virtio_qcow2_get_base(int fd, char *path, size_t npath, const char *dpath)
>  {
> + char pathbuf[PATH_MAX];
>   char dpathbuf[PATH_MAX];
> - char expanded[PATH_MAX];
>   struct qcheader header;
>   uint64_t backingoff;
>   uint32_t backingsz;
> @@ -180,27 +180,23 @@ virtio_qcow2_get_base(int fd, char *path, size_t npath, 
> const char *dpath)
>* rather than relative to the directory vmd happens to be running in,
>* since this is the only userful interpretation.
>*/
> - if (path[0] == '/') {
> - if (realpath(path, expanded) == NULL ||
> - strlcpy(path, expanded, npath) >= npath) {
> - log_warnx("unable to resolve %s", path);
> + if (path[0] != '/') {
> + if (strlcpy(pathbuf, path, sizeof(pathbuf)) >=
> + sizeof(pathbuf)) {
> + log_warnx("path too long: %s", path);
>   return -1;
>   }
> - } else {
>   if (strlcpy(dpathbuf, dpath, sizeof(dpathbuf)) >=
>   sizeof(dpathbuf)) {
>   log_warnx("path too long: %s", dpath);
>   return -1;
>   }
> - s = dirname(dpathbuf);
> - if (snprintf(expanded, sizeof(expanded),
> - "%s/%s", s, path) >= (int)sizeof(expanded)) {
> - log_warnx("path too long: %s/%s", s, path);
> + if ((s = dirname(dpathbuf)) == NULL) {
> + log_warn("dirname");
>   return -1;
>   }
> - if (npath < PATH_MAX ||
> - realpath(expanded, path) == NULL) {
> - log_warnx("unable to resolve %s", path);
> + if (snprintf(path, npath, "%s/%s", s, pathbuf) >= (int)npath) {
> + log_warnx("path too long: %s/%s", s, path);
>   return -1;
>   }
>   }
>



Re: setitimer(2): don't round up it_value

2021-05-12 Thread Paul de Weerd
Hi Scott,

Thanks for this work!  I see significant improvement with my test code
(see below; obviously focussing quite specificially on one thing).
Before your diff (snapshot from a few days ago; OpenBSD 6.9-current
(GENERIC.MP) #1: Mon May  3 11:04:25 MDT 2021) I got:

[weerd@pom] $ ./measure
Running for 100 loops

timeout  80us min/avg/max/std-dev: 800054/809727.250/810115/1609.611 us
timeout  40us min/avg/max/std-dev: 409668/41.406/410362/121.568 us
timeout  20us min/avg/max/std-dev: 209421/209997.953/210553/97.505 us
timeout  10us min/avg/max/std-dev: 109818/109997.078/110167/39.038 us
timeout   5us min/avg/max/std-dev: 59887/59995.539/60108/27.651 us
timeout   25000us min/avg/max/std-dev: 29648/29993.961/30316/50.629 us
timeout   12500us min/avg/max/std-dev: 19910/19994.100/20098/24.265 us
timeout6250us min/avg/max/std-dev: 19932/19994.020/20067/20.760 us
timeout3125us min/avg/max/std-dev: 19806/19993.980/20221/33.121 us
timeout1562us min/avg/max/std-dev: 19519/19993.971/20470/71.215 us
timeout 781us min/avg/max/std-dev: 19778/19993.980/20214/35.454 us
timeout 390us min/avg/max/std-dev: 19784/19994.029/20188/32.626 us
timeout 195us min/avg/max/std-dev: 19891/19994.230/20096/23.237 us
timeout  97us min/avg/max/std-dev: 19938/19994.170/20044/17.468 us
timeout  48us min/avg/max/std-dev: 19876/19994.010/20115/26.334 us
timeout  24us min/avg/max/std-dev: 19535/19994.199/20458/67.628 us
timeout  12us min/avg/max/std-dev: 19941/19994.240/20079/18.478 us
timeout   6us min/avg/max/std-dev: 19949/19994.150/20051/16.916 us
timeout   3us min/avg/max/std-dev: 19880/19994.221/20109/23.377 us
timeout   1us min/avg/max/std-dev: 19940/19994.289/20053/17.575 us

[I should add that these numbers are from an otherwise mostly idle
machine; they improved when my machine was busy building a new kernel]

After upgrading my local snap, cvs-up'ing, patching in your diff and
building + rebooting into a new kernel:

[weerd@pom] $ ./measure
Running for 100 loops

timeout  80us min/avg/max/std-dev: 800041/808010.312/810021/2992.553 us
timeout  40us min/avg/max/std-dev: 401639/402001.281/402373/-nan us
timeout  20us min/avg/max/std-dev: 200959/200997.922/201041/-nan us
timeout  10us min/avg/max/std-dev: 100456/100495.977/100533/37.185 us
timeout   5us min/avg/max/std-dev: 50141/50244.809/50351/30.089 us
timeout   25000us min/avg/max/std-dev: 28917/30144.590/31396/175.702 us
timeout   12500us min/avg/max/std-dev: 20048/20094.420/20143/10.583 us
timeout6250us min/avg/max/std-dev: 10016/10044.260/10084/9.683 us
timeout3125us min/avg/max/std-dev: 9974/10044.060/10140/14.810 us
timeout1562us min/avg/max/std-dev: 10016/10044.190/10078/6.827 us
timeout 781us min/avg/max/std-dev: 9964/10044.140/10143/14.516 us
timeout 390us min/avg/max/std-dev: 9578/10044.150/10507/66.234 us
timeout 195us min/avg/max/std-dev: 9981/10044.170/10122/16.570 us
timeout  97us min/avg/max/std-dev: 10010/10044.150/10117/10.721 us
timeout  48us min/avg/max/std-dev: 10004/10044.140/10083/10.331 us
timeout  24us min/avg/max/std-dev: 10016/10044.220/10078/6.487 us
timeout  12us min/avg/max/std-dev: 10014/10044.210/10079/6.802 us
timeout   6us min/avg/max/std-dev: 10013/10044.300/10085/9.871 us
timeout   3us min/avg/max/std-dev: 10007/10044.030/10090/8.477 us
timeout   1us min/avg/max/std-dev: 9980/10044.350/10135/13.301 us

(obviously some bug in the std-dev calculation there with the -nan)

Thanks again,

Paul

--- measure.c 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define LOOPCOUNT 100

volatile sig_atomic_t trigger;

void
sighdlr(int signum) {
trigger = signum;
}

int
main() {
int long long   count, d, sum, min, max, tsumsq;
longtimeout;
struct timeval  str, end;
struct itimervalnew_timer;
double  avg, dev;

new_timer.it_interval.tv_sec = 0;
new_timer.it_interval.tv_usec = 0;
new_timer.it_value.tv_sec = 0;

signal(SIGALRM, sighdlr);

printf("Running for %d loops\n\n", LOOPCOUNT);

for (timeout = 80; timeout != 0; timeout /= 2) {
new_timer.it_value.tv_usec = timeout;
min = 10;
max = 0;
sum = 0;
tsumsq = 0;
for (count = 0; count != LOOPCOUNT; count++) {
trigger = 0;
setitimer(ITIMER_REAL, &new_timer, NULL);
gettimeofday(&str, NULL);
while (trigger == 0)
pause();
gettimeofday(&end, NULL);
d = (end.tv_sec - str.tv_sec) * 100 + \
end.tv_usec - str.tv_usec;
   

setitimer(2): don't round up it_value

2021-05-12 Thread Scott Cheloha
Hi,

Paul de Weerd mentioned off-list that the initial expiration for an
ITIMER_REAL timer is always at least one tick.  I looked into it and
yes, this is the case, because the kernel rounds it_value up to one
tick if it is non-zero.

After thinking about it a bit I don't think we should do this
rounding.  At least, not for the initial expiration.

Rounding the it_interval member of an itimerval value up to one tick
makes sense: if we don't round it up we might spin for a long time in
realitexpire() and itimerdecr() when we reload the timer.

However there is no such risk with the it_value member.  We only use
it once and then it gets clobbered.

Currently the rounding is done in itimerfix(), which takes a timeval
pointer as argument.  Given that itimerfix() is used nowhere else in
the kernel I think the easiest thing to do here is to rewrite
itimerfix() to take an itimerval pointer as argument and have it do
all input validation and normalization for setitimer(2) in one go:

- Validate it_value, return EINVAL if not.

- Validate it_interval, return EINVAL if not.

- Clear it_interval if it_value is unset.

- Round it_interval up if necessary.

The 100 million second upper bound for it_value and it_interval is
arbitrary and will probably change in the future, so I have isolated
that check from the others.

While we're changing the itimerfix() prototype we may as well pull it
out of sys/time.h.  As I said before, it isn't used anywhere else.

OK?

Index: sys/time.h
===
RCS file: /cvs/src/sys/sys/time.h,v
retrieving revision 1.58
diff -u -p -r1.58 time.h
--- sys/time.h  13 Jan 2021 16:28:50 -  1.58
+++ sys/time.h  12 May 2021 17:06:30 -
@@ -307,7 +307,6 @@ struct proc;
 intclock_gettime(struct proc *, clockid_t, struct timespec *);
 
 void   cancel_all_itimers(void);
-intitimerfix(struct timeval *);
 intitimerdecr(struct itimerspec *, long);
 intsettime(const struct timespec *);
 intratecheck(struct timeval *, const struct timeval *);
Index: kern/kern_time.c
===
RCS file: /cvs/src/sys/kern/kern_time.c,v
retrieving revision 1.151
diff -u -p -r1.151 kern_time.c
--- kern/kern_time.c23 Dec 2020 20:45:02 -  1.151
+++ kern/kern_time.c12 May 2021 17:06:30 -
@@ -52,6 +52,8 @@
 
 #include 
 
+int itimerfix(struct itimerval *);
+
 /* 
  * Time of day and interval timer support.
  *
@@ -628,10 +630,9 @@ sys_setitimer(struct proc *p, void *v, r
error = copyin(SCARG(uap, itv), &aitv, sizeof(aitv));
if (error)
return error;
-   if (itimerfix(&aitv.it_value) || itimerfix(&aitv.it_interval))
-   return EINVAL;
-   if (!timerisset(&aitv.it_value))
-   timerclear(&aitv.it_interval);
+   error = itimerfix(&aitv);
+   if (error)
+   return error;
newitvp = &aitv;
}
if (SCARG(uap, oitv) != NULL) {
@@ -701,21 +702,34 @@ out:
 }
 
 /*
- * Check that a proposed value to load into the .it_value or
- * .it_interval part of an interval timer is acceptable.
+ * Check if the given setitimer(2) timer is valid.  Clear it_interval
+ * if it_value is unset.  Round it_interval up to the minimum interval
+ * if necessary.
  */
 int
-itimerfix(struct timeval *tv)
+itimerfix(struct itimerval *itv)
 {
+   struct timeval min_interval = { .tv_sec = 0, .tv_usec = tick };
 
-   if (tv->tv_sec < 0 || tv->tv_sec > 1 ||
-   tv->tv_usec < 0 || tv->tv_usec >= 100)
-   return (EINVAL);
+   if (itv->it_value.tv_sec < 0 || !timerisvalid(&itv->it_value))
+   return EINVAL;
+   if (itv->it_value.tv_sec > 1)
+   return EINVAL;
 
-   if (tv->tv_sec == 0 && tv->tv_usec != 0 && tv->tv_usec < tick)
-   tv->tv_usec = tick;
+   if (itv->it_interval.tv_sec < 0 || !timerisvalid(&itv->it_interval))
+   return EINVAL;
+   if (itv->it_interval.tv_sec > 1)
+   return EINVAL;
 
-   return (0);
+   if (!timerisset(&itv->it_value))
+   timerclear(&itv->it_interval);
+
+   if (timerisset(&itv->it_interval)) {
+   if (timercmp(&itv->it_interval, &min_interval, <))
+   itv->it_interval = min_interval;
+   }
+
+   return 0;
 }
 
 /*



bgpd strict community negotiation

2021-05-12 Thread Claudio Jeker
RFC5492 is fairly explicit when a capability should be enabled on a
session:

   A BGP speaker that supports a particular capability may use this
   capability with its peer after the speaker determines (as described
   above) that the peer supports this capability.  Simply put, a given
   capability can be used on a peering if that capability has been
   advertised by both peers.  If either peer has not advertised it, the
   capability cannot be used.

Adjust capa_neg_calc() to follow this strict model.
This affects route refersh and graceful restart. For graceful restart this
requires to flush the RIBs immediatly.

Also ignore and warn about RREFRESH messages that are received on a
session where route refesh is disabled.

-- 
:wq Claudio

Index: session.c
===
RCS file: /cvs/src/usr.sbin/bgpd/session.c,v
retrieving revision 1.414
diff -u -p -r1.414 session.c
--- session.c   6 May 2021 09:18:54 -   1.414
+++ session.c   12 May 2021 16:33:42 -
@@ -1636,7 +1636,7 @@ session_neighbor_rrefresh(struct peer *p
 {
u_int8_ti;
 
-   if (!p->capa.peer.refresh)
+   if (!p->capa.neg.refresh)
return (-1);
 
for (i = 0; i < AID_MAX; i++) {
@@ -2257,6 +2257,11 @@ parse_refresh(struct peer *peer)
return (0);
}
 
+   if (!peer->capa.neg.refresh) {
+   log_peer_warnx(&peer->conf, "peer sent unexpected refresh");
+   return (0);
+   }
+
if (imsg_rde(IMSG_REFRESH, peer->conf.id, &aid, sizeof(aid)) == -1)
return (-1);
 
@@ -2546,16 +2551,15 @@ capa_neg_calc(struct peer *p)
 {
u_int8_ti, hasmp = 0;
 
-   /* refresh: does not realy matter here, use peer setting */
-   p->capa.neg.refresh = p->capa.peer.refresh;
+   /* a capability is accepted only if both sides announced it */
 
-   /* as4byte: both side must announce capability */
-   if (p->capa.ann.as4byte && p->capa.peer.as4byte)
-   p->capa.neg.as4byte = 1;
-   else
-   p->capa.neg.as4byte = 0;
+   p->capa.neg.refresh =
+   (p->capa.ann.refresh && p->capa.peer.refresh) != 0;
+
+   p->capa.neg.as4byte =
+   (p->capa.ann.as4byte && p->capa.peer.as4byte) != 0;
 
-   /* MP: both side must announce capability */
+   /* MP: both side must agree on the AFI,SAFI pair */
for (i = 0; i < AID_MAX; i++) {
if (p->capa.ann.mp[i] && p->capa.peer.mp[i])
p->capa.neg.mp[i] = 1;
@@ -2569,18 +2573,21 @@ capa_neg_calc(struct peer *p)
p->capa.neg.mp[AID_INET] = 1;
 
/*
-* graceful restart: only the peer capabilities are of interest here.
+* graceful restart: the peer capabilities are of interest here.
 * It is necessary to compare the new values with the previous ones
 * and act acordingly. AFI/SAFI that are not part in the MP capability
 * are treated as not being present.
+* Also make sure that a flush happens if the session stopped
+* supporting graceful restart.
 */
 
for (i = 0; i < AID_MAX; i++) {
int8_t  negflags;
 
/* disable GR if the AFI/SAFI is not present */
-   if (p->capa.peer.grestart.flags[i] & CAPA_GR_PRESENT &&
-   p->capa.neg.mp[i] == 0)
+   if (p->capa.ann.grestart.restart == 0 ||
+   (p->capa.peer.grestart.flags[i] & CAPA_GR_PRESENT &&
+   p->capa.neg.mp[i] == 0))
p->capa.peer.grestart.flags[i] = 0; /* disable */
/* look at current GR state and decide what to do */
negflags = p->capa.neg.grestart.flags[i];
@@ -2600,6 +2607,8 @@ capa_neg_calc(struct peer *p)
}
p->capa.neg.grestart.timeout = p->capa.peer.grestart.timeout;
p->capa.neg.grestart.restart = p->capa.peer.grestart.restart;
+   if (p->capa.ann.grestart.restart == 0)
+   p->capa.neg.grestart.restart = 0;
 
return (0);
 }



Re: Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread Vitaliy Makkoveev



> On 12 May 2021, at 18:42, YASUOKA Masahiko  wrote:
> 
> On Wed, 12 May 2021 17:26:51 +0300
> Vitaliy Makkoveev  wrote:
>> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
>>> Radek reported a problem to misc@ that multiple Windows clients behind a NAT
>>> cannot use a L2TP/IPsec server simultaneously.
>>> 
>>> https://marc.info/?t=16099681611&r=1&w=2
>>> 
>>> There is two problems.  First is pipex(4) doesn't pass the proper
>>> ipsecflowinfo to ip_output().  Second is the IPsec policy check which is
>>> done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is not
>>> cached.  This happens when its flow is shared by another tdb (for another
>>> client of the same NAT).
>>> 
>>> The following 2 diffs fix these problem.
>>> 
>>> comment?
>>> ok?
>>> 
>> 
>> Hi.
>> 
>> I have two comments for the diff 1:
>> 
>> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
>>m_tag_get(9).
>> 2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
>>   pointed the place in your diff.
> 
> Good catch.  Thanks.
> 

m_freem(9) accepts NULL so this check before is redundant.

It seems to me that "Used by the IPv4 stack to specify the IPsec flow
of an output IP packet. The tag contains a u_int32_t identifying the
IPsec flow.” is enough. Anyway it’s better to ask jmc@.

Also I like to remove PACKET_TAG_PIPEX with separate diff.

The rest of this diff looks ok by me.

> 
> Let me update the diff.
> 
> Index: sys/net/pipex.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
> retrieving revision 1.132
> diff -u -p -r1.132 pipex.c
> --- sys/net/pipex.c   10 Mar 2021 10:21:48 -  1.132
> +++ sys/net/pipex.c   12 May 2021 15:33:33 -
> @@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
> #ifdef INET6
>   struct ip6_hdr *ip6;
> #endif
> + struct m_tag *mtag;
> 
>   hlen = sizeof(struct pipex_l2tp_header) +
>   ((pipex_session_is_l2tp_data_sequencing_on(session))
> @@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
>   ip->ip_tos = 0;
>   ip->ip_off = 0;
> 
> + if (session->proto.l2tp.ipsecflowinfo > 0) {
> + if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
> + sizeof(u_int32_t), M_NOWAIT)) == NULL)
> + goto drop;
> + *(u_int32_t *)(mtag + 1) =
> + session->proto.l2tp.ipsecflowinfo;
> + m_tag_prepend(m0, mtag);
> + }
> +
>   ip_send(m0);
>   break;
> #ifdef INET6
> @@ -1733,6 +1743,8 @@ pipex_l2tp_output(struct mbuf *m0, struc
> 
>   return;
> drop:
> + if (m0 != NULL)
> + m_freem(m0);
>   session->stat.oerrors++;
> }
> 
> Index: sys/netinet/ip_input.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
> retrieving revision 1.359
> diff -u -p -r1.359 ip_input.c
> --- sys/netinet/ip_input.c30 Apr 2021 13:52:48 -  1.359
> +++ sys/netinet/ip_input.c12 May 2021 15:31:52 -
> @@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
>   struct mbuf_queue *mq = xmq;
>   struct mbuf *m;
>   struct mbuf_list ml;
> + struct m_tag *mtag;
> + u_int32_t ipsecflowinfo = 0;
> 
>   mq_delist(mq, &ml);
>   if (ml_empty(&ml))
> @@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
> 
>   NET_LOCK();
>   while ((m = ml_dequeue(&ml)) != NULL) {
> - ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
> + if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
> + != NULL) {
> + ipsecflowinfo = *(u_int32_t *)(mtag + 1);
> + m_tag_delete(m, mtag);
> + }
> + ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
>   }
>   NET_UNLOCK();
> }
> Index: sys/sys/mbuf.h
> ===
> RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
> retrieving revision 1.252
> diff -u -p -r1.252 mbuf.h
> --- sys/sys/mbuf.h25 Feb 2021 02:43:31 -  1.252
> +++ sys/sys/mbuf.h12 May 2021 15:31:52 -
> @@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
> /* Packet tag types */
> #define PACKET_TAG_IPSEC_IN_DONE  0x0001  /* IPsec applied, in */
> #define PACKET_TAG_IPSEC_OUT_DONE 0x0002  /* IPsec applied, out */
> +#define PACKET_TAG_IPSEC_FLOWINFO0x0004  /* IPsec flowinfo */
> #define PACKET_TAG_WIREGUARD  0x0040  /* WireGuard data */
> #define PACKET_TAG_GRE0x0080  /* GRE processing done 
> */
> #define PACKET_TAG_DLT0x0100 /* data link layer type 
> */
> @@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
> #define PACKET_TAG_CARP_BAL_IP0

Re: Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread YASUOKA Masahiko
On Wed, 12 May 2021 17:26:51 +0300
Vitaliy Makkoveev  wrote:
> On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
>> Radek reported a problem to misc@ that multiple Windows clients behind a NAT
>> cannot use a L2TP/IPsec server simultaneously.
>> 
>> https://marc.info/?t=16099681611&r=1&w=2
>> 
>> There is two problems.  First is pipex(4) doesn't pass the proper
>> ipsecflowinfo to ip_output().  Second is the IPsec policy check which is
>> done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is not
>> cached.  This happens when its flow is shared by another tdb (for another
>> client of the same NAT).
>> 
>> The following 2 diffs fix these problem.
>> 
>> comment?
>> ok?
>> 
> 
> Hi.
> 
> I have two comments for the diff 1:
> 
> 1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
> m_tag_get(9).
> 2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
>pointed the place in your diff.

Good catch.  Thanks.


Let me update the diff.

Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c 10 Mar 2021 10:21:48 -  1.132
+++ sys/net/pipex.c 12 May 2021 15:33:33 -
@@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 #ifdef INET6
struct ip6_hdr *ip6;
 #endif
+   struct m_tag *mtag;
 
hlen = sizeof(struct pipex_l2tp_header) +
((pipex_session_is_l2tp_data_sequencing_on(session))
@@ -1704,6 +1705,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
ip->ip_tos = 0;
ip->ip_off = 0;
 
+   if (session->proto.l2tp.ipsecflowinfo > 0) {
+   if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
+   sizeof(u_int32_t), M_NOWAIT)) == NULL)
+   goto drop;
+   *(u_int32_t *)(mtag + 1) =
+   session->proto.l2tp.ipsecflowinfo;
+   m_tag_prepend(m0, mtag);
+   }
+
ip_send(m0);
break;
 #ifdef INET6
@@ -1733,6 +1743,8 @@ pipex_l2tp_output(struct mbuf *m0, struc
 
return;
 drop:
+   if (m0 != NULL)
+   m_freem(m0);
session->stat.oerrors++;
 }
 
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c  30 Apr 2021 13:52:48 -  1.359
+++ sys/netinet/ip_input.c  12 May 2021 15:31:52 -
@@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
struct mbuf_queue *mq = xmq;
struct mbuf *m;
struct mbuf_list ml;
+   struct m_tag *mtag;
+   u_int32_t ipsecflowinfo = 0;
 
mq_delist(mq, &ml);
if (ml_empty(&ml))
@@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags
 
NET_LOCK();
while ((m = ml_dequeue(&ml)) != NULL) {
-   ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
+   if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, NULL))
+   != NULL) {
+   ipsecflowinfo = *(u_int32_t *)(mtag + 1);
+   m_tag_delete(m, mtag);
+   }
+   ip_output(m, NULL, NULL, flags, NULL, NULL, ipsecflowinfo);
}
NET_UNLOCK();
 }
Index: sys/sys/mbuf.h
===
RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
retrieving revision 1.252
diff -u -p -r1.252 mbuf.h
--- sys/sys/mbuf.h  25 Feb 2021 02:43:31 -  1.252
+++ sys/sys/mbuf.h  12 May 2021 15:31:52 -
@@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
 /* Packet tag types */
 #define PACKET_TAG_IPSEC_IN_DONE   0x0001  /* IPsec applied, in */
 #define PACKET_TAG_IPSEC_OUT_DONE  0x0002  /* IPsec applied, out */
+#define PACKET_TAG_IPSEC_FLOWINFO  0x0004  /* IPsec flowinfo */
 #define PACKET_TAG_WIREGUARD   0x0040  /* WireGuard data */
 #define PACKET_TAG_GRE 0x0080  /* GRE processing done */
 #define PACKET_TAG_DLT 0x0100 /* data link layer type */
@@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *, 
 #define PACKET_TAG_CARP_BAL_IP 0x4000  /* carp(4) ip balanced marker */
 
 #define MTAG_BITS \
-("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_IN_CRYPTO_DONE" \
+("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_FLOWINFO" \
 "\4IPSEC_OUT_CRYPTO_NEEDED\5IPSEC_PENDING_TDB\6BRIDGE\7WG\10GRE\11DLT" \
 "\12PF_DIVERT\14PF_REASSEMBLED\15SRCROUTE\16TUNNEL\17CARP_BAL_IP")
 
Index: share/man/man9/mbuf_tags.9
===
RCS file: /disk/cvs/openbsd/src/share/man/man9/mbuf_tags.9,v
retrieving revision 1.41
diff -u -p -r

Re: Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread Vitaliy Makkoveev
On Wed, May 12, 2021 at 07:11:09PM +0900, YASUOKA Masahiko wrote:
> Hi,
> 
> Radek reported a problem to misc@ that multiple Windows clients behind a NAT
> cannot use a L2TP/IPsec server simultaneously.
> 
> https://marc.info/?t=16099681611&r=1&w=2
> 
> There is two problems.  First is pipex(4) doesn't pass the proper
> ipsecflowinfo to ip_output().  Second is the IPsec policy check which is
> done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is not
> cached.  This happens when its flow is shared by another tdb (for another
> client of the same NAT).
> 
> The following 2 diffs fix these problem.
> 
> comment?
> ok?
> 

Hi.

I have two comments for the diff 1:

1. You should add PACKET_TAG_IPSEC_FLOWINFO description to
m_tag_get(9).
2. You introduced mbuf(9) leak in pipex_l2tp_output() error path. I
   pointed the place in your diff.

I'll see diff 2 later.

> diff #1
> 
> Fix IPsec NAT-T work with pipex.
> 
> Index: sys/net/pipex.c
> ===
> RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
> retrieving revision 1.132
> diff -u -p -r1.132 pipex.c
> --- sys/net/pipex.c   10 Mar 2021 10:21:48 -  1.132
> +++ sys/net/pipex.c   12 May 2021 09:38:32 -
> @@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
>  #ifdef INET6
>   struct ip6_hdr *ip6;
>  #endif
> + struct m_tag *mtag;
> 
>   hlen = sizeof(struct pipex_l2tp_header) +
>   ((pipex_session_is_l2tp_data_sequencing_on(session))
> @@ -1703,6 +1704,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
>   ip->ip_ttl = MAXTTL;
>   ip->ip_tos = 0;
>   ip->ip_off = 0;
> +
> + if (session->proto.l2tp.ipsecflowinfo > 0) {
> + if ((mtag = m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,
> + sizeof(u_int32_t), M_NOWAIT)) == NULL)
> + goto drop;

mbuf(9) will leak here.

> + *(u_int32_t *)(mtag + 1) =
> + session->proto.l2tp.ipsecflowinfo;
> + m_tag_prepend(m0, mtag);
> + }
> 
>   ip_send(m0);
>   break;



Re: [Diff] Implement multiple device cloning for hotplug

2021-05-12 Thread Claudio Jeker
On Wed, May 12, 2021 at 10:03:13AM -0400, Ashton Fagg wrote:
> joshua stein  writes:
> 
> > I'm glad I could inspire you to repost the work I already did years 
> > ago.
> 
> I'm not sure if you're being sarcastic.
> 
> > But either way, if a driver is causing a panic because it responds 
> > before it is ready, the same thing would happen without hotplug if 
> > it was communicated with at the wrong time (like some script running 
> > in a loop).  Those drivers should just not respond to ioctls or 
> > whatever they are doing before they are ready rather than just 
> > hoping that no one sees their attachment too early.
> 
> So the issue I have with iscsid is actually different to this because
> it's entirely in userland.
> 
> To fix it in a robust way, you need to know when a device has attached
> and if it's ready or not, but there's extra requirements for iscsid that
> would require you to match those /dev/sd* devices back to the configured
> targets.
> 
> At a minimum, we would need to figure out (after device attachment):
> 
> a) which scsibus it has shown up on 
> 
> b) if the scsibus it has appeared on is indeed the one you'd expect it
> to be on (/dev/vscsi0 will have the same scsibus as devices initiated
> from iscsid, as far as I can tell)
> 
> c) does the target/lun info match an initiated target with a
> successful session established.
> 
> d) then indicate whether the disk device is actually ready
> 
> There's no panic in this case, the "iscsictl reload" call in the init
> script returns before the devices are up and ready since we're only
> looking at connection completion right now. Only returning once we know
> there's a ready device corresponding to every connected target we expect
> to have one avoids the problem where fsck tries to do it's thing on
> devices that aren't ready.
> 
> Matching back to the iscsi targets avoids some brittleness and edge
> cases that could be encountered if we were just looking for generic scsi
> disk attachments/readiness.
> 
> c) is fairly easy with the diff that I submitted (and was merged) a
> while ago, since there is book-keeping code inside iscsid to work
> this out that would be very easy to extend.
> 
> So the hotplug idea would solve part of this, but not all of it. There's
> still parts of this that are not easy and aren't clear to me how to
> fully resolve.

For iscsid I think the better approach would be to extend vscsi(4) to
allow the system to know when a device was attached.

Now this does not solve all problems with iscsid since iscsid has no
knowledge about which disks are presented by the initator and I don't
think iscsid should know about this either.

-- 
:wq Claudio



Re: [Diff] Implement multiple device cloning for hotplug

2021-05-12 Thread Ashton Fagg
joshua stein  writes:

> I'm glad I could inspire you to repost the work I already did years 
> ago.

I'm not sure if you're being sarcastic.

> But either way, if a driver is causing a panic because it responds 
> before it is ready, the same thing would happen without hotplug if 
> it was communicated with at the wrong time (like some script running 
> in a loop).  Those drivers should just not respond to ioctls or 
> whatever they are doing before they are ready rather than just 
> hoping that no one sees their attachment too early.

So the issue I have with iscsid is actually different to this because
it's entirely in userland.

To fix it in a robust way, you need to know when a device has attached
and if it's ready or not, but there's extra requirements for iscsid that
would require you to match those /dev/sd* devices back to the configured
targets.

At a minimum, we would need to figure out (after device attachment):

a) which scsibus it has shown up on 

b) if the scsibus it has appeared on is indeed the one you'd expect it
to be on (/dev/vscsi0 will have the same scsibus as devices initiated
from iscsid, as far as I can tell)

c) does the target/lun info match an initiated target with a
successful session established.

d) then indicate whether the disk device is actually ready

There's no panic in this case, the "iscsictl reload" call in the init
script returns before the devices are up and ready since we're only
looking at connection completion right now. Only returning once we know
there's a ready device corresponding to every connected target we expect
to have one avoids the problem where fsck tries to do it's thing on
devices that aren't ready.

Matching back to the iscsi targets avoids some brittleness and edge
cases that could be encountered if we were just looking for generic scsi
disk attachments/readiness.

c) is fairly easy with the diff that I submitted (and was merged) a
while ago, since there is book-keeping code inside iscsid to work
this out that would be very easy to extend.

So the hotplug idea would solve part of this, but not all of it. There's
still parts of this that are not easy and aren't clear to me how to
fully resolve.



hvn(4): don't input mbufs if interface is not running

2021-05-12 Thread Patrick Wildt
Hi,

when hvn(4) attaches it sends commands and waits for replies to come
back in, hence the interrupt function is being polled.  Unfortunately
it seems that the 'receive pipe' has both command completion and data
packets.  As it turns out, while hvn(4) is just setting up the pipes,
it can already receive packets, which I have seen happening on Hyper-V.

This essentially means that if_input() is being called *before* the
card is set up (or UP).  This seems wrong.  Apparently on drivers like
em(4) we only read packets if IFF_RUNNING is set.  I think in the case
of hvn(4), we should drop packets unless IFF_RUNNING is set.

Opinions?

Patrick

diff --git a/sys/dev/pv/if_hvn.c b/sys/dev/pv/if_hvn.c
index f12e2f935ca..4306f717baf 100644
--- a/sys/dev/pv/if_hvn.c
+++ b/sys/dev/pv/if_hvn.c
@@ -1470,7 +1470,10 @@ hvn_rndis_input(struct hvn_softc *sc, uint64_t tid, void 
*arg)
}
hvn_nvs_ack(sc, tid);
 
-   if_input(ifp, &ml);
+   if (ifp->if_flags & IFF_RUNNING)
+   if_input(ifp, &ml);
+   else
+   ml_purge(&ml);
 }
 
 static inline struct mbuf *



Re: [Diff] Implement multiple device cloning for hotplug

2021-05-12 Thread joshua stein
On Tue, 11 May 2021 at 22:37:55 -0400, Ashton Fagg wrote:
> Inspiration for this diff was drawn from Joshua Stein [1], seeminly he
> had a use-case that was somewhat similar to mine at one point but it
> doesn't look like this was ever committed. Nor can I find any discussion
> on it.

I'm glad I could inspire you to repost the work I already did years 
ago.


On Tue, 11 May 2021 at 20:53:17 -0600, Theo de Raadt wrote:
> Unfortunately there is no instrumentation in drivers or subsystems to
> capture "ready to do work", and create the events at that point instead.
> hotplug is hooked into subr_autoconf.c, this code is for handling the
> device heirarchy, but it is not involved in operation, and thus unaware
> of "readiness".  Fixing this would require a subsystem by subsystem study,
> figuring out where the glue should exist, and creating the events at
> the CORRECT time.

When this same objection was raised in 2018 when I proposed this, I 
said that I looked through the bugs archives and could only find 
references to hotplug-related issues 8 years prior, and they were 
considered solved:

https://marc.info/?l=openbsd-bugs&m=127990199813627&w=2

I also proposed this diff, which individual drivers could use to 
defer hotplug notification from their attach routine if they needed 
to do extra work before becoming "ready".  No one spoke up of any 
drivers needing such functionality, and all of this work was ignored 
since you and Mark wanted the Xorg input device notifications to 
come through wscons instead of hotplug.

But either way, if a driver is causing a panic because it responds 
before it is ready, the same thing would happen without hotplug if 
it was communicated with at the wrong time (like some script running 
in a loop).  Those drivers should just not respond to ioctls or 
whatever they are doing before they are ready rather than just 
hoping that no one sees their attachment too early.


diff --git sys/dev/hotplug.c sys/dev/hotplug.c
index e1e7bad95c9..434c43f4135 100644
--- sys/dev/hotplug.c
+++ sys/dev/hotplug.c
@@ -73,6 +73,12 @@ hotplug_device_attach(enum devclass class, char *name)
hotplug_put_event(&he);
 }
 
+void
+hotplug_defer_attach(struct device *dev)
+{
+   dev->dv_flags |= DVF_HOTPLUG_DEFER;
+}
+
 void
 hotplug_device_detach(enum devclass class, char *name)
 {
diff --git sys/kern/subr_autoconf.c sys/kern/subr_autoconf.c
index 13fafc6994e..e09f11486c0 100644
--- sys/kern/subr_autoconf.c
+++ sys/kern/subr_autoconf.c
@@ -403,7 +403,7 @@ config_attach(struct device *parent, void *match, void 
*aux, cfprint_t print)
(*ca->ca_attach)(parent, dev, aux);
config_process_deferred_children(dev);
 #if NHOTPLUG > 0
-   if (!cold)
+   if (!cold && !(dev->dv_flags & DVF_HOTPLUG_DEFER))
hotplug_device_attach(cd->cd_class, dev->dv_xname);
 #endif
 
diff --git sys/sys/device.h sys/sys/device.h
index 90827f53503..b73b248d040 100644
--- sys/sys/device.h
+++ sys/sys/device.h
@@ -81,7 +81,8 @@ struct device {
 };
 
 /* dv_flags */
-#defineDVF_ACTIVE  0x0001  /* device is activated */
+#defineDVF_ACTIVE  0x0001  /* device is activated 
*/
+#defineDVF_HOTPLUG_DEFER   0x0002  /* defer hotplug 
announce */
 
 TAILQ_HEAD(devicelist, device);
 
diff --git sys/sys/hotplug.h sys/sys/hotplug.h
index f2da3f6cfd8..7e68482a428 100644
--- sys/sys/hotplug.h
+++ sys/sys/hotplug.h
@@ -35,6 +35,7 @@ struct hotplug_event {
 #ifdef _KERNEL
 void   hotplug_device_attach(enum devclass, char *);
 void   hotplug_device_detach(enum devclass, char *);
+void   hotplug_defer_attach(struct device *);
 #endif
 
 #endif /* _SYS_HOTPLUG_H_ */



Re: nvme: add timeout to nvme_poll() loop

2021-05-12 Thread Ashton Fagg
Ping?

Ashton Fagg  writes:

> I noticed this when looking through the nvme.c code the other day.
>
> Currently, nvme_poll() has a loop like this:
>
>   while (!ISSET(state.c.flags, htole16(NVME_CQE_PHASE))) {
>   if (nvme_q_complete(sc, q) == 0)
>   delay(10);
>
> /* XXX no timeout? */ 
>   }
>
> The comment is indicative - there probably should be a timeout here in
> case things are terribly wrong and the queue never gets fully processed.
>
> NetBSD appears to have a similar construct in their version of the
> driver:
>
> https://github.com/NetBSD/src/blob/trunk/sys/dev/ic/nvme.c
>
> The following diff adds a counter to fail the poll after 10 seconds of
> spinning. I've been torturing my test machine (which has an nvme boot
> drive) with this applied - no issues so far.
>
> AFAICT, returning anything non-zero should be ok as it looks like
> nothing specifically checks the return code for anything other being
> zero. But, I may be wrong and perhaps one of the various NVME_CQE_*
> masks is more suitable.
>
> Thanks.

diff --git a/sys/dev/ic/nvme.c b/sys/dev/ic/nvme.c
index 9a79c8b1bfe..4e112f482ee 100644
--- a/sys/dev/ic/nvme.c
+++ b/sys/dev/ic/nvme.c
@@ -117,6 +117,11 @@ void	nvme_scsi_inquiry(struct scsi_xfer *);
 void	nvme_scsi_capacity16(struct scsi_xfer *);
 void	nvme_scsi_capacity(struct scsi_xfer *);
 
+/* Here, we define a delay of 10 microseconds between poll attempts.
+ * And, we allow 1e6 attempts. This corresponds to a ~10 second wait. */
+#define NVME_POLL_DELAY_USECS 10
+#define NVME_POLL_MAX_ATTEMPTS 1E6
+
 #define nvme_read4(_s, _r) \
 	bus_space_read_4((_s)->sc_iot, (_s)->sc_ioh, (_r))
 #define nvme_write4(_s, _r, _v) \
@@ -918,6 +923,7 @@ nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb,
 	void (*done)(struct nvme_softc *, struct nvme_ccb *, struct nvme_cqe *);
 	void *cookie;
 	u_int16_t flags;
+	int tries = 0;
 
 	memset(&state, 0, sizeof(state));
 	(*fill)(sc, ccb, &state.s);
@@ -931,9 +937,13 @@ nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb,
 	nvme_q_submit(sc, q, ccb, nvme_poll_fill);
 	while (!ISSET(state.c.flags, htole16(NVME_CQE_PHASE))) {
 		if (nvme_q_complete(sc, q) == 0)
-			delay(10);
+			delay(NVME_POLL_DELAY_USECS);
 
-		/* XXX no timeout? */
+		if (++tries >= NVME_POLL_MAX_ATTEMPTS) {
+			/* Something appears to be quite wrong... */
+			printf("%s: poll timeout.\n", DEVNAME(sc));
+			return (1);
+		}
 	}
 
 	ccb->ccb_cookie = cookie;


Re: [Diff] Implement multiple device cloning for hotplug

2021-05-12 Thread Ashton Fagg
"Theo de Raadt"  writes:

> We don't need a new subsystem.
>
> We just the operation changed, so it sends the event when the device is ready,
> rather than before it is ready.

Ok fair enough. More curiosity than anything on my part.

I need to find a way to make this information available for iscsid
irrespective of the broader issue. I think it might be not "as simple"
as just waiting for an attach and subsequent ready event to show up for
the iscsi use case. I have some ideas, I'll come back with a diff rather
than waste everyone's time on this. But happy to discuss privately if
anyone is interested.



Fix IPsec NAT-T for L2TP/IPsec

2021-05-12 Thread YASUOKA Masahiko

Hi,

Radek reported a problem to misc@ that multiple Windows clients 
behind a NAT cannot use a L2TP/IPsec server simultaneously.


https://marc.info/?t=16099681611&r=1&w=2

There is two problems.  First is pipex(4) doesn't pass the proper 
ipsecflowinfo to ip_output().  Second is the IPsec policy check which 
is done by ipsp_spd_lookup() returns -1 (EINVAL) if the given tdb is 
not cached.  This happens when its flow is shared by another tdb (for 
another client of the same NAT).


The following 2 diffs fix these problem.

comment?
ok?

diff #1

Fix IPsec NAT-T work with pipex.

Index: sys/net/pipex.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pipex.c,v
retrieving revision 1.132
diff -u -p -r1.132 pipex.c
--- sys/net/pipex.c 10 Mar 2021 10:21:48 -  1.132
+++ sys/net/pipex.c 12 May 2021 09:38:32 -
@@ -1628,6 +1628,7 @@ pipex_l2tp_output(struct mbuf *m0, struc
 #ifdef INET6
struct ip6_hdr *ip6;
 #endif
+   struct m_tag *mtag;

hlen = sizeof(struct pipex_l2tp_header) +
((pipex_session_is_l2tp_data_sequencing_on(session))
@@ -1703,6 +1704,15 @@ pipex_l2tp_output(struct mbuf *m0, struc
ip->ip_ttl = MAXTTL;
ip->ip_tos = 0;
ip->ip_off = 0;
+
+   if (session->proto.l2tp.ipsecflowinfo > 0) {
+			if ((mtag = 
m_tag_get(PACKET_TAG_IPSEC_FLOWINFO,

+   sizeof(u_int32_t), M_NOWAIT)) == NULL)
+   goto drop;
+   *(u_int32_t *)(mtag + 1) =
+   session->proto.l2tp.ipsecflowinfo;
+   m_tag_prepend(m0, mtag);
+   }

ip_send(m0);
break;
Index: sys/netinet/ip_input.c
===
RCS file: /disk/cvs/openbsd/src/sys/netinet/ip_input.c,v
retrieving revision 1.359
diff -u -p -r1.359 ip_input.c
--- sys/netinet/ip_input.c  30 Apr 2021 13:52:48 -  1.359
+++ sys/netinet/ip_input.c  12 May 2021 09:38:32 -
@@ -1790,6 +1790,8 @@ ip_send_do_dispatch(void *xmq, int flags
struct mbuf_queue *mq = xmq;
struct mbuf *m;
struct mbuf_list ml;
+   struct m_tag *mtag;
+   u_int32_t ipsecflowinfo = 0;

mq_delist(mq, &ml);
if (ml_empty(&ml))
@@ -1797,7 +1799,12 @@ ip_send_do_dispatch(void *xmq, int flags

NET_LOCK();
while ((m = ml_dequeue(&ml)) != NULL) {
-   ip_output(m, NULL, NULL, flags, NULL, NULL, 0);
+		if ((mtag = m_tag_find(m, PACKET_TAG_IPSEC_FLOWINFO, 
NULL))

+   != NULL) {
+   ipsecflowinfo = *(u_int32_t *)(mtag + 1);
+   m_tag_delete(m, mtag);
+   }
+		ip_output(m, NULL, NULL, flags, NULL, NULL, 
ipsecflowinfo);

}
NET_UNLOCK();
 }
Index: sys/sys/mbuf.h
===
RCS file: /disk/cvs/openbsd/src/sys/sys/mbuf.h,v
retrieving revision 1.252
diff -u -p -r1.252 mbuf.h
--- sys/sys/mbuf.h  25 Feb 2021 02:43:31 -  1.252
+++ sys/sys/mbuf.h  12 May 2021 09:38:32 -
@@ -469,6 +469,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 /* Packet tag types */
 #define PACKET_TAG_IPSEC_IN_DONE	0x0001  /* IPsec applied, in 
*/
 #define PACKET_TAG_IPSEC_OUT_DONE	0x0002  /* IPsec applied, out 
*/

+#define PACKET_TAG_IPSEC_FLOWINFO  0x0004  /* IPsec flowinfo */
 #define PACKET_TAG_WIREGUARD   0x0040  /* WireGuard data */
 #define PACKET_TAG_GRE			0x0080  /* GRE 
processing done */
 #define PACKET_TAG_DLT			0x0100 /* data link 
layer type */

@@ -479,7 +480,7 @@ struct m_tag *m_tag_next(struct mbuf *,
 #define PACKET_TAG_CARP_BAL_IP		0x4000  /* carp(4) ip 
balanced marker */


 #define MTAG_BITS \
-("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_IN_CRYPTO_DONE" \
+("\20\1IPSEC_IN_DONE\2IPSEC_OUT_DONE\3IPSEC_FLOWINFO" \
 
"\4IPSEC_OUT_CRYPTO_NEEDED\5IPSEC_PENDING_TDB\6BRIDGE\7WG\10GRE\11DLT" 
\
 
"\12PF_DIVERT\14PF_REASSEMBLED\15SRCROUTE\16TUNNEL\17CARP_BAL_IP")





diff #2

Make the IPsec flow can have multiple `ipsec_ids' so that
ipsp_spd_lookup() can check whether the `ipsec_ids` of the given tdb 
is

belonged with a flow shared by mutlple clients behind a NAT.

Index: sys/net/pfkeyv2.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/pfkeyv2.c,v
retrieving revision 1.211
diff -u -p -r1.211 pfkeyv2.c
--- sys/net/pfkeyv2.c   4 May 2021 09:28:04 -   1.211
+++ sys/net/pfkeyv2.c   12 May 2021 10:07:11 -
@@ -1106,6 +1106,7 @@ pfkeyv2_send(struct socket *so, void *me
int i, j, rval = 0, mode = PFKEYV2_SENDMESSAGE_BROADCAST;
int delflag = 0;
struct sockaddr_encap encapdst, encapnetmask;
+   struct ipsec_ids *ids, *ids0;
struct ipsec_policy *ipo;
struct ipsec_acquire *ipa;
stru

Re: running network stack forwarding in parallel

2021-05-12 Thread Hrvoje Popovski
On 21.4.2021. 21:36, Alexander Bluhm wrote:
> We need more MP preassure to find such bugs and races.  I think now
> is a good time to give this diff broader testing and commit it.
> You need interfaces with multiple queues to see a difference.

Hi,

while forwarding ip4 traffic over box with parallel diff and aggr
interfaces, and then aggr is destroyed, i'm getting witness log below...

i can't reproduce this log without parallel diff and i'm getting this
log only first time destroying aggr interface ..



witness: lock order reversal:
 1st 0x82139de0 netlock (netlock)
 2nd 0x82120bb8 timeout (timeout)
lock order "timeout"(rwlock) -> "netlock"(rwlock) first seen at:
#0  rw_enter_write+0x43
#1  mld6_fasttimeo+0x14
#2  pffasttimo+0x67
#3  timeout_run+0x93
#4  softclock_thread+0x11d
#5  proc_trampoline+0x1c
lock order "netlock"(rwlock) -> "timeout"(rwlock) first seen at:
#0  timeout_del_barrier+0x41
#1  aggr_p_dtor+0x17b
#2  aggr_clone_destroy+0x91
#3  if_clone_destroy+0xd8
#4  ifioctl+0x1d2
#5  soo_ioctl+0x171
#6  sys_ioctl+0x2c4
#7  syscall+0x3b9
#8  Xsyscall+0x128


r620-1# cat /etc/hostname.aggr0
trunkport ix0
lladdr ec:f4:bb:da:f7:f8
inet 192.168.42.1 255.255.255.0
!route add 16/8 192.168.42.11
up

r620-1# cat /etc/hostname.aggr1
trunkport ix1
inet 192.168.43.1 255.255.255.0
!route add 48/8 192.168.43.11
up



OpenBSD 6.9-current (GENERIC.MP) #166: Wed May 12 10:18:11 CEST 2021
hrv...@r620-1.srce.hr:/sys/arch/amd64/compile/GENERIC.MP
real mem = 17115840512 (16322MB)
avail mem = 16450007040 (15687MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xcf42c000 (99 entries)
bios0: vendor Dell Inc. version "2.9.0" date 12/06/2019
bios0: Dell Inc. PowerEdge R620
acpi0 at bios0: ACPI 3.0
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP APIC SPCR HPET DMAR MCFG WDAT SLIC ERST HEST
BERT EINJ TCPA PC__ SRAT SSDT
acpi0: wakeup devices PCI0(S5)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 4 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.44 MHz, 06-3e-04
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 2, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE
cpu1 at mainbus0: apid 6 (application processor)
cpu1: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.04 MHz, 06-3e-04
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 3, package 0
cpu2 at mainbus0: apid 8 (application processor)
cpu2: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04
cpu2:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 4, package 0
cpu3 at mainbus0: apid 16 (application processor)
cpu3: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04
cpu3:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 0, core 8, package 0
cpu4 at mainbus0: apid 18 (application processor)
cpu4: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.03 MHz, 06-3e-04
cpu4:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SEN