Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-03-05 Thread Theo de Raadt
Claudio Jeker  wrote:

> Thanks for the details. I suggest to commit this so we can get some tests
> quickly and hopefully it helps to fix some of the wg(4) crashes people
> hit. According to your other mail there are some additional races that
> require fixing.

quickly means now, and we have about 2 weeks to pull it out (it has a
.h file diff in it, but userland does not really look at it)



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-03-05 Thread Claudio Jeker
On Tue, Mar 05, 2024 at 12:16:37PM +0300, Vitaliy Makkoveev wrote:
> On Mon, Mar 04, 2024 at 01:00:32PM +0100, Claudio Jeker wrote:
> > On Wed, Feb 28, 2024 at 06:03:14PM +0300, Vitaliy Makkoveev wrote:
> > > On Wed, Feb 28, 2024 at 02:45:44PM +0100, Martin Pieuchot wrote:
> > > > > > > 
> > > > > > > Not only wg(4). Depends on interface queue usage, ifq_start() 
> > > > > > > schedules
> > > > > > > (*if_qstart)() or calls it, so all the interfaces with use 
> > > > > > > rwlock(9) in
> > > > > > > (*if_qstart)() handler are in risk.
> > > > > > > 
> > > > > > > What about to always schedule (*if_qstart)()?
> > > > > > 
> > > > > > Why would you want to introduce additional latence?
> > > > > > 
> > > > > 
> > > > > I suppose it the less evil than strictly deny rwlocks in 
> > > > > (*if_qstart)().
> > > > > Anyway it will be scheduled unless `seq_len' exceeds the watermark. 
> > > > 
> > > > Please no.  This is not going to happen.  wg(4) has to be fixed.  Let's
> > > > not change the design of the kernel every time a bug is found.
> > > > 
> > > 
> > > I'm not the fan of ifq_start() behaviour.
> > > 
> > > wg(4) needs to convert `t_lock', `r_keypair_lock' and `c_lock' rwlocks
> > > to mutexes. I used mtx_init_flags() to keep existing names.
> > 
> > You do not mention if you verified that any of those codepath may sleep or
> > not.
> 
> I use wg(4), so, obviously, I verified this.
> 
> wg_qstart() calls the following routines: noise_remote_ready(),
> wg_tag_get(), wg_timers_event_want_initiation() and wg_queue_out().
> 
> noise_remote_ready() has the sleep point produced by `r_keypair_lock'
> rwlock. We also acquire `c_lock' rwlock with `r_keypair_lock' rwlock
> held. I changed them both to `r_keypair_mtx' and `c_mtx' mutexes.
> 
> `c_lock' serialized paths noise_counter_send() and noise_counter_recv()
> have no sleep points.
> 
> `r_keypair_lock' serialized paths noise_remote_begin_session(),
> noise_remote_clear(), noise_remote_expire_current(),
> noise_remote_ready(), noise_remote_encrypt() and  noise_remote_decrypt()
> have no sleep points. The suspicious noise_remote_keypair_allocate()
> only performs operations around list.
> 
> wg_tag_get() has no sleep points.
> 
> wg_timers_event_want_initiation() has the sleep point produced by
> `t_lock' rwlock. We also acquire `t_handshake_mtx' mutex with
> `t_lock' held. I changed `t_lock' to `t_mtx' mutex.
> 
> wg_queue_out() has no sleep points. 
> 
> `c_lock' protected paths has no sleep points.
> 
> The `t_mtx' is the simplest for review. It protects `t_disabled' and
> `t_need_another_keepalive' and the serialized paths are timeout or task
> scheduling.
> 
> > ...I don't have the time to go over all those codepaths and check.
> >  
> 
> I thought the reported is interesting to fix this and provides some
> feedback.

Thanks for the details. I suggest to commit this so we can get some tests
quickly and hopefully it helps to fix some of the wg(4) crashes people
hit. According to your other mail there are some additional races that
require fixing.

Diff looks OK to me -- but I don't use wg(4).
 
> > > Index: sys/net/if_wg.c
> > > ===
> > > RCS file: /cvs/src/sys/net/if_wg.c,v
> > > retrieving revision 1.36
> > > diff -u -p -r1.36 if_wg.c
> > > --- sys/net/if_wg.c   18 Jan 2024 08:46:41 -  1.36
> > > +++ sys/net/if_wg.c   28 Feb 2024 14:49:16 -
> > > @@ -150,8 +150,8 @@ struct wg_index {
> > >  };
> > >  
> > >  struct wg_timers {
> > > - /* t_lock is for blocking wg_timers_event_* when setting t_disabled. */
> > > - struct rwlockt_lock;
> > > + /* t_mtx is for blocking wg_timers_event_* when setting t_disabled. */
> > > + struct mutex t_mtx;
> > >  
> > >   int  t_disabled;
> > >   int  t_need_another_keepalive;
> > > @@ -930,7 +930,7 @@ void
> > >  wg_timers_init(struct wg_timers *t)
> > >  {
> > >   bzero(t, sizeof(*t));
> > > - rw_init(>t_lock, "wg_timers");
> > > + mtx_init_flags(>t_mtx, IPL_NET, "wg_timers", 0);
> > >   mtx_init(>t_handshake_mtx, IPL_NET);
> > >  
> > >   timeout_set(>t_new_handshake, wg_timers_run_new_handshake, t);
> > > @@ -945,19 +945,19 @@ wg_timers_init(struct wg_timers *t)
> > >  void
> > >  wg_timers_enable(struct wg_timers *t)
> > >  {
> > > - rw_enter_write(>t_lock);
> > > + mtx_enter(>t_mtx);
> > >   t->t_disabled = 0;
> > > - rw_exit_write(>t_lock);
> > > + mtx_leave(>t_mtx);
> > >   wg_timers_run_persistent_keepalive(t);
> > >  }
> > >  
> > >  void
> > >  wg_timers_disable(struct wg_timers *t)
> > >  {
> > > - rw_enter_write(>t_lock);
> > > + mtx_enter(>t_mtx);
> > >   t->t_disabled = 1;
> > >   t->t_need_another_keepalive = 0;
> > > - rw_exit_write(>t_lock);
> > > + mtx_leave(>t_mtx);
> > >  
> > >   timeout_del_barrier(>t_new_handshake);
> > >   timeout_del_barrier(>t_send_keepalive);
> > > @@ -969,12 +969,12 @@ wg_timers_disable(struct wg_timers *t)
> > >  void
> > >  

Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-03-05 Thread Vitaliy Makkoveev
On Mon, Mar 04, 2024 at 01:00:32PM +0100, Claudio Jeker wrote:
> On Wed, Feb 28, 2024 at 06:03:14PM +0300, Vitaliy Makkoveev wrote:
> > On Wed, Feb 28, 2024 at 02:45:44PM +0100, Martin Pieuchot wrote:
> > > > > > 
> > > > > > Not only wg(4). Depends on interface queue usage, ifq_start() 
> > > > > > schedules
> > > > > > (*if_qstart)() or calls it, so all the interfaces with use 
> > > > > > rwlock(9) in
> > > > > > (*if_qstart)() handler are in risk.
> > > > > > 
> > > > > > What about to always schedule (*if_qstart)()?
> > > > > 
> > > > > Why would you want to introduce additional latence?
> > > > > 
> > > > 
> > > > I suppose it the less evil than strictly deny rwlocks in (*if_qstart)().
> > > > Anyway it will be scheduled unless `seq_len' exceeds the watermark. 
> > > 
> > > Please no.  This is not going to happen.  wg(4) has to be fixed.  Let's
> > > not change the design of the kernel every time a bug is found.
> > > 
> > 
> > I'm not the fan of ifq_start() behaviour.
> > 
> > wg(4) needs to convert `t_lock', `r_keypair_lock' and `c_lock' rwlocks
> > to mutexes. I used mtx_init_flags() to keep existing names.
> 
> You do not mention if you verified that any of those codepath may sleep or
> not.

I use wg(4), so, obviously, I verified this.

wg_qstart() calls the following routines: noise_remote_ready(),
wg_tag_get(), wg_timers_event_want_initiation() and wg_queue_out().

noise_remote_ready() has the sleep point produced by `r_keypair_lock'
rwlock. We also acquire `c_lock' rwlock with `r_keypair_lock' rwlock
held. I changed them both to `r_keypair_mtx' and `c_mtx' mutexes.

`c_lock' serialized paths noise_counter_send() and noise_counter_recv()
have no sleep points.

`r_keypair_lock' serialized paths noise_remote_begin_session(),
noise_remote_clear(), noise_remote_expire_current(),
noise_remote_ready(), noise_remote_encrypt() and  noise_remote_decrypt()
have no sleep points. The suspicious noise_remote_keypair_allocate()
only performs operations around list.

wg_tag_get() has no sleep points.

wg_timers_event_want_initiation() has the sleep point produced by
`t_lock' rwlock. We also acquire `t_handshake_mtx' mutex with
`t_lock' held. I changed `t_lock' to `t_mtx' mutex.

wg_queue_out() has no sleep points. 

`c_lock' protected paths has no sleep points.

The `t_mtx' is the simplest for review. It protects `t_disabled' and
`t_need_another_keepalive' and the serialized paths are timeout or task
scheduling.

> ...I don't have the time to go over all those codepaths and check.
>  

I thought the reported is interesting to fix this and provides some
feedback.

> > Index: sys/net/if_wg.c
> > ===
> > RCS file: /cvs/src/sys/net/if_wg.c,v
> > retrieving revision 1.36
> > diff -u -p -r1.36 if_wg.c
> > --- sys/net/if_wg.c 18 Jan 2024 08:46:41 -  1.36
> > +++ sys/net/if_wg.c 28 Feb 2024 14:49:16 -
> > @@ -150,8 +150,8 @@ struct wg_index {
> >  };
> >  
> >  struct wg_timers {
> > -   /* t_lock is for blocking wg_timers_event_* when setting t_disabled. */
> > -   struct rwlockt_lock;
> > +   /* t_mtx is for blocking wg_timers_event_* when setting t_disabled. */
> > +   struct mutex t_mtx;
> >  
> > int  t_disabled;
> > int  t_need_another_keepalive;
> > @@ -930,7 +930,7 @@ void
> >  wg_timers_init(struct wg_timers *t)
> >  {
> > bzero(t, sizeof(*t));
> > -   rw_init(>t_lock, "wg_timers");
> > +   mtx_init_flags(>t_mtx, IPL_NET, "wg_timers", 0);
> > mtx_init(>t_handshake_mtx, IPL_NET);
> >  
> > timeout_set(>t_new_handshake, wg_timers_run_new_handshake, t);
> > @@ -945,19 +945,19 @@ wg_timers_init(struct wg_timers *t)
> >  void
> >  wg_timers_enable(struct wg_timers *t)
> >  {
> > -   rw_enter_write(>t_lock);
> > +   mtx_enter(>t_mtx);
> > t->t_disabled = 0;
> > -   rw_exit_write(>t_lock);
> > +   mtx_leave(>t_mtx);
> > wg_timers_run_persistent_keepalive(t);
> >  }
> >  
> >  void
> >  wg_timers_disable(struct wg_timers *t)
> >  {
> > -   rw_enter_write(>t_lock);
> > +   mtx_enter(>t_mtx);
> > t->t_disabled = 1;
> > t->t_need_another_keepalive = 0;
> > -   rw_exit_write(>t_lock);
> > +   mtx_leave(>t_mtx);
> >  
> > timeout_del_barrier(>t_new_handshake);
> > timeout_del_barrier(>t_send_keepalive);
> > @@ -969,12 +969,12 @@ wg_timers_disable(struct wg_timers *t)
> >  void
> >  wg_timers_set_persistent_keepalive(struct wg_timers *t, uint16_t interval)
> >  {
> > -   rw_enter_read(>t_lock);
> > +   mtx_enter(>t_mtx);
> > if (!t->t_disabled) {
> > t->t_persistent_keepalive_interval = interval;
> > wg_timers_run_persistent_keepalive(t);
> > }
> > -   rw_exit_read(>t_lock);
> > +   mtx_leave(>t_mtx);
> >  }
> >  
> >  int
> > @@ -1020,16 +1020,16 @@ wg_timers_event_data_sent(struct wg_time
> > int msecs = NEW_HANDSHAKE_TIMEOUT * 1000;
> > msecs += arc4random_uniform(REKEY_TIMEOUT_JITTER);
> >  

Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-03-04 Thread Claudio Jeker
On Wed, Feb 28, 2024 at 06:03:14PM +0300, Vitaliy Makkoveev wrote:
> On Wed, Feb 28, 2024 at 02:45:44PM +0100, Martin Pieuchot wrote:
> > > > > 
> > > > > Not only wg(4). Depends on interface queue usage, ifq_start() 
> > > > > schedules
> > > > > (*if_qstart)() or calls it, so all the interfaces with use rwlock(9) 
> > > > > in
> > > > > (*if_qstart)() handler are in risk.
> > > > > 
> > > > > What about to always schedule (*if_qstart)()?
> > > > 
> > > > Why would you want to introduce additional latence?
> > > > 
> > > 
> > > I suppose it the less evil than strictly deny rwlocks in (*if_qstart)().
> > > Anyway it will be scheduled unless `seq_len' exceeds the watermark. 
> > 
> > Please no.  This is not going to happen.  wg(4) has to be fixed.  Let's
> > not change the design of the kernel every time a bug is found.
> > 
> 
> I'm not the fan of ifq_start() behaviour.
> 
> wg(4) needs to convert `t_lock', `r_keypair_lock' and `c_lock' rwlocks
> to mutexes. I used mtx_init_flags() to keep existing names.

You do not mention if you verified that any of those codepath may sleep or
not. I don't have the time to go over all those codepaths and check.
 
> Index: sys/net/if_wg.c
> ===
> RCS file: /cvs/src/sys/net/if_wg.c,v
> retrieving revision 1.36
> diff -u -p -r1.36 if_wg.c
> --- sys/net/if_wg.c   18 Jan 2024 08:46:41 -  1.36
> +++ sys/net/if_wg.c   28 Feb 2024 14:49:16 -
> @@ -150,8 +150,8 @@ struct wg_index {
>  };
>  
>  struct wg_timers {
> - /* t_lock is for blocking wg_timers_event_* when setting t_disabled. */
> - struct rwlockt_lock;
> + /* t_mtx is for blocking wg_timers_event_* when setting t_disabled. */
> + struct mutex t_mtx;
>  
>   int  t_disabled;
>   int  t_need_another_keepalive;
> @@ -930,7 +930,7 @@ void
>  wg_timers_init(struct wg_timers *t)
>  {
>   bzero(t, sizeof(*t));
> - rw_init(>t_lock, "wg_timers");
> + mtx_init_flags(>t_mtx, IPL_NET, "wg_timers", 0);
>   mtx_init(>t_handshake_mtx, IPL_NET);
>  
>   timeout_set(>t_new_handshake, wg_timers_run_new_handshake, t);
> @@ -945,19 +945,19 @@ wg_timers_init(struct wg_timers *t)
>  void
>  wg_timers_enable(struct wg_timers *t)
>  {
> - rw_enter_write(>t_lock);
> + mtx_enter(>t_mtx);
>   t->t_disabled = 0;
> - rw_exit_write(>t_lock);
> + mtx_leave(>t_mtx);
>   wg_timers_run_persistent_keepalive(t);
>  }
>  
>  void
>  wg_timers_disable(struct wg_timers *t)
>  {
> - rw_enter_write(>t_lock);
> + mtx_enter(>t_mtx);
>   t->t_disabled = 1;
>   t->t_need_another_keepalive = 0;
> - rw_exit_write(>t_lock);
> + mtx_leave(>t_mtx);
>  
>   timeout_del_barrier(>t_new_handshake);
>   timeout_del_barrier(>t_send_keepalive);
> @@ -969,12 +969,12 @@ wg_timers_disable(struct wg_timers *t)
>  void
>  wg_timers_set_persistent_keepalive(struct wg_timers *t, uint16_t interval)
>  {
> - rw_enter_read(>t_lock);
> + mtx_enter(>t_mtx);
>   if (!t->t_disabled) {
>   t->t_persistent_keepalive_interval = interval;
>   wg_timers_run_persistent_keepalive(t);
>   }
> - rw_exit_read(>t_lock);
> + mtx_leave(>t_mtx);
>  }
>  
>  int
> @@ -1020,16 +1020,16 @@ wg_timers_event_data_sent(struct wg_time
>   int msecs = NEW_HANDSHAKE_TIMEOUT * 1000;
>   msecs += arc4random_uniform(REKEY_TIMEOUT_JITTER);
>  
> - rw_enter_read(>t_lock);
> + mtx_enter(>t_mtx);
>   if (!t->t_disabled && !timeout_pending(>t_new_handshake))
>   timeout_add_msec(>t_new_handshake, msecs);
> - rw_exit_read(>t_lock);
> + mtx_leave(>t_mtx);
>  }
>  
>  void
>  wg_timers_event_data_received(struct wg_timers *t)
>  {
> - rw_enter_read(>t_lock);
> + mtx_enter(>t_mtx);
>   if (!t->t_disabled) {
>   if (!timeout_pending(>t_send_keepalive))
>   timeout_add_sec(>t_send_keepalive,
> @@ -1037,7 +1037,7 @@ wg_timers_event_data_received(struct wg_
>   else
>   t->t_need_another_keepalive = 1;
>   }
> - rw_exit_read(>t_lock);
> + mtx_leave(>t_mtx);
>  }
>  
>  void
> @@ -1055,11 +1055,11 @@ wg_timers_event_any_authenticated_packet
>  void
>  wg_timers_event_any_authenticated_packet_traversal(struct wg_timers *t)
>  {
> - rw_enter_read(>t_lock);
> + mtx_enter(>t_mtx);
>   if (!t->t_disabled && t->t_persistent_keepalive_interval > 0)
>   timeout_add_sec(>t_persistent_keepalive,
>   t->t_persistent_keepalive_interval);
> - rw_exit_read(>t_lock);
> + mtx_leave(>t_mtx);
>  }
>  
>  void
> @@ -1068,10 +1068,10 @@ wg_timers_event_handshake_initiated(stru
>   int msecs = REKEY_TIMEOUT * 1000;
>   msecs += arc4random_uniform(REKEY_TIMEOUT_JITTER);
>  
> - rw_enter_read(>t_lock);
> + mtx_enter(>t_mtx);
>   if (!t->t_disabled)
>   

Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Vitaliy Makkoveev
On Wed, Feb 28, 2024 at 02:45:44PM +0100, Martin Pieuchot wrote:
> > > > 
> > > > Not only wg(4). Depends on interface queue usage, ifq_start() schedules
> > > > (*if_qstart)() or calls it, so all the interfaces with use rwlock(9) in
> > > > (*if_qstart)() handler are in risk.
> > > > 
> > > > What about to always schedule (*if_qstart)()?
> > > 
> > > Why would you want to introduce additional latence?
> > > 
> > 
> > I suppose it the less evil than strictly deny rwlocks in (*if_qstart)().
> > Anyway it will be scheduled unless `seq_len' exceeds the watermark. 
> 
> Please no.  This is not going to happen.  wg(4) has to be fixed.  Let's
> not change the design of the kernel every time a bug is found.
> 

I'm not the fan of ifq_start() behaviour.

wg(4) needs to convert `t_lock', `r_keypair_lock' and `c_lock' rwlocks
to mutexes. I used mtx_init_flags() to keep existing names.

Index: sys/net/if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.36
diff -u -p -r1.36 if_wg.c
--- sys/net/if_wg.c 18 Jan 2024 08:46:41 -  1.36
+++ sys/net/if_wg.c 28 Feb 2024 14:49:16 -
@@ -150,8 +150,8 @@ struct wg_index {
 };
 
 struct wg_timers {
-   /* t_lock is for blocking wg_timers_event_* when setting t_disabled. */
-   struct rwlockt_lock;
+   /* t_mtx is for blocking wg_timers_event_* when setting t_disabled. */
+   struct mutex t_mtx;
 
int  t_disabled;
int  t_need_another_keepalive;
@@ -930,7 +930,7 @@ void
 wg_timers_init(struct wg_timers *t)
 {
bzero(t, sizeof(*t));
-   rw_init(>t_lock, "wg_timers");
+   mtx_init_flags(>t_mtx, IPL_NET, "wg_timers", 0);
mtx_init(>t_handshake_mtx, IPL_NET);
 
timeout_set(>t_new_handshake, wg_timers_run_new_handshake, t);
@@ -945,19 +945,19 @@ wg_timers_init(struct wg_timers *t)
 void
 wg_timers_enable(struct wg_timers *t)
 {
-   rw_enter_write(>t_lock);
+   mtx_enter(>t_mtx);
t->t_disabled = 0;
-   rw_exit_write(>t_lock);
+   mtx_leave(>t_mtx);
wg_timers_run_persistent_keepalive(t);
 }
 
 void
 wg_timers_disable(struct wg_timers *t)
 {
-   rw_enter_write(>t_lock);
+   mtx_enter(>t_mtx);
t->t_disabled = 1;
t->t_need_another_keepalive = 0;
-   rw_exit_write(>t_lock);
+   mtx_leave(>t_mtx);
 
timeout_del_barrier(>t_new_handshake);
timeout_del_barrier(>t_send_keepalive);
@@ -969,12 +969,12 @@ wg_timers_disable(struct wg_timers *t)
 void
 wg_timers_set_persistent_keepalive(struct wg_timers *t, uint16_t interval)
 {
-   rw_enter_read(>t_lock);
+   mtx_enter(>t_mtx);
if (!t->t_disabled) {
t->t_persistent_keepalive_interval = interval;
wg_timers_run_persistent_keepalive(t);
}
-   rw_exit_read(>t_lock);
+   mtx_leave(>t_mtx);
 }
 
 int
@@ -1020,16 +1020,16 @@ wg_timers_event_data_sent(struct wg_time
int msecs = NEW_HANDSHAKE_TIMEOUT * 1000;
msecs += arc4random_uniform(REKEY_TIMEOUT_JITTER);
 
-   rw_enter_read(>t_lock);
+   mtx_enter(>t_mtx);
if (!t->t_disabled && !timeout_pending(>t_new_handshake))
timeout_add_msec(>t_new_handshake, msecs);
-   rw_exit_read(>t_lock);
+   mtx_leave(>t_mtx);
 }
 
 void
 wg_timers_event_data_received(struct wg_timers *t)
 {
-   rw_enter_read(>t_lock);
+   mtx_enter(>t_mtx);
if (!t->t_disabled) {
if (!timeout_pending(>t_send_keepalive))
timeout_add_sec(>t_send_keepalive,
@@ -1037,7 +1037,7 @@ wg_timers_event_data_received(struct wg_
else
t->t_need_another_keepalive = 1;
}
-   rw_exit_read(>t_lock);
+   mtx_leave(>t_mtx);
 }
 
 void
@@ -1055,11 +1055,11 @@ wg_timers_event_any_authenticated_packet
 void
 wg_timers_event_any_authenticated_packet_traversal(struct wg_timers *t)
 {
-   rw_enter_read(>t_lock);
+   mtx_enter(>t_mtx);
if (!t->t_disabled && t->t_persistent_keepalive_interval > 0)
timeout_add_sec(>t_persistent_keepalive,
t->t_persistent_keepalive_interval);
-   rw_exit_read(>t_lock);
+   mtx_leave(>t_mtx);
 }
 
 void
@@ -1068,10 +1068,10 @@ wg_timers_event_handshake_initiated(stru
int msecs = REKEY_TIMEOUT * 1000;
msecs += arc4random_uniform(REKEY_TIMEOUT_JITTER);
 
-   rw_enter_read(>t_lock);
+   mtx_enter(>t_mtx);
if (!t->t_disabled)
timeout_add_msec(>t_retry_handshake, msecs);
-   rw_exit_read(>t_lock);
+   mtx_leave(>t_mtx);
 }
 
 void
@@ -1085,7 +1085,7 @@ wg_timers_event_handshake_responded(stru
 void
 wg_timers_event_handshake_complete(struct wg_timers *t)
 {
-   rw_enter_read(>t_lock);
+   mtx_enter(>t_mtx);
if (!t->t_disabled) {
mtx_enter(>t_handshake_mtx);

Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Martin Pieuchot
On 28/02/24(Wed) 16:39, Vitaliy Makkoveev wrote:
> On Wed, Feb 28, 2024 at 02:22:31PM +0100, Mark Kettenis wrote:
> > > Date: Wed, 28 Feb 2024 16:16:09 +0300
> > > From: Vitaliy Makkoveev 
> > > 
> > > On Wed, Feb 28, 2024 at 12:36:26PM +0100, Claudio Jeker wrote:
> > > > On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> > > > > Hi,
> > > > > 
> > > > > thank you for looking into it, and for the advice.
> > > > > 
> > > > > On Wed, 28 Feb 2024 10:13:06 +
> > > > > Stuart Henderson  wrote:
> > > > > 
> > > > > > Please try to re-type at least the most important bits from a
> > > > > > screenshot so readers can quickly see which subsystems are involved.
> > > > > 
> > > > > Below is manual transcript of whole screenshot, hopefully no typos.
> > > > > 
> > > > > If you have any advice on what should I do if it happens again in 
> > > > > order
> > > > > to get as much info for debuggers as possible, please let me know.
> > > > > 
> > > > > splassert: assertwaitok: want 0 have 4
> > > > > panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> > > > > "/usr/src/sys/kern/kern_sched.c", line 373
> > > > > Stopped at db_enter+0x14: popq %rbp
> > > > >TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> > > > > 199248  36172  577  0x10   01  openvpn
> > > > > 490874  474460   0x14000   0x2002  wg_handshake
> > > > >  71544   93110   0x14000   0x2003  softnet0
> > > > > db_enter() at db_enter+0x14
> > > > > panic(820a4b9f) at panic+0xc3
> > > > > __assert(82121fcb,8209ae5f,175,82092fbf) at 
> > > > > assert+0x29
> > > > > sched_chooseproc() at sched_chooseproc+0x26d
> > > > > mi_switch() at mi_switch+0x17f
> > > > > sleep_finish(0,1) at sleep_finish+0x107
> > > > > rw_enter(88003cf0,2) at rw_enter+0x1ad
> > > > > noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> > > > > wg_qstart(fff80a622a8) at wg_qstart+0x18c
> > > > > ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> > > > > hfsc_deferred(80a62000) at hfsc_deferred+0x68
> > > > > softclock_process_tick_timeout(8115e248,1) at 
> > > > > softclock_process_tick_timeout+0xfb
> > > > > softclock(0) at softclock+0xb8
> > > > > softintr_dispatch(0) at softintr_dispatch+0xeb
> > > > > end trace frame: 0x800020dbc730, count:0
> > > > > 
> > > > 
> > > > WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
> > > > when called from inside a timeout aka softclock? This is interrupt 
> > > > context
> > > > code is not allowed to sleep there.
> > > > 
> > > 
> > > Not only wg(4). Depends on interface queue usage, ifq_start() schedules
> > > (*if_qstart)() or calls it, so all the interfaces with use rwlock(9) in
> > > (*if_qstart)() handler are in risk.
> > > 
> > > What about to always schedule (*if_qstart)()?
> > 
> > Why would you want to introduce additional latence?
> > 
> 
> I suppose it the less evil than strictly deny rwlocks in (*if_qstart)().
> Anyway it will be scheduled unless `seq_len' exceeds the watermark. 

Please no.  This is not going to happen.  wg(4) has to be fixed.  Let's
not change the design of the kernel every time a bug is found.



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Claudio Jeker
On Wed, Feb 28, 2024 at 02:37:31PM +0100, Martin Pieuchot wrote:
> On 28/02/24(Wed) 12:36, Claudio Jeker wrote:
> > On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> > > Hi,
> > > 
> > > thank you for looking into it, and for the advice.
> > > 
> > > On Wed, 28 Feb 2024 10:13:06 +
> > > Stuart Henderson  wrote:
> > > 
> > > > Please try to re-type at least the most important bits from a
> > > > screenshot so readers can quickly see which subsystems are involved.
> > > 
> > > Below is manual transcript of whole screenshot, hopefully no typos.
> > > 
> > > If you have any advice on what should I do if it happens again in order
> > > to get as much info for debuggers as possible, please let me know.
> > > 
> > > splassert: assertwaitok: want 0 have 4
> > > panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> > > "/usr/src/sys/kern/kern_sched.c", line 373
> > > Stopped at db_enter+0x14: popq %rbp
> > >TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> > > 199248  36172  577  0x10   01  openvpn
> > > 490874  474460   0x14000   0x2002  wg_handshake
> > >  71544   93110   0x14000   0x2003  softnet0
> > > db_enter() at db_enter+0x14
> > > panic(820a4b9f) at panic+0xc3
> > > __assert(82121fcb,8209ae5f,175,82092fbf) at 
> > > assert+0x29
> > > sched_chooseproc() at sched_chooseproc+0x26d
> > > mi_switch() at mi_switch+0x17f
> > > sleep_finish(0,1) at sleep_finish+0x107
> > > rw_enter(88003cf0,2) at rw_enter+0x1ad
> > > noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> > > wg_qstart(fff80a622a8) at wg_qstart+0x18c
> > > ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> > > hfsc_deferred(80a62000) at hfsc_deferred+0x68
> > > softclock_process_tick_timeout(8115e248,1) at 
> > > softclock_process_tick_timeout+0xfb
> > > softclock(0) at softclock+0xb8
> > > softintr_dispatch(0) at softintr_dispatch+0xeb
> > > end trace frame: 0x800020dbc730, count:0
> > > 
> > 
> > WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
> > when called from inside a timeout aka softclock? This is interrupt context
> > code is not allowed to sleep there.
> 
> It is indeed.  It isn't clear to me why `r_keypair_lock' is a rwlock and
> not a mutex.  Does anything sleep in this code?  Why did they pick
> sleeping locks in the first place?

Maybe because multiple readers are possible with an rwlock. 

-- 
:wq Claudio



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Vitaliy Makkoveev
On Wed, Feb 28, 2024 at 02:22:31PM +0100, Mark Kettenis wrote:
> > Date: Wed, 28 Feb 2024 16:16:09 +0300
> > From: Vitaliy Makkoveev 
> > 
> > On Wed, Feb 28, 2024 at 12:36:26PM +0100, Claudio Jeker wrote:
> > > On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> > > > Hi,
> > > > 
> > > > thank you for looking into it, and for the advice.
> > > > 
> > > > On Wed, 28 Feb 2024 10:13:06 +
> > > > Stuart Henderson  wrote:
> > > > 
> > > > > Please try to re-type at least the most important bits from a
> > > > > screenshot so readers can quickly see which subsystems are involved.
> > > > 
> > > > Below is manual transcript of whole screenshot, hopefully no typos.
> > > > 
> > > > If you have any advice on what should I do if it happens again in order
> > > > to get as much info for debuggers as possible, please let me know.
> > > > 
> > > > splassert: assertwaitok: want 0 have 4
> > > > panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> > > > "/usr/src/sys/kern/kern_sched.c", line 373
> > > > Stopped at db_enter+0x14: popq %rbp
> > > >TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> > > > 199248  36172  577  0x10   01  openvpn
> > > > 490874  474460   0x14000   0x2002  wg_handshake
> > > >  71544   93110   0x14000   0x2003  softnet0
> > > > db_enter() at db_enter+0x14
> > > > panic(820a4b9f) at panic+0xc3
> > > > __assert(82121fcb,8209ae5f,175,82092fbf) at 
> > > > assert+0x29
> > > > sched_chooseproc() at sched_chooseproc+0x26d
> > > > mi_switch() at mi_switch+0x17f
> > > > sleep_finish(0,1) at sleep_finish+0x107
> > > > rw_enter(88003cf0,2) at rw_enter+0x1ad
> > > > noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> > > > wg_qstart(fff80a622a8) at wg_qstart+0x18c
> > > > ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> > > > hfsc_deferred(80a62000) at hfsc_deferred+0x68
> > > > softclock_process_tick_timeout(8115e248,1) at 
> > > > softclock_process_tick_timeout+0xfb
> > > > softclock(0) at softclock+0xb8
> > > > softintr_dispatch(0) at softintr_dispatch+0xeb
> > > > end trace frame: 0x800020dbc730, count:0
> > > > 
> > > 
> > > WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
> > > when called from inside a timeout aka softclock? This is interrupt context
> > > code is not allowed to sleep there.
> > > 
> > 
> > Not only wg(4). Depends on interface queue usage, ifq_start() schedules
> > (*if_qstart)() or calls it, so all the interfaces with use rwlock(9) in
> > (*if_qstart)() handler are in risk.
> > 
> > What about to always schedule (*if_qstart)()?
> 
> Why would you want to introduce additional latence?
> 

I suppose it the less evil than strictly deny rwlocks in (*if_qstart)().
Anyway it will be scheduled unless `seq_len' exceeds the watermark. 



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Martin Pieuchot
On 28/02/24(Wed) 12:36, Claudio Jeker wrote:
> On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> > Hi,
> > 
> > thank you for looking into it, and for the advice.
> > 
> > On Wed, 28 Feb 2024 10:13:06 +
> > Stuart Henderson  wrote:
> > 
> > > Please try to re-type at least the most important bits from a
> > > screenshot so readers can quickly see which subsystems are involved.
> > 
> > Below is manual transcript of whole screenshot, hopefully no typos.
> > 
> > If you have any advice on what should I do if it happens again in order
> > to get as much info for debuggers as possible, please let me know.
> > 
> > splassert: assertwaitok: want 0 have 4
> > panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> > "/usr/src/sys/kern/kern_sched.c", line 373
> > Stopped at db_enter+0x14: popq %rbp
> >TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> > 199248  36172  577  0x10   01  openvpn
> > 490874  474460   0x14000   0x2002  wg_handshake
> >  71544   93110   0x14000   0x2003  softnet0
> > db_enter() at db_enter+0x14
> > panic(820a4b9f) at panic+0xc3
> > __assert(82121fcb,8209ae5f,175,82092fbf) at 
> > assert+0x29
> > sched_chooseproc() at sched_chooseproc+0x26d
> > mi_switch() at mi_switch+0x17f
> > sleep_finish(0,1) at sleep_finish+0x107
> > rw_enter(88003cf0,2) at rw_enter+0x1ad
> > noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> > wg_qstart(fff80a622a8) at wg_qstart+0x18c
> > ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> > hfsc_deferred(80a62000) at hfsc_deferred+0x68
> > softclock_process_tick_timeout(8115e248,1) at 
> > softclock_process_tick_timeout+0xfb
> > softclock(0) at softclock+0xb8
> > softintr_dispatch(0) at softintr_dispatch+0xeb
> > end trace frame: 0x800020dbc730, count:0
> > 
> 
> WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
> when called from inside a timeout aka softclock? This is interrupt context
> code is not allowed to sleep there.

It is indeed.  It isn't clear to me why `r_keypair_lock' is a rwlock and
not a mutex.  Does anything sleep in this code?  Why did they pick
sleeping locks in the first place?



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Mark Kettenis
> Date: Wed, 28 Feb 2024 16:16:09 +0300
> From: Vitaliy Makkoveev 
> 
> On Wed, Feb 28, 2024 at 12:36:26PM +0100, Claudio Jeker wrote:
> > On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> > > Hi,
> > > 
> > > thank you for looking into it, and for the advice.
> > > 
> > > On Wed, 28 Feb 2024 10:13:06 +
> > > Stuart Henderson  wrote:
> > > 
> > > > Please try to re-type at least the most important bits from a
> > > > screenshot so readers can quickly see which subsystems are involved.
> > > 
> > > Below is manual transcript of whole screenshot, hopefully no typos.
> > > 
> > > If you have any advice on what should I do if it happens again in order
> > > to get as much info for debuggers as possible, please let me know.
> > > 
> > > splassert: assertwaitok: want 0 have 4
> > > panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> > > "/usr/src/sys/kern/kern_sched.c", line 373
> > > Stopped at db_enter+0x14: popq %rbp
> > >TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> > > 199248  36172  577  0x10   01  openvpn
> > > 490874  474460   0x14000   0x2002  wg_handshake
> > >  71544   93110   0x14000   0x2003  softnet0
> > > db_enter() at db_enter+0x14
> > > panic(820a4b9f) at panic+0xc3
> > > __assert(82121fcb,8209ae5f,175,82092fbf) at 
> > > assert+0x29
> > > sched_chooseproc() at sched_chooseproc+0x26d
> > > mi_switch() at mi_switch+0x17f
> > > sleep_finish(0,1) at sleep_finish+0x107
> > > rw_enter(88003cf0,2) at rw_enter+0x1ad
> > > noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> > > wg_qstart(fff80a622a8) at wg_qstart+0x18c
> > > ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> > > hfsc_deferred(80a62000) at hfsc_deferred+0x68
> > > softclock_process_tick_timeout(8115e248,1) at 
> > > softclock_process_tick_timeout+0xfb
> > > softclock(0) at softclock+0xb8
> > > softintr_dispatch(0) at softintr_dispatch+0xeb
> > > end trace frame: 0x800020dbc730, count:0
> > > 
> > 
> > WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
> > when called from inside a timeout aka softclock? This is interrupt context
> > code is not allowed to sleep there.
> > 
> 
> Not only wg(4). Depends on interface queue usage, ifq_start() schedules
> (*if_qstart)() or calls it, so all the interfaces with use rwlock(9) in
> (*if_qstart)() handler are in risk.
> 
> What about to always schedule (*if_qstart)()?

Why would you want to introduce additional latence?

> Index: sys/net/hfsc.c
> ===
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.49
> diff -u -p -r1.49 hfsc.c
> --- sys/net/hfsc.c11 Apr 2023 00:45:09 -  1.49
> +++ sys/net/hfsc.c28 Feb 2024 13:15:22 -
> @@ -953,8 +953,7 @@ hfsc_deferred(void *arg)
>   if (!HFSC_ENABLED(ifq))
>   return;
>  
> - if (!ifq_empty(ifq))
> - ifq_start(ifq);
> + ifq_start_deferred(ifq);
>  
>   hif = ifq_q_enter(>if_snd, ifq_hfsc_ops);
>   if (hif == NULL)
> Index: sys/net/ifq.c
> ===
> RCS file: /cvs/src/sys/net/ifq.c,v
> retrieving revision 1.53
> diff -u -p -r1.53 ifq.c
> --- sys/net/ifq.c 10 Nov 2023 15:51:24 -  1.53
> +++ sys/net/ifq.c 28 Feb 2024 13:15:22 -
> @@ -133,6 +133,12 @@ ifq_start(struct ifqueue *ifq)
>   } else
>   task_add(ifq->ifq_softnet, >ifq_bundle);
>  }
> +void
> +ifq_start_deferred(struct ifqueue *ifq)
> +{
> + if (ifq_len(ifq))
> + task_add(ifq->ifq_softnet, >ifq_bundle);
> +}
>  
>  void
>  ifq_start_task(void *p)
> Index: sys/net/ifq.h
> ===
> RCS file: /cvs/src/sys/net/ifq.h,v
> retrieving revision 1.41
> diff -u -p -r1.41 ifq.h
> --- sys/net/ifq.h 10 Nov 2023 15:51:24 -  1.41
> +++ sys/net/ifq.h 28 Feb 2024 13:15:22 -
> @@ -430,6 +430,7 @@ void   ifq_destroy(struct ifqueue *);
>  void  ifq_add_data(struct ifqueue *, struct if_data *);
>  int   ifq_enqueue(struct ifqueue *, struct mbuf *);
>  void  ifq_start(struct ifqueue *);
> +void  ifq_start_deferred(struct ifqueue *);
>  struct mbuf  *ifq_deq_begin(struct ifqueue *);
>  void  ifq_deq_commit(struct ifqueue *, struct mbuf *);
>  void  ifq_deq_rollback(struct ifqueue *, struct mbuf *);
> 
> 



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Vitaliy Makkoveev
On Wed, Feb 28, 2024 at 12:36:26PM +0100, Claudio Jeker wrote:
> On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> > Hi,
> > 
> > thank you for looking into it, and for the advice.
> > 
> > On Wed, 28 Feb 2024 10:13:06 +
> > Stuart Henderson  wrote:
> > 
> > > Please try to re-type at least the most important bits from a
> > > screenshot so readers can quickly see which subsystems are involved.
> > 
> > Below is manual transcript of whole screenshot, hopefully no typos.
> > 
> > If you have any advice on what should I do if it happens again in order
> > to get as much info for debuggers as possible, please let me know.
> > 
> > splassert: assertwaitok: want 0 have 4
> > panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> > "/usr/src/sys/kern/kern_sched.c", line 373
> > Stopped at db_enter+0x14: popq %rbp
> >TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> > 199248  36172  577  0x10   01  openvpn
> > 490874  474460   0x14000   0x2002  wg_handshake
> >  71544   93110   0x14000   0x2003  softnet0
> > db_enter() at db_enter+0x14
> > panic(820a4b9f) at panic+0xc3
> > __assert(82121fcb,8209ae5f,175,82092fbf) at 
> > assert+0x29
> > sched_chooseproc() at sched_chooseproc+0x26d
> > mi_switch() at mi_switch+0x17f
> > sleep_finish(0,1) at sleep_finish+0x107
> > rw_enter(88003cf0,2) at rw_enter+0x1ad
> > noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> > wg_qstart(fff80a622a8) at wg_qstart+0x18c
> > ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> > hfsc_deferred(80a62000) at hfsc_deferred+0x68
> > softclock_process_tick_timeout(8115e248,1) at 
> > softclock_process_tick_timeout+0xfb
> > softclock(0) at softclock+0xb8
> > softintr_dispatch(0) at softintr_dispatch+0xeb
> > end trace frame: 0x800020dbc730, count:0
> > 
> 
> WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
> when called from inside a timeout aka softclock? This is interrupt context
> code is not allowed to sleep there.
> 

Not only wg(4). Depends on interface queue usage, ifq_start() schedules
(*if_qstart)() or calls it, so all the interfaces with use rwlock(9) in
(*if_qstart)() handler are in risk.

What about to always schedule (*if_qstart)()?

Index: sys/net/hfsc.c
===
RCS file: /cvs/src/sys/net/hfsc.c,v
retrieving revision 1.49
diff -u -p -r1.49 hfsc.c
--- sys/net/hfsc.c  11 Apr 2023 00:45:09 -  1.49
+++ sys/net/hfsc.c  28 Feb 2024 13:15:22 -
@@ -953,8 +953,7 @@ hfsc_deferred(void *arg)
if (!HFSC_ENABLED(ifq))
return;
 
-   if (!ifq_empty(ifq))
-   ifq_start(ifq);
+   ifq_start_deferred(ifq);
 
hif = ifq_q_enter(>if_snd, ifq_hfsc_ops);
if (hif == NULL)
Index: sys/net/ifq.c
===
RCS file: /cvs/src/sys/net/ifq.c,v
retrieving revision 1.53
diff -u -p -r1.53 ifq.c
--- sys/net/ifq.c   10 Nov 2023 15:51:24 -  1.53
+++ sys/net/ifq.c   28 Feb 2024 13:15:22 -
@@ -133,6 +133,12 @@ ifq_start(struct ifqueue *ifq)
} else
task_add(ifq->ifq_softnet, >ifq_bundle);
 }
+void
+ifq_start_deferred(struct ifqueue *ifq)
+{
+   if (ifq_len(ifq))
+   task_add(ifq->ifq_softnet, >ifq_bundle);
+}
 
 void
 ifq_start_task(void *p)
Index: sys/net/ifq.h
===
RCS file: /cvs/src/sys/net/ifq.h,v
retrieving revision 1.41
diff -u -p -r1.41 ifq.h
--- sys/net/ifq.h   10 Nov 2023 15:51:24 -  1.41
+++ sys/net/ifq.h   28 Feb 2024 13:15:22 -
@@ -430,6 +430,7 @@ void ifq_destroy(struct ifqueue *);
 voidifq_add_data(struct ifqueue *, struct if_data *);
 int ifq_enqueue(struct ifqueue *, struct mbuf *);
 voidifq_start(struct ifqueue *);
+voidifq_start_deferred(struct ifqueue *);
 struct mbuf*ifq_deq_begin(struct ifqueue *);
 voidifq_deq_commit(struct ifqueue *, struct mbuf *);
 voidifq_deq_rollback(struct ifqueue *, struct mbuf *);



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Claudio Jeker
On Wed, Feb 28, 2024 at 12:26:43PM +0100, Marko Cupać wrote:
> Hi,
> 
> thank you for looking into it, and for the advice.
> 
> On Wed, 28 Feb 2024 10:13:06 +
> Stuart Henderson  wrote:
> 
> > Please try to re-type at least the most important bits from a
> > screenshot so readers can quickly see which subsystems are involved.
> 
> Below is manual transcript of whole screenshot, hopefully no typos.
> 
> If you have any advice on what should I do if it happens again in order
> to get as much info for debuggers as possible, please let me know.
> 
> splassert: assertwaitok: want 0 have 4
> panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
> "/usr/src/sys/kern/kern_sched.c", line 373
> Stopped at db_enter+0x14: popq %rbp
>TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> 199248  36172  577  0x10   01  openvpn
> 490874  474460   0x14000   0x2002  wg_handshake
>  71544   93110   0x14000   0x2003  softnet0
> db_enter() at db_enter+0x14
> panic(820a4b9f) at panic+0xc3
> __assert(82121fcb,8209ae5f,175,82092fbf) at 
> assert+0x29
> sched_chooseproc() at sched_chooseproc+0x26d
> mi_switch() at mi_switch+0x17f
> sleep_finish(0,1) at sleep_finish+0x107
> rw_enter(88003cf0,2) at rw_enter+0x1ad
> noise_remote_ready(88003bf0) at noise_remote_ready+0x33
> wg_qstart(fff80a622a8) at wg_qstart+0x18c
> ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
> hfsc_deferred(80a62000) at hfsc_deferred+0x68
> softclock_process_tick_timeout(8115e248,1) at 
> softclock_process_tick_timeout+0xfb
> softclock(0) at softclock+0xb8
> softintr_dispatch(0) at softintr_dispatch+0xeb
> end trace frame: 0x800020dbc730, count:0
> 

WTF! wg(4) is just broken. How the hell should a sleeping rw_lock work
when called from inside a timeout aka softclock? This is interrupt context
code is not allowed to sleep there.

-- 
:wq Claudio



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Marko Cupać
Hi,

thank you for looking into it, and for the advice.

On Wed, 28 Feb 2024 10:13:06 +
Stuart Henderson  wrote:

> Please try to re-type at least the most important bits from a
> screenshot so readers can quickly see which subsystems are involved.

Below is manual transcript of whole screenshot, hopefully no typos.

If you have any advice on what should I do if it happens again in order
to get as much info for debuggers as possible, please let me know.

splassert: assertwaitok: want 0 have 4
panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
"/usr/src/sys/kern/kern_sched.c", line 373
Stopped at db_enter+0x14: popq %rbp
   TIDPID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
199248  36172  577  0x10   01  openvpn
490874  474460   0x14000   0x2002  wg_handshake
 71544   93110   0x14000   0x2003  softnet0
db_enter() at db_enter+0x14
panic(820a4b9f) at panic+0xc3
__assert(82121fcb,8209ae5f,175,82092fbf) at assert+0x29
sched_chooseproc() at sched_chooseproc+0x26d
mi_switch() at mi_switch+0x17f
sleep_finish(0,1) at sleep_finish+0x107
rw_enter(88003cf0,2) at rw_enter+0x1ad
noise_remote_ready(88003bf0) at noise_remote_ready+0x33
wg_qstart(fff80a622a8) at wg_qstart+0x18c
ifq_serialize(80a622a8,80a62390) at ifq_serialize+0xfd
hfsc_deferred(80a62000) at hfsc_deferred+0x68
softclock_process_tick_timeout(8115e248,1) at 
softclock_process_tick_timeout+0xfb
softclock(0) at softclock+0xb8
softintr_dispatch(0) at softintr_dispatch+0xeb
end trace frame: 0x800020dbc730, count:0

-- 
Before enlightenment - chop wood, draw water.
After  enlightenment - chop wood, draw water.

Marko Cupać
https://www.mimar.rs/



Re: panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2024-02-28 Thread Stuart Henderson
Please try to re-type at least the most important bits from a
screenshot so readers can quickly see which subsystems are involved.


splassert: assertwaitok: want 0 have 4
assertion p->wchan == NULL failed, kern_sched.c line 373

active procs: openvpn wg_handshake softnet0

trace includes rw_enter, sleep_finish, noise_remote_ready, wg_qstart,
ifq_serialize, hfsc_deferred


On 2024/02/28 10:12, Marko Cupać wrote:
> Hi,
> 
> I have fairly busy CARP pair which serves as a hub for ~30 spokes which
> tunnel to it over combination of GRE/IPSEC, tunnel mode ipsec and
> wireguard. It also serves as host-to-site VPN concentrator for npppd,
> openvpn and wireguard clients. Besides firewalling, pf also does prio
> and queue, as well as pflow. There's also read-only snmpd for querying
> interface statistics from another snmpd_exporter / prometheus / grafana
> host.
> 
> Yesterday one of the two crashed (7.4, syspatched up to 011_ssh). I have
> attached photo (sorry I had no access to serial console).
> 
> I'd be glad to provide more information.
> 
> Below's dmesg.
> 
> OpenBSD 7.4 (GENERIC.MP) #2: Fri Dec  8 15:39:04 MST 2023
> 
> r...@syspatch-74-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 17027289088 (16238MB)
> avail mem = 16491524096 (15727MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0x788c5000 (241 entries)
> bios0: vendor HP version "P89" date 11/23/2021
> bios0: HP ProLiant DL360 Gen9
> efi0 at bios0: UEFI 2.4
> efi0: HP rev 0x25c00
> acpi0 at bios0: ACPI 5.0
> acpi0: sleep states S0 S5
> acpi0: tables DSDT FACP UEFI MCEJ SSDT HEST BERT ERST EINJ BGRT HPET PMCT 
> WDDT APIC MCFG SLIT SRAT SPMI RASF SPCR MSCT BDAT PCCT DMAR SSDT SSDT SSDT
> acpi0: wakeup devices PEX4(S4) BR05(S4) BR03(S4) BR07(S4)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpihpet0 at acpi0: 14318179 Hz
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz, 2597.01 MHz, 06-4f-01, patch 
> 0b40
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 
> 64b/line 8-way L2 cache, 10MB 64b/line 20-way L3 cache
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
> cpu0: apic clock running at 99MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.2, IBE
> cpu1 at mainbus0: apid 2 (application processor)
> cpu1: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz, 2597.02 MHz, 06-4f-01, patch 
> 0b40
> cpu1: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu1: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 
> 64b/line 8-way L2 cache, 10MB 64b/line 20-way L3 cache
> cpu1: smt 0, core 1, package 0
> cpu2 at mainbus0: apid 4 (application processor)
> cpu2: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz, 2597.07 MHz, 06-4f-01, patch 
> 0b40
> cpu2: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu2: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 
> 64b/line 8-way L2 cache, 10MB 64b/line 20-way L3 cache
> cpu2: smt 0, core 2, package 0
> cpu3 at mainbus0: apid 6 (application processor)
> cpu3: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz, 2597.06 MHz, 06-4f-01, patch 
> 0b40
> cpu3: 
> 

panic: kernel diagnostic assertion "p->p_wchan == NULL" failed

2020-07-19 Thread jcs
>Synopsis:      panic: kernel diagnostic assertion "p->p_wchan == NULL" failed
>Category:  kernel
>Environment:
System  : OpenBSD 6.7
Details : OpenBSD 6.7-current (GENERIC.MP) #359: Sun Jul 19 
01:36:17 MDT 2020
 
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Architecture: OpenBSD.amd64
Machine : amd64
>Description:
splassert: assertwaitok: want 0 have 4
panic: kernel diagnostic assertion "p->p_wchan == NULL" failed: file 
"/usr/src/sys/kern/kern_sched.c", line 353
Stopped at  db_enter+0x10:  popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND  
 
db_enter() at db_enter+0x10
panic(81e00828) at panic+0x128
__assert(81e662e1,81df182a,161,81dead80) at 
__assert+0x2b
sched_chooseproc() at sched_chooseproc+0x23c
mi_switch() at mi_switch+0x16a
sleep_finish(800022520218,1) at sleep_finish+0x84
sleep_finish_all(800022520218,1) at sleep_finish_all+0x21
tsleep(80ac7200,119,81e5f7e9,1) at tsleep+0xd6
ttyretype(80ac7200) at ttyretype+0x2c1
ttyrub(61,80ac7200) at ttyrub+0x296
ttyinput(7f,80ac7200) at ttyinput+0x894
wsdisplay_kbdinput(80aa4800,8200,8214fee0,1) at 
wsdisplay_kbdinput+0xaa
wskbd_repeat(8059f400) at wskbd_repeat+0x54
softclock(0) at softclock+0x143
end trace frame: 0x800022520510, count: 0
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND  
  
 95235  417386  13622   1000  30x100083  nanosleep sleep  
 92726 714  1  0  30x100083  ttyin getty  
 54535  367099  1  0  30x100083  ttyin getty  
 61298  377358  1  0  30x100083  ttyin getty  
 50914   25032  1  0  30x100083  ttyin getty  
 13622  256267  1   1000  30x8b  pause zsh
 27970   89345  1  0  30x100098  poll  cron   
 27086  124705  1  0  30x80  kqreadapmd   
   635  283574  1 99  30x100090  poll  sndiod 
 94106  27  1110  30x100090  poll  sndiod 
 15101  427156  50294 95  30x100092  kqreadsmtpd  
  2622  447026  50294103  30x100092  kqreadsmtpd  
  8133  120267  50294 95  30x100092  kqreadsmtpd  
 58744  371200  50294 95  30x100092  kqreadsmtpd  
  6766  246437  50294 95  30x100092  kqreadsmtpd  
 73248  319291  50294 95  30x100092  kqreadsmtpd  
 50294  359602  1  0  30x100080  kqreadsmtpd  
 69630  152270  1  0  30x80  selectsshd   
 42837  498062  1  0  30x100080  poll  ntpd   
 93719  309023  88599 83  30x100092  poll  ntpd   
 88599  505189  1 83  30x100092  poll  ntpd   
 46954  486069  88747 74  30x100092  bpf   pflogd 
 88747  191508  1  0  30x80  netio pflogd 
 27991  489183  76346 73  30x100090  kqreadsyslogd
 76346   22579  1  0  30x100082  netio syslogd
  3500  500706  1 77  30x100090  poll  dhclient   
 34202   74190  1  0  30x80  poll  dhclient   
 25647  229537  39826115  30x100092  kqreadslaacd 
 93406  409450  39826115  30x100092  kqreadslaacd 
 39826  140627  1  0  30x100080  kqreadslaacd 
 83062  368808  0  0  3 0x14200  bored i915-userptr-acq
 36069  209829  0  0  3 0x14200  bored i915_flip  
 93437  164764  0  0  3 0x14200  bored i915_modeset   
 15782  454241  0  0  3 0x14200  bored i915-dp
 47677   30967  0  0  3 0x14200  bored i915   
 12555  298661  0  0  3 0x14200  bored smr
 99642  122374  0  0  3 0x14200  pgzerozerothread 
 38899  508278  0  0  3 0x14200  aiodoned  aiodoned   
 83433  375664  0  0  3 0x14200  syncerupdate 
 57438  226169  0  0  3 0x14200  cleaner   cleaner
 50849  208387  0  0  3 0x14200  reaperreaper 
 76534  345574  0  0  3 0x14200  pgdaemon  pagedaemon 
 25908   84585  0  0  3 0x14200  bored srdis  
   983  490844  0  0