Re: rtfree(): "rt->rt_refcnt > 0" assertion

2021-10-02 Thread Vitaliy Makkoveev
On Sat, Oct 02, 2021 at 11:27:48AM +0200, Martin Pieuchot wrote:
> On 15/09/21(Wed) 01:23, Vitaliy Makkoveev wrote:
> > We have weird `rt_refcnt' check in rtfree():
> 
> > 
> > 497 rtfree(struct rtentry *rt)
> > 498 {
> > ...
> > 504 refcnt = (int)atomic_dec_int_nv(&rt->rt_refcnt);
> > 505 if (refcnt <= 0) {
> > 506 KASSERT(!ISSET(rt->rt_flags, RTF_UP));
> > 507 KASSERT(!RT_ROOT(rt));
> > 508 atomic_dec_int(&rttrash);
> > 509 if (refcnt < 0) {
> > 510 printf("rtfree: %p not freed (neg refs)\n", rt);
> > 511 return;
> > 512 }  
> > 
> > We underflow `rt_refcnt' when we missed to get reference to this `rt' or
> > we did extra release. This is the bug which should be exposed. But
> > according current code this `rt' is just leaked. Also it's easy to miss
> > this error condition because we only print error message.
> 
> Yes, all of this is intentional.  If you screw up reference counting a
> leak is better than a crash that you can't debug.  At least people can
> report there is a leak.  
> 

Sorry, I don't think the leak is more helpful for debugging. The error
message can easily be missed and go unreported, but a panic cannot be
ignored.

> > I propose to put "rt->rt_refcnt > 0" assertion before we decrement
> > `rt_refcnt'. This makes reference counting errors more notable when they
> > are. Also I changed `rt_refcnt' definition to unsigned integer.
> 
> Such assert isn't safe because the value can be dereferenced by another
> thread.  You can only assert for the value of a private variable or if
> you're under a critical section.
> 

Maybe I'm wrong, but the sequence:

KASSERT(rt->rt_refcnt > 0);
if (atomic_dec_int_nv(&rt->rt_refcnt) == 0) {

should not be reordered. The variable is an 'int', so the plain load and
the atomic decrement cannot interrupt each other. Many rtfree(9) threads
may hit this assertion concurrently, but there is no case where the count
underflows and no rtfree(9) thread hits it. A critical section is not
required here. We have the same logic in other places, like closef():

1230 closef(struct file *fp, struct proc *p)
1231 {
1232 struct filedesc *fdp;
1233 
1234 if (fp == NULL)
1235 return (0);
1236 
1237 KASSERTMSG(fp->f_count >= 2, "count (%u) < 2", fp->f_count);
1238 
1239 atomic_dec_int(&fp->f_count);

> > I didn't find any "rtfree: ... not freed (neg refs)" report, so it
> > looks like we can't hit this assertion, but I like to commit this
> > diff after release. But nothing stops to test it and provide feedback :)
> 
> We did hit it many many times in the past.  Please do not change this.
>

I searched for "rtfree: ... not freed (neg refs)" on Google but didn't
find anything. I propose putting this diff into snaps before committing
it. We would instantly receive reports if we have refcounting issues,
and likewise if someone introduces them in the future.

> > Index: sys/net/route.c
> > ===
> > RCS file: /cvs/src/sys/net/route.c,v
> > retrieving revision 1.399
> > diff -u -p -r1.399 route.c
> > --- sys/net/route.c 25 May 2021 22:45:09 -  1.399
> > +++ sys/net/route.c 14 Sep 2021 21:47:11 -
> > @@ -496,20 +496,15 @@ rtref(struct rtentry *rt)
> >  void
> >  rtfree(struct rtentry *rt)
> >  {
> > -   int  refcnt;
> > -
> > if (rt == NULL)
> > return;
> >  
> > -   refcnt = (int)atomic_dec_int_nv(&rt->rt_refcnt);
> > -   if (refcnt <= 0) {
> > +   KASSERT(rt->rt_refcnt > 0);
> > +
> > +   if (atomic_dec_int_nv(&rt->rt_refcnt) == 0) {
> > KASSERT(!ISSET(rt->rt_flags, RTF_UP));
> > KASSERT(!RT_ROOT(rt));
> > atomic_dec_int(&rttrash);
> > -   if (refcnt < 0) {
> > -   printf("rtfree: %p not freed (neg refs)\n", rt);
> > -   return;
> > -   }
> >  
> > KERNEL_LOCK();
> > rt_timer_remove_all(rt);
> > Index: sys/net/route.h
> > ===
> > RCS file: /cvs/src/sys/net/route.h,v
> > retrieving revision 1.185
> > diff -u -p -r1.185 route.h
> > --- sys/net/route.h 17 Mar 2021 09:05:42 -  1.185
> > +++ sys/net/route.h 14 Sep 2021 21:47:11 -
> > @@ -113,7 +113,7 @@ struct rtentry {
> > struct rt_kmetrics rt_rmx;  /* metrics used by rx'ing protocols */
> > unsigned int rt_ifidx;  /* the answer: interface to use */
> > unsigned int rt_flags;  /* up/down?, host/net */
> > -   int  rt_refcnt; /* # held references */
> > +   unsigned int rt_refcnt; /* # held references */
> > int  rt_plen;   /* prefix length */
> > uint16_t rt_labelid;/* route label ID */
> > uint8_t

Re: forwarding in parallel ipsec workaround

2021-10-02 Thread Chris Cappuccio
Hrvoje Popovski [hrv...@srce.hr] wrote:
> 
> box didn't panic, just stopped forwarding traffic through tunnel.

Any chance progress has been made here? Are there any newer versions
of these diffs floating around?



Re: Variable type fix in parse.y (all of them)

2021-10-02 Thread Christian Weisgerber
Here's another attempt, incorporating millert's feedback and adding
a few more casts:

Index: bin/chio/parse.y
===
RCS file: /cvs/src/bin/chio/parse.y,v
retrieving revision 1.23
diff -u -p -r1.23 parse.y
--- bin/chio/parse.y15 Oct 2020 19:42:56 -  1.23
+++ bin/chio/parse.y2 Oct 2021 19:42:06 -
@@ -179,9 +179,9 @@ lookup(char *s)
 
 #define MAXPUSHBACK128
 
-u_char *parsebuf;
+char   *parsebuf;
 int parseindex;
-u_char  pushback_buffer[MAXPUSHBACK];
+charpushback_buffer[MAXPUSHBACK];
 int pushback_index = 0;
 
 int
@@ -192,7 +192,7 @@ lgetc(int quotec)
if (parsebuf) {
/* Read character from the parsebuffer instead of input. */
if (parseindex >= 0) {
-   c = parsebuf[parseindex++];
+   c = (unsigned char)parsebuf[parseindex++];
if (c != '\0')
return (c);
parsebuf = NULL;
@@ -201,7 +201,7 @@ lgetc(int quotec)
}
 
if (pushback_index)
-   return (pushback_buffer[--pushback_index]);
+   return ((unsigned char)pushback_buffer[--pushback_index]);
 
if (quotec) {
if ((c = getc(file->stream)) == EOF) {
@@ -242,10 +242,10 @@ lungetc(int c)
if (parseindex >= 0)
return (c);
}
-   if (pushback_index < MAXPUSHBACK-1)
-   return (pushback_buffer[pushback_index++] = c);
-   else
+   if (pushback_index + 1 >= MAXPUSHBACK)
return (EOF);
+   pushback_buffer[pushback_index++] = c;
+   return (c);
 }
 
 int
@@ -272,8 +272,8 @@ findeol(void)
 int
 yylex(void)
 {
-   u_char   buf[8096];
-   u_char  *p;
+   char buf[8096];
+   char*p;
int  quotec, next, c;
int  token;
 
@@ -353,8 +353,8 @@ yylex(void)
} else {
 nodigits:
while (p > buf + 1)
-   lungetc(*--p);
-   c = *--p;
+   lungetc((unsigned char)*--p);
+   c = (unsigned char)*--p;
if (c == '-')
return (c);
}
Index: sbin/dhcpleased/parse.y
===
RCS file: /cvs/src/sbin/dhcpleased/parse.y,v
retrieving revision 1.4
diff -u -p -r1.4 parse.y
--- sbin/dhcpleased/parse.y 20 Sep 2021 11:46:22 -  1.4
+++ sbin/dhcpleased/parse.y 2 Oct 2021 19:17:33 -
@@ -463,10 +463,10 @@ findeol(void)
 int
 yylex(void)
 {
-   unsigned charbuf[8096];
-   unsigned char   *p, *val;
-   int  quotec, next, c;
-   int  token;
+   char buf[8096];
+   char*p, *val;
+   int  quotec, next, c;
+   int  token;
 
 top:
p = buf;
@@ -502,7 +502,7 @@ top:
p = val + strlen(val) - 1;
lungetc(DONE_EXPAND);
while (p >= val) {
-   lungetc(*p);
+   lungetc((unsigned char)*p);
p--;
}
lungetc(START_EXPAND);
@@ -578,8 +578,8 @@ top:
} else {
 nodigits:
while (p > buf + 1)
-   lungetc(*--p);
-   c = *--p;
+   lungetc((unsigned char)*--p);
+   c = (unsigned char)*--p;
if (c == '-')
return (c);
}
Index: sbin/iked/parse.y
===
RCS file: /cvs/src/sbin/iked/parse.y,v
retrieving revision 1.132
diff -u -p -r1.132 parse.y
--- sbin/iked/parse.y   18 Sep 2021 16:45:52 -  1.132
+++ sbin/iked/parse.y   2 Oct 2021 19:07:12 -
@@ -1510,10 +1510,10 @@ findeol(void)
 int
 yylex(void)
 {
-   unsigned charbuf[8096];
-   unsigned char   *p, *val;
-   int  quotec, next, c;
-   int  token;
+   char buf[8096];
+   char*p, *val;
+   int  quotec, next, c;
+   int  token;
 
 top:
p = buf;
@@ -1549,7 +1549,7 @@ top:
p = val + strlen(val) - 1;
lungetc(DONE_EXPAND);
while (p >= val) {
-   lungetc(*p);
+   lungetc((unsigned char)*p);
p--;
}
lungetc(START_EXPAND);
@@ -1625,8 +1625,8 @@ top:
} else {
 nodigits:
while (p > buf + 1)
-   lungetc(*--p);
-   c = *--p;
+   lungetc((unsigned char)*--p);
+   c = (unsigned char)*--p;
if (c == 

Re: mi_switch() & setting `p_stat'

2021-10-02 Thread Philip Guenther
On Sat, Oct 2, 2021 at 8:57 AM Martin Pieuchot  wrote:

> When a thread running on a CPU schedules itself out, it does the following
> (pseudo_code):
>
> SCHED_LOCK()
> curproc->p_stat = SSLEEP;
> // some more operations
> mi_switch()
>
> The problem with this is that any instrumentation between setting `p_stat'
> and cpu_switchto() is incorrect because 'curproc' is still being executed
> and is not yet sleeping.  Its `p_stat' should be SONPROC and not SSLEEP.
>
...

> To fix this we should set `p_stat' as late a possible, diff below does that
> just before calling cpu_switchto().
>
...

> --- kern/kern_sig.c 28 Sep 2021 10:00:18 -  1.283
> +++ kern/kern_sig.c 2 Oct 2021 17:00:52 -
> @@ -1347,7 +1347,6 @@ proc_stop(struct proc *p, int sw)
> SCHED_ASSERT_LOCKED();
>  #endif
>
> -   p->p_stat = SSTOP;
> atomic_clearbits_int(&pr->ps_flags, PS_WAITED);
> atomic_setbits_int(&pr->ps_flags, PS_STOPPED);
> atomic_setbits_int(&p->p_flag, P_SUSPSIG);
> @@ -1357,7 +1356,7 @@ proc_stop(struct proc *p, int sw)
>  */
> softintr_schedule(proc_stop_si);
> if (sw)
> -   mi_switch();
> +   mi_switch(SSTOP);
>

This pair of chunks is wrong, as then the proc_stop(p, 0) call in
ptsignal() doesn't change the process from SSLEEP to SSTOP.  Sending a stop
signal to a blocked process needs to change its state.


Philip Guenther


Re: mi_switch() & setting `p_stat'

2021-10-02 Thread Mark Kettenis
> Date: Sat, 2 Oct 2021 20:35:41 +0200
> From: Martin Pieuchot 
> 
> On 02/10/21(Sat) 20:24, Mark Kettenis wrote:
> > > Date: Sat, 2 Oct 2021 19:55:49 +0200
> > > From: Martin Pieuchot 
> > > 
> > > When a thread running on a CPU schedules itself out, it does the following
> > > (pseudo_code):
> > > 
> > >   SCHED_LOCK()
> > >   curproc->p_stat = SSLEEP;
> > >   // some more operations
> > >   mi_switch()
> > > 
> > > The problem with this is that any instrumentation between setting `p_stat'
> > > and cpu_switchto() is incorrect because 'curproc' is still being executed
> > > and is not yet sleeping.  Its `p_stat' should be SONPROC and not SSLEEP.
> > 
> > Hmm, well, we're holding the scheduler lock, so nothing should really
> > look at our state at this point...
> 
> I added many TRACEPOINT() to investigate the scheduler's behaviour.  They
> look at those states.
> 
> > > It is possible to reproduce the problem with the following btrace(8) 
> > > script:
> > > 
> > >   tracepoint:sched:enqueue { printf("%d -> enqueue (%d)\n", arg0, arg1); }
> > >   tracepoint:sched:dequeue { printf("%d <- dequeue (%d)\n", arg0, arg1); }
> > >   tracepoint:sched:on__cpu { printf("%d -- on cpu (%d)\n", tid, pid); }
> > > 
> > > At which point the KASSERT() in wakeup_n() triggers if `curproc' is going 
> > > to
> > > sleep and its sleep channel collides with the running btrace(8) program:
> > > 
> > >   dt_prov_static_hook() at dt_prov_static_hook+0xe4
> > >   remrunqueue() at remrunqueue+0x1a4
> > >   sched_chooseproc() at sched_chooseproc+0x200
> > >   mi_switch() at mi_switch+0x178
> > >   sleep_finish() at sleep_finish+0x1d0
> > >   tsleep() at tsleep+0x100
> > >   biowait() at biowait+0x4c
> > >   ffs_read() at ffs_read+0x1c0
> > >   VOP_READ() at VOP_READ+0x44
> > >   vn_read() at vn_read+0x84
> > >   dofilereadv() at dofilereadv+0x8c
> > >   sys_read() at sys_read+0x5c
> > 
> > which suggests that something fishy is going on here.  Did we
> > accidentally introduce a sleeping point in the scheduler?
> 
> There's no sleeping point but a call to wakeup().  This wakeup() is
> supposed to wake a btrace(8) process.  But if the curproc, which just
> added itself to the global sleep queue, ends up in the same bucket as
> the btrace process, the KASSERT() line 565 of kern/kern_synch.c will
> trigger:
> 
> /*
>  * If the rwlock passed to rwsleep() is contended, the
>  * CPU will end up calling wakeup() between sleep_setup()
>  * and sleep_finish().
>  */
> if (p == curproc) {
> KASSERT(p->p_stat == SONPROC);
> continue;
> }

Ah, right.  But that means the comment isn't accurate.  At least there
are other cases that make us hit that codepath.

How useful is that KASSERT in catching actual bugs?

> > > To fix this we should set `p_stat' as late a possible, diff below does 
> > > that
> > > just before calling cpu_switchto().
> > > 
> > > Note that there's an exception for SRUN because setrunqueue() change 
> > > `p_stat'
> > > to indicate the thread is on a queue.  I'll discuss that in an upcoming 
> > > diff.
> > > 
> > > ok?
> > 
> > I'm not necessarily against this diff, but it may hide bugs.  And...
> 
> Updated diff that uses a char.

That looks better.

> Index: kern/kern_sched.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> retrieving revision 1.73
> diff -u -p -r1.73 kern_sched.c
> --- kern/kern_sched.c 9 Sep 2021 18:41:39 -   1.73
> +++ kern/kern_sched.c 2 Oct 2021 17:00:52 -
> @@ -144,10 +144,9 @@ sched_idle(void *v)
>*/
>   SCHED_LOCK(s);
>   cpuset_add(&sched_idle_cpus, ci);
> - p->p_stat = SSLEEP;
>   p->p_cpu = ci;
>   atomic_setbits_int(&p->p_flag, P_CPUPEG);
> - mi_switch();
> + mi_switch(SSLEEP);
>   cpuset_del(&sched_idle_cpus, ci);
>   SCHED_UNLOCK(s);
>  
> @@ -159,8 +158,7 @@ sched_idle(void *v)
>   struct proc *dead;
>  
>   SCHED_LOCK(s);
> - p->p_stat = SSLEEP;
> - mi_switch();
> + mi_switch(SSLEEP);
>   SCHED_UNLOCK(s);
>  
>   while ((dead = LIST_FIRST(&spc->spc_deadproc))) {
> @@ -625,7 +623,7 @@ sched_peg_curproc(struct cpu_info *ci)
>   atomic_setbits_int(&p->p_flag, P_CPUPEG);
>   setrunqueue(ci, p, p->p_usrpri);
>   p->p_ru.ru_nvcsw++;
> - mi_switch();
> + mi_switch(SRUN);
>   SCHED_UNLOCK(s);
>  }
>  
> Index: kern/kern_synch.c
> ===
> RCS file: /cvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.179
> diff -u -p -r1.179 kern_synch.c
> --- kern/kern_synch.c 9 Sep 2021 18:41:39 -   1.179
> +++ kern/kern_synch.c 2 Oct 2021 17:00:52 -
> @@ -421,10 +421,9 @@ sleep_finish(struct s

Re: Relayd daily crash ca_dispatch_relay invalid

2021-10-02 Thread Sebastian Benoit
abyx...@mnetic.ch(abyx...@mnetic.ch) on 2021.10.01 09:56:32 -0400:
> On Fri, Oct 1, 2021, at 09:44, Stuart Henderson wrote:
> > On 2021/10/01 14:43, Stuart Henderson wrote:
> >> On 2021/10/01 09:29, abyx...@mnetic.ch wrote:
> >> > I'm getting a daily crash (call to fatalx). No clue what triggers it and 
> >> > the logging is really sparse. I couldn't follow what the code in ca.c is 
> >> > actually doing (what the hash belongs to that is triggering the crash). 
> >> > A snip from /var/log/daemon is reproduced below. There are no other log 
> >> > messages in any logs around the same time frame as the relayd shutdown. 
> >> > Also, that fd passing failed for https is concerning. Any suggestions in 
> >> > debugging this? OpenBSD 6.9, dmesg at bottom. 
> >> > 
> >> > 
> >> > grep relayd /var/log/daemon
> >> > 876:Sep 30 15:07:39 mnetic relayd[222]: adding 1 hosts from table 
> >> > axolotlfeeds:7053 (no check)
> >> > 877:Sep 30 15:07:39 mnetic relayd[44589]: adding 1 hosts from table 
> >> > axolotlfeeds:7053 (no check)
> >> > 878:Sep 30 15:07:39 mnetic relayd[57595]: adding 1 hosts from table 
> >> > axolotlfeeds:7053 (no check)
> >> > 909:Oct  1 00:42:20 mnetic relayd[95946]: ca: ca_dispatch_relay: invalid 
> >> > relay hash 
> >> > 'SHA256:f545fa75bd01d82c05359e21ae769d525c2971b8221a8e50e415fe3e62ea6553'
> >> > 910:Oct  1 00:42:20 mnetic relayd[44589]: relay: pipe closed
> >> > 911:Oct  1 00:42:20 mnetic relayd[8473]: hce exiting, pid 8473
> >> > 912:Oct  1 00:42:20 mnetic relayd[15293]: pfe exiting, pid 15293
> >> > 913:Oct  1 00:42:20 mnetic relayd[29437]: ca exiting, pid 29437
> >> > 914:Oct  1 00:42:20 mnetic relayd[52845]: ca exiting, pid 52845
> >> > 915:Oct  1 00:42:20 mnetic relayd[99285]: parent terminating, pid 99285
> >> > 916:Oct  1 00:42:20 mnetic relayd[222]: relay exiting, pid 222
> >> > 917:Oct  1 00:42:20 mnetic relayd[57595]: relay exiting, pid 57595
> >> > 921:Oct  1 09:12:59 mnetic relayd[2129]: startup
> >> > 922:Oct  1 09:12:59 mnetic relayd[2129]: config_setrelay: fd passing 
> >> > failed for `https': Too many open files
> >>
> >> ^^^
> >> 
> >> By default the limit for relayd is controlled by the openfiles settings
> >> for the "daemon" class in /etc/login.conf. You can check current use
> >> with fstat | grep -c relayd. If you need to raise it you can add
> >> something like this to the file, replacing XXX with "what you normally
> >> see when it's busy plus some extra leeway" and then "rcctl restart
> >> relayd" so it takes effect.
> >> 
> >> relayd:\
> >> :openfiles-max=XXX:\
> >> :openfiles-cur=XXX:\
> >> :tc=default:
> >
> > Hmm. Rereading your mail it might not be that..but if the limit is too
> > low you could have other things failing as a result..
> 
> Looks like I was using 154 (of 128) files, so I bumped openfiles-cur slightly 
> to accommodate that. 
> Unfortunately it's now a day or two wait to see if that was contributing to 
> the failure. 

relayd should not call fatal() in a case like this, but should handle it
a bit more gracefully. Please let me know if bumping the limit fixes it.




Re: mi_switch() & setting `p_stat'

2021-10-02 Thread Martin Pieuchot
On 02/10/21(Sat) 20:24, Mark Kettenis wrote:
> > Date: Sat, 2 Oct 2021 19:55:49 +0200
> > From: Martin Pieuchot 
> > 
> > When a thread running on a CPU schedules itself out, it does the following
> > (pseudo_code):
> > 
> > SCHED_LOCK()
> > curproc->p_stat = SSLEEP;
> > // some more operations
> > mi_switch()
> > 
> > The problem with this is that any instrumentation between setting `p_stat'
> > and cpu_switchto() is incorrect because 'curproc' is still being executed
> > and is not yet sleeping.  Its `p_stat' should be SONPROC and not SSLEEP.
> 
> Hmm, well, we're holding the scheduler lock, so nothing should really
> look at our state at this point...

I added many TRACEPOINT() to investigate the scheduler's behaviour.  They
look at those states.

> > It is possible to reproduce the problem with the following btrace(8) script:
> > 
> >   tracepoint:sched:enqueue { printf("%d -> enqueue (%d)\n", arg0, arg1); }
> >   tracepoint:sched:dequeue { printf("%d <- dequeue (%d)\n", arg0, arg1); }
> >   tracepoint:sched:on__cpu { printf("%d -- on cpu (%d)\n", tid, pid); }
> > 
> > At which point the KASSERT() in wakeup_n() triggers if `curproc' is going to
> > sleep and its sleep channel collides with the running btrace(8) program:
> > 
> >   dt_prov_static_hook() at dt_prov_static_hook+0xe4
> >   remrunqueue() at remrunqueue+0x1a4
> >   sched_chooseproc() at sched_chooseproc+0x200
> >   mi_switch() at mi_switch+0x178
> >   sleep_finish() at sleep_finish+0x1d0
> >   tsleep() at tsleep+0x100
> >   biowait() at biowait+0x4c
> >   ffs_read() at ffs_read+0x1c0
> >   VOP_READ() at VOP_READ+0x44
> >   vn_read() at vn_read+0x84
> >   dofilereadv() at dofilereadv+0x8c
> >   sys_read() at sys_read+0x5c
> 
> which suggests that something fishy is going on here.  Did we
> accidentally introduce a sleeping point in the scheduler?

There's no sleeping point but a call to wakeup().  This wakeup() is
supposed to wake a btrace(8) process.  But if the curproc, which just
added itself to the global sleep queue, ends up in the same bucket as
the btrace process, the KASSERT() line 565 of kern/kern_synch.c will
trigger:

/*
 * If the rwlock passed to rwsleep() is contended, the
 * CPU will end up calling wakeup() between sleep_setup()
 * and sleep_finish().
 */
if (p == curproc) {
KASSERT(p->p_stat == SONPROC);
continue;
}

> > To fix this we should set `p_stat' as late a possible, diff below does that
> > just before calling cpu_switchto().
> > 
> > Note that there's an exception for SRUN because setrunqueue() change 
> > `p_stat'
> > to indicate the thread is on a queue.  I'll discuss that in an upcoming 
> > diff.
> > 
> > ok?
> 
> I'm not necessarily against this diff, but it may hide bugs.  And...

Updated diff that uses a char.

Index: kern/kern_sched.c
===
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.73
diff -u -p -r1.73 kern_sched.c
--- kern/kern_sched.c   9 Sep 2021 18:41:39 -   1.73
+++ kern/kern_sched.c   2 Oct 2021 17:00:52 -
@@ -144,10 +144,9 @@ sched_idle(void *v)
 */
SCHED_LOCK(s);
cpuset_add(&sched_idle_cpus, ci);
-   p->p_stat = SSLEEP;
p->p_cpu = ci;
atomic_setbits_int(&p->p_flag, P_CPUPEG);
-   mi_switch();
+   mi_switch(SSLEEP);
cpuset_del(&sched_idle_cpus, ci);
SCHED_UNLOCK(s);
 
@@ -159,8 +158,7 @@ sched_idle(void *v)
struct proc *dead;
 
SCHED_LOCK(s);
-   p->p_stat = SSLEEP;
-   mi_switch();
+   mi_switch(SSLEEP);
SCHED_UNLOCK(s);
 
while ((dead = LIST_FIRST(&spc->spc_deadproc))) {
@@ -625,7 +623,7 @@ sched_peg_curproc(struct cpu_info *ci)
atomic_setbits_int(&p->p_flag, P_CPUPEG);
setrunqueue(ci, p, p->p_usrpri);
p->p_ru.ru_nvcsw++;
-   mi_switch();
+   mi_switch(SRUN);
SCHED_UNLOCK(s);
 }
 
Index: kern/kern_synch.c
===
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.179
diff -u -p -r1.179 kern_synch.c
--- kern/kern_synch.c   9 Sep 2021 18:41:39 -   1.179
+++ kern/kern_synch.c   2 Oct 2021 17:00:52 -
@@ -421,10 +421,9 @@ sleep_finish(struct sleep_state *sls, in
}
 
if (do_sleep) {
-   p->p_stat = SSLEEP;
p->p_ru.ru_nvcsw++;
SCHED_ASSERT_LOCKED();
-   mi_switch();
+   mi_switch(SSLEEP);
} else {
unsleep(p);
}
@@ -603,7 +602,7 @@ sys_sched_yield(struct proc *p, void *v,
newprio = max(newprio, q->p_runpri);
setrunqueue(p->p_cpu, p, newprio);
p->p_ru.ru

Re: mi_switch() & setting `p_stat'

2021-10-02 Thread Mark Kettenis
> Date: Sat, 2 Oct 2021 19:55:49 +0200
> From: Martin Pieuchot 
> 
> When a thread running on a CPU schedules itself out, it does the following
> (pseudo_code):
> 
>   SCHED_LOCK()
>   curproc->p_stat = SSLEEP;
>   // some more operations
>   mi_switch()
> 
> The problem with this is that any instrumentation between setting `p_stat'
> and cpu_switchto() is incorrect because 'curproc' is still being executed
> and is not yet sleeping.  Its `p_stat' should be SONPROC and not SSLEEP.

Hmm, well, we're holding the scheduler lock, so nothing should really
look at our state at this point...

> It is possible to reproduce the problem with the following btrace(8) script:
> 
>   tracepoint:sched:enqueue { printf("%d -> enqueue (%d)\n", arg0, arg1); }
>   tracepoint:sched:dequeue { printf("%d <- dequeue (%d)\n", arg0, arg1); }
>   tracepoint:sched:on__cpu { printf("%d -- on cpu (%d)\n", tid, pid); }
> 
> At which point the KASSERT() in wakeup_n() triggers if `curproc' is going to
> sleep and its sleep channel collides with the running btrace(8) program:
> 
>   dt_prov_static_hook() at dt_prov_static_hook+0xe4
>   remrunqueue() at remrunqueue+0x1a4
>   sched_chooseproc() at sched_chooseproc+0x200
>   mi_switch() at mi_switch+0x178
>   sleep_finish() at sleep_finish+0x1d0
>   tsleep() at tsleep+0x100
>   biowait() at biowait+0x4c
>   ffs_read() at ffs_read+0x1c0
>   VOP_READ() at VOP_READ+0x44
>   vn_read() at vn_read+0x84
>   dofilereadv() at dofilereadv+0x8c
>   sys_read() at sys_read+0x5c

which suggests that something fishy is going on here.  Did we
accidentally introduce a sleeping point in the scheduler?

> To fix this we should set `p_stat' as late a possible, diff below does that
> just before calling cpu_switchto().
> 
> Note that there's an exception for SRUN because setrunqueue() change `p_stat'
> to indicate the thread is on a queue.  I'll discuss that in an upcoming diff.
> 
> ok?

I'm not necessarily against this diff, but it may hide bugs.  And...

> Index: kern/kern_sched.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> retrieving revision 1.73
> diff -u -p -r1.73 kern_sched.c
> --- kern/kern_sched.c 9 Sep 2021 18:41:39 -   1.73
> +++ kern/kern_sched.c 2 Oct 2021 17:00:52 -
> @@ -144,10 +144,9 @@ sched_idle(void *v)
>*/
>   SCHED_LOCK(s);
>   cpuset_add(&sched_idle_cpus, ci);
> - p->p_stat = SSLEEP;
>   p->p_cpu = ci;
>   atomic_setbits_int(&p->p_flag, P_CPUPEG);
> - mi_switch();
> + mi_switch(SSLEEP);
>   cpuset_del(&sched_idle_cpus, ci);
>   SCHED_UNLOCK(s);
>  
> @@ -159,8 +158,7 @@ sched_idle(void *v)
>   struct proc *dead;
>  
>   SCHED_LOCK(s);
> - p->p_stat = SSLEEP;
> - mi_switch();
> + mi_switch(SSLEEP);
>   SCHED_UNLOCK(s);
>  
>   while ((dead = LIST_FIRST(&spc->spc_deadproc))) {
> @@ -625,7 +623,7 @@ sched_peg_curproc(struct cpu_info *ci)
>   atomic_setbits_int(&p->p_flag, P_CPUPEG);
>   setrunqueue(ci, p, p->p_usrpri);
>   p->p_ru.ru_nvcsw++;
> - mi_switch();
> + mi_switch(SRUN);
>   SCHED_UNLOCK(s);
>  }
>  
> Index: kern/kern_synch.c
> ===
> RCS file: /cvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.179
> diff -u -p -r1.179 kern_synch.c
> --- kern/kern_synch.c 9 Sep 2021 18:41:39 -   1.179
> +++ kern/kern_synch.c 2 Oct 2021 17:00:52 -
> @@ -421,10 +421,9 @@ sleep_finish(struct sleep_state *sls, in
>   }
>  
>   if (do_sleep) {
> - p->p_stat = SSLEEP;
>   p->p_ru.ru_nvcsw++;
>   SCHED_ASSERT_LOCKED();
> - mi_switch();
> + mi_switch(SSLEEP);
>   } else {
>   unsleep(p);
>   }
> @@ -603,7 +602,7 @@ sys_sched_yield(struct proc *p, void *v,
>   newprio = max(newprio, q->p_runpri);
>   setrunqueue(p->p_cpu, p, newprio);
>   p->p_ru.ru_nvcsw++;
> - mi_switch();
> + mi_switch(SRUN);
>   SCHED_UNLOCK(s);
>  
>   return (0);
> Index: kern/kern_sig.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sig.c,v
> retrieving revision 1.283
> diff -u -p -r1.283 kern_sig.c
> --- kern/kern_sig.c   28 Sep 2021 10:00:18 -  1.283
> +++ kern/kern_sig.c   2 Oct 2021 17:00:52 -
> @@ -1347,7 +1347,6 @@ proc_stop(struct proc *p, int sw)
>   SCHED_ASSERT_LOCKED();
>  #endif
>  
> - p->p_stat = SSTOP;
>   atomic_clearbits_int(&pr->ps_flags, PS_WAITED);
>   atomic_setbits_int(&pr->ps_flags, PS_STOPPED);
>   atomic_setbits_int(&p->p_flag, P_SUSPSIG);
> @@ -1357,7 +1356,7 @@ proc_stop(struct proc *p, int sw)
>*/
>   softintr_schedule(proc_stop_si);
>   if (sw)
> - mi_switch();
> + mi_switch

mi_switch() & setting `p_stat'

2021-10-02 Thread Martin Pieuchot
When a thread running on a CPU schedules itself out, it does the following
(pseudo_code):

SCHED_LOCK()
curproc->p_stat = SSLEEP;
// some more operations
mi_switch()

The problem with this is that any instrumentation between setting `p_stat'
and cpu_switchto() is incorrect because 'curproc' is still being executed
and is not yet sleeping.  Its `p_stat' should be SONPROC and not SSLEEP.

It is possible to reproduce the problem with the following btrace(8) script:

  tracepoint:sched:enqueue { printf("%d -> enqueue (%d)\n", arg0, arg1); }
  tracepoint:sched:dequeue { printf("%d <- dequeue (%d)\n", arg0, arg1); }
  tracepoint:sched:on__cpu { printf("%d -- on cpu (%d)\n", tid, pid); }

At which point the KASSERT() in wakeup_n() triggers if `curproc' is going to
sleep and its sleep channel collides with the running btrace(8) program:

  dt_prov_static_hook() at dt_prov_static_hook+0xe4
  remrunqueue() at remrunqueue+0x1a4
  sched_chooseproc() at sched_chooseproc+0x200
  mi_switch() at mi_switch+0x178
  sleep_finish() at sleep_finish+0x1d0
  tsleep() at tsleep+0x100
  biowait() at biowait+0x4c
  ffs_read() at ffs_read+0x1c0
  VOP_READ() at VOP_READ+0x44
  vn_read() at vn_read+0x84
  dofilereadv() at dofilereadv+0x8c
  sys_read() at sys_read+0x5c

To fix this we should set `p_stat' as late as possible; the diff below does
that just before calling cpu_switchto().

Note that there's an exception for SRUN because setrunqueue() changes `p_stat'
to indicate the thread is on a queue.  I'll discuss that in an upcoming diff.

ok?
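The diff below only touches mi_switch()'s callers; the mi_switch() side of
the change is not shown. Based on the description above, it presumably
looks roughly like this (hypothetical sketch, elided to the relevant part;
the SRUN special case is the exception noted above):

```
/*
 * Hypothetical sketch (not in the diff below): mi_switch() now takes
 * the target state and only assigns p_stat once the thread is committed
 * to switching, immediately before cpu_switchto().  Until that point,
 * instrumentation correctly observes p_stat == SONPROC.
 */
void
mi_switch(int newstat)
{
	struct proc *p = curproc;

	SCHED_ASSERT_LOCKED();
	...
	if (p->p_stat != SRUN)	/* setrunqueue() already set SRUN */
		p->p_stat = newstat;
	cpu_switchto(p, nextproc);
	...
}
```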

Index: kern/kern_sched.c
===
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.73
diff -u -p -r1.73 kern_sched.c
--- kern/kern_sched.c   9 Sep 2021 18:41:39 -   1.73
+++ kern/kern_sched.c   2 Oct 2021 17:00:52 -
@@ -144,10 +144,9 @@ sched_idle(void *v)
 */
SCHED_LOCK(s);
cpuset_add(&sched_idle_cpus, ci);
-   p->p_stat = SSLEEP;
p->p_cpu = ci;
atomic_setbits_int(&p->p_flag, P_CPUPEG);
-   mi_switch();
+   mi_switch(SSLEEP);
cpuset_del(&sched_idle_cpus, ci);
SCHED_UNLOCK(s);
 
@@ -159,8 +158,7 @@ sched_idle(void *v)
struct proc *dead;
 
SCHED_LOCK(s);
-   p->p_stat = SSLEEP;
-   mi_switch();
+   mi_switch(SSLEEP);
SCHED_UNLOCK(s);
 
while ((dead = LIST_FIRST(&spc->spc_deadproc))) {
@@ -625,7 +623,7 @@ sched_peg_curproc(struct cpu_info *ci)
atomic_setbits_int(&p->p_flag, P_CPUPEG);
setrunqueue(ci, p, p->p_usrpri);
p->p_ru.ru_nvcsw++;
-   mi_switch();
+   mi_switch(SRUN);
SCHED_UNLOCK(s);
 }
 
Index: kern/kern_synch.c
===
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.179
diff -u -p -r1.179 kern_synch.c
--- kern/kern_synch.c   9 Sep 2021 18:41:39 -   1.179
+++ kern/kern_synch.c   2 Oct 2021 17:00:52 -
@@ -421,10 +421,9 @@ sleep_finish(struct sleep_state *sls, in
}
 
if (do_sleep) {
-   p->p_stat = SSLEEP;
p->p_ru.ru_nvcsw++;
SCHED_ASSERT_LOCKED();
-   mi_switch();
+   mi_switch(SSLEEP);
} else {
unsleep(p);
}
@@ -603,7 +602,7 @@ sys_sched_yield(struct proc *p, void *v,
newprio = max(newprio, q->p_runpri);
setrunqueue(p->p_cpu, p, newprio);
p->p_ru.ru_nvcsw++;
-   mi_switch();
+   mi_switch(SRUN);
SCHED_UNLOCK(s);
 
return (0);
Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.283
diff -u -p -r1.283 kern_sig.c
--- kern/kern_sig.c 28 Sep 2021 10:00:18 -  1.283
+++ kern/kern_sig.c 2 Oct 2021 17:00:52 -
@@ -1347,7 +1347,6 @@ proc_stop(struct proc *p, int sw)
SCHED_ASSERT_LOCKED();
 #endif
 
-   p->p_stat = SSTOP;
atomic_clearbits_int(&pr->ps_flags, PS_WAITED);
atomic_setbits_int(&pr->ps_flags, PS_STOPPED);
atomic_setbits_int(&p->p_flag, P_SUSPSIG);
@@ -1357,7 +1356,7 @@ proc_stop(struct proc *p, int sw)
 */
softintr_schedule(proc_stop_si);
if (sw)
-   mi_switch();
+   mi_switch(SSTOP);
 }
 
 /*
@@ -1979,8 +1978,7 @@ single_thread_check_locked(struct proc *
}
 
/* not exiting and don't need to unwind, so suspend */
-   p->p_stat = SSTOP;
-   mi_switch();
+   mi_switch(SSTOP);
} while (pr->ps_single != NULL);
}
 
Index: kern/sched_bsd.c
===
RCS file: /cvs/src/sys/kern/sched

Re: [PATCH] usr.sbin/ldapd: Match bind DN by suffix instead of complete DN.

2021-10-02 Thread vifino
On Sat Oct 2, 2021 at 6:36 PM CEST, Raf Czlonka wrote:
> On Sat, Oct 02, 2021 at 02:15:53PM BST, vifino wrote:
> > Index: ldapd.conf.5
> > ===
> > RCS file: /cvs/src/usr.sbin/ldapd/ldapd.conf.5,v
> > retrieving revision 1.27
> > diff -u -p -u -p -r1.27 ldapd.conf.5
> > --- ldapd.conf.524 Jun 2020 07:20:47 -  1.27
> > +++ ldapd.conf.52 Oct 2021 12:43:29 -
> > @@ -270,7 +270,7 @@ Finally, the filter rule can match a bin
> >  The filter rule matches by any bind dn, including anonymous binds.
> >  .It by Ar DN
> >  The filter rule matches only if the requestor has previously performed
> > -a bind as the specified distinguished name.
> > +a bind as the specified distinguished name or a decendant.
>^
> A spellchecker[0] would have caught that ;^)
Ah, yes, of course. The one thing I spent zero effort on.
I haven't quite grokked the workflow, first submitted patch and all.
I'll certainly run `spell` next time.

I'll keep this in mind for the next one. ;)
>
> [0] https://manpages.bsd.lv/part3-3-2.html
>
> Regards,
>
> Raf

Revised patch below, not that it's necessary.
- vifino

---

Index: auth.c
===
RCS file: /cvs/src/usr.sbin/ldapd/auth.c,v
retrieving revision 1.14
diff -u -p -u -p -r1.14 auth.c
--- auth.c  24 Oct 2019 12:39:26 -  1.14
+++ auth.c  2 Oct 2021 17:21:28 -
@@ -94,8 +94,13 @@ aci_matches(struct aci *aci, struct conn
if (strcmp(aci->subject, "@") == 0) {
if (strcmp(dn, conn->binddn) != 0)
return 0;
-   } else if (strcmp(aci->subject, conn->binddn) != 0)
-   return 0;
+   } else {
+   key.size = strlen(conn->binddn);
+   key.data = conn->binddn;
+
+   if (!has_suffix(&key, aci->subject))
+   return 0;
+   }
}
 
if (aci->attribute != NULL) {
Index: ldapd.conf.5
===
RCS file: /cvs/src/usr.sbin/ldapd/ldapd.conf.5,v
retrieving revision 1.27
diff -u -p -u -p -r1.27 ldapd.conf.5
--- ldapd.conf.524 Jun 2020 07:20:47 -  1.27
+++ ldapd.conf.52 Oct 2021 17:21:28 -
@@ -270,7 +270,7 @@ Finally, the filter rule can match a bin
 The filter rule matches by any bind dn, including anonymous binds.
 .It by Ar DN
 The filter rule matches only if the requestor has previously performed
-a bind as the specified distinguished name.
+a bind as the specified distinguished name or a descendant.
 .It by self
 The filter rule matches only if the requestor has previously performed
 a bind as the distinguished name that is being requested.



Re: [PATCH] usr.sbin/ldapd: Match bind DN by suffix instead of complete DN.

2021-10-02 Thread Raf Czlonka
On Sat, Oct 02, 2021 at 02:15:53PM BST, vifino wrote:
> Index: ldapd.conf.5
> ===
> RCS file: /cvs/src/usr.sbin/ldapd/ldapd.conf.5,v
> retrieving revision 1.27
> diff -u -p -u -p -r1.27 ldapd.conf.5
> --- ldapd.conf.5  24 Jun 2020 07:20:47 -  1.27
> +++ ldapd.conf.5  2 Oct 2021 12:43:29 -
> @@ -270,7 +270,7 @@ Finally, the filter rule can match a bin
>  The filter rule matches by any bind dn, including anonymous binds.
>  .It by Ar DN
>  The filter rule matches only if the requestor has previously performed
> -a bind as the specified distinguished name.
> +a bind as the specified distinguished name or a decendant.
   ^
A spellchecker[0] would have caught that ;^)

[0] https://manpages.bsd.lv/part3-3-2.html

Regards,

Raf



[PATCH] usr.sbin/ldapd: Match bind DN by suffix instead of complete DN.

2021-10-02 Thread vifino
Hello.

While using ldapd(8), I noticed that the allow/deny rules match the
bound DN in a (to me) inconvenient way.

Assuming I had a subtree ou=humans,dc=example,dc=org with users in that
subtree, and I wanted to allow all users read access to other
users' info, I would have to create a rule for each user under a
default deny-all policy:

```
deny read,write,bind access to subtree root by any
allow bind access to children of "ou=humans,dc=example,dc=org"

allow read access to subtree "ou=humans,dc=example,dc=org" by \
"uid=bob,ou=humans,dc=example,dc=org"
allow read access to subtree "ou=humans,dc=example,dc=org" by \
"uid=alice,ou=humans,dc=example,dc=org"
allow read access to subtree "ou=humans,dc=example,dc=org" by \
"uid=jane,ou=humans,dc=example,dc=org"
```

Instead, I made the `by DN` part match the suffix,
so the above is still valid, but can also be simplified to the below:

```
deny read,write,bind access to subtree root by any
allow bind access to children of "ou=humans,dc=example,dc=org"

allow read access to subtree "ou=humans,dc=example,dc=org" by \
"ou=humans,dc=example,dc=org"
```

Since the order matters, you can also add a disallow for a specific
uid or something, of course.
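For reference, the matching rule this approach implies — a bind DN matches an ACI subject when the subject is a trailing suffix of the DN — can be sketched in plain C. Note `dn_has_suffix` is a hypothetical stand-alone helper for illustration; ldapd's real has_suffix() operates on btval buffers, not C strings:

```c
#include <string.h>

/* Hypothetical stand-alone version of the suffix match the patch
 * relies on: a bind DN matches an ACI subject when the subject is
 * a trailing suffix of the DN, so a DN matches itself and any rule
 * naming a parent entry. */
static int
dn_has_suffix(const char *dn, const char *suffix)
{
	size_t dlen = strlen(dn), slen = strlen(suffix);

	if (slen > dlen)
		return 0;
	return strcmp(dn + (dlen - slen), suffix) == 0;
}
```

With this rule, the single by "ou=humans,dc=example,dc=org" line in the simplified config covers every user under that subtree.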

I can't think of a need for exact bind matches, as this probably
does what people want. After all, explicitly needing to define
permissions for every member of the tree structure kinda ruins it, no?

Let me know if I need to make any changes.
- vifino

---

Index: auth.c
===
RCS file: /cvs/src/usr.sbin/ldapd/auth.c,v
retrieving revision 1.14
diff -u -p -u -p -r1.14 auth.c
--- auth.c  24 Oct 2019 12:39:26 -  1.14
+++ auth.c  2 Oct 2021 12:43:29 -
@@ -94,8 +94,13 @@ aci_matches(struct aci *aci, struct conn
if (strcmp(aci->subject, "@") == 0) {
if (strcmp(dn, conn->binddn) != 0)
return 0;
-   } else if (strcmp(aci->subject, conn->binddn) != 0)
-   return 0;
+   } else {
+   key.size = strlen(conn->binddn);
+   key.data = conn->binddn;
+
+   if (!has_suffix(&key, aci->subject))
+   return 0;
+   }
}
 
if (aci->attribute != NULL) {
Index: ldapd.conf.5
===
RCS file: /cvs/src/usr.sbin/ldapd/ldapd.conf.5,v
retrieving revision 1.27
diff -u -p -u -p -r1.27 ldapd.conf.5
--- ldapd.conf.524 Jun 2020 07:20:47 -  1.27
+++ ldapd.conf.52 Oct 2021 12:43:29 -
@@ -270,7 +270,7 @@ Finally, the filter rule can match a bin
 The filter rule matches by any bind dn, including anonymous binds.
 .It by Ar DN
 The filter rule matches only if the requestor has previously performed
-a bind as the specified distinguished name.
+a bind as the specified distinguished name or a decendant.
 .It by self
 The filter rule matches only if the requestor has previously performed
 a bind as the distinguished name that is being requested.



Re: Alternate cpu policy on battery

2021-10-02 Thread prx
Hello,
I've been using this diff for a week now.
I didn't encounter any issue.

As my laptop has a quite old battery, I noticed improvements: I can
work 30 min longer on battery now.

The system is a bit slower though (e.g. libreoffice takes longer to
start), but that was expected and I must say I didn't feel any
discomfort either.

From a non-power-tech user's point of view, this patch is a great
improvement.

Regards.

prx


* Solene Rapenne  le [25-09-2021 20:26:06 +0200]:
> Last time I proposed a CPU frequency scheduling change to make it
> more performance oriented; however, this would badly hurt the battery
> life for nomad users.
> 
> As we don't like tweaks and complicated settings, I created a new
> CPU frequency policy that is oriented towards nomad users. It reduces
> performance a bit, but from my (short) experiments it would give a
> way better battery life while keeping the system responsive, which
> apm -L does not.
> 
> The current auto mode is kept as it is now and I added a new
> powersaving mode that will be triggered automatically by apmd when
> the system is on battery AND automatic mode is activated.
> 
> So, on A/C you have the usual automatic policy and on battery it
> would switch on the new powersaving policy.
> 
> I hope this could be useful for nomad people and later we could
> tune the automatic mode (to be used on A/C) to be giving more
> performance, without affecting people on battery.
> 
> The code change is relatively simple, I'm pretty sure this can be
> improved but I wanted to share it as this for now so people can
> comment and/or try it out.
> 
> I created a new policy which is basically a copy/paste of the current
> one with a different name and different thresholds, and declared it
> in apmd. There is another change in apmd to check on event if we
> are using the policy auto and if on battery or not, this takes care
> of changing the policy automatically. I'm not sure it's at the right
> place. As for the code in the kernel, I'm quite sure it could be
> merged into one and use different thresholds depending on the policy
> name currently in use.
> 
> Kernel + usr.sbin/apmd + usr.sbin/apm recompilation required
> 
> Index: sys/kern//sched_bsd.c
> ===
> RCS file: /home/reposync/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.69
> diff -u -p -r1.69 sched_bsd.c
> --- sys/kern//sched_bsd.c 9 Sep 2021 18:41:39 -   1.69
> +++ sys/kern//sched_bsd.c 25 Sep 2021 18:09:40 -
> @@ -531,6 +531,7 @@ void (*cpu_setperf)(int);
>  #define PERFPOL_MANUAL 0
>  #define PERFPOL_AUTO 1
>  #define PERFPOL_HIGH 2
> +#define PERFPOL_POWERSAVING 4
>  int perflevel = 100;
>  int perfpolicy = PERFPOL_MANUAL;
>  
> @@ -541,7 +542,9 @@ int perfpolicy = PERFPOL_MANUAL;
>  #include 
>  
>  void setperf_auto(void *);
> +void setperf_powersaving(void *);
>  struct timeout setperf_to = TIMEOUT_INITIALIZER(setperf_auto, NULL);
> +struct timeout setperf_to_powersaving = 
> TIMEOUT_INITIALIZER(setperf_powersaving, NULL);
>  
>  void
>  setperf_auto(void *v)
> @@ -606,6 +609,73 @@ setperf_auto(void *v)
>   timeout_add_msec(&setperf_to, 100);
>  }
>  
> +void
> +setperf_powersaving(void *v)
> +{
> + static uint64_t *idleticks, *totalticks;
> + static int downbeats;
> +
> + int i, j;
> + int speedup;
> + CPU_INFO_ITERATOR cii;
> + struct cpu_info *ci;
> + uint64_t idle, total, allidle, alltotal;
> +
> + if (perfpolicy != PERFPOL_POWERSAVING)
> + return;
> +
> + if (!idleticks)
> + if (!(idleticks = mallocarray(ncpusfound, sizeof(*idleticks),
> + M_DEVBUF, M_NOWAIT | M_ZERO)))
> + return;
> + if (!totalticks)
> + if (!(totalticks = mallocarray(ncpusfound, sizeof(*totalticks),
> + M_DEVBUF, M_NOWAIT | M_ZERO))) {
> + free(idleticks, M_DEVBUF,
> + sizeof(*idleticks) * ncpusfound);
> + return;
> + }
> +
> + alltotal = allidle = 0;
> + j = 0;
> + speedup = 0;
> + CPU_INFO_FOREACH(cii, ci) {
> + if (!cpu_is_online(ci))
> + continue;
> + total = 0;
> + for (i = 0; i < CPUSTATES; i++) {
> + total += ci->ci_schedstate.spc_cp_time[i];
> + }
> + total -= totalticks[j];
> + idle = ci->ci_schedstate.spc_cp_time[CP_IDLE] - idleticks[j];
> + // speedup if at least one core idle time < 33%
> + if (idle < total / 3)
> + speedup = 1;
> + alltotal += total;
> + allidle += idle;
> + idleticks[j] += idle;
> + totalticks[j] += total;
> + j++;
> + }
> + if (allidle < alltotal / 3)
> + speedup = 1;
> + if (speedup)
> + /* twice as long here because we check every 200ms */
> + dow
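The core decision in the quoted setperf_powersaving() boils down to: raise the frequency when any single online CPU, or the machine as a whole, was idle for less than a third of the sampled ticks. A minimal user-space sketch of that check (a hypothetical helper, not the kernel code):

```c
#include <stdint.h>

/* Speed up when any single core, or the system as a whole, spent
 * less than one third of the sampled ticks idle -- the same two
 * thresholds the proposed setperf_powersaving() checks. */
static int
should_speed_up(const uint64_t *idle, const uint64_t *total, int ncpu)
{
	uint64_t allidle = 0, alltotal = 0;
	int i;

	for (i = 0; i < ncpu; i++) {
		if (idle[i] < total[i] / 3)
			return 1;	/* one busy core is enough */
		allidle += idle[i];
		alltotal += total[i];
	}
	return allidle < alltotal / 3;	/* or the whole system is busy */
}
```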

Re: patch: remove dead variable from sys___realpath()

2021-10-02 Thread Sebastien Marie
On Sat, Oct 02, 2021 at 12:31:29PM +0200, Sebastien Marie wrote:
> Hi,
> 
> The following diff removes a dead variable `c' which is not used.
> 
> It is a leftover from the LOCKPARENT removal in NDINIT().
> 
> See:
>  - 
> https://github.com/openbsd/src/commit/0eff3f09cea339207a839f716e75649e24b667b9
>  - 
> https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/vfs_syscalls.c.diff?r1=1.349&r2=1.350

cvsweb is hard. It is:
  
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/vfs_syscalls.c.diff?r1=1.336&r2=1.337

-- 
Sebastien Marie



patch: remove dead variable from sys___realpath()

2021-10-02 Thread Sebastien Marie
Hi,

The following diff removes a dead variable `c' which is not used.

It is a leftover from the LOCKPARENT removal in NDINIT().

See:
 - 
https://github.com/openbsd/src/commit/0eff3f09cea339207a839f716e75649e24b667b9
 - 
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/vfs_syscalls.c.diff?r1=1.349&r2=1.350

Comments or OK ?
-- 
Sebastien Marie

 
diff 1006e50b83c18340095bb2af9ff6ea6f01c9a107 
72e4555fa87bffa757ba4a351ce114e513650faf
blob - 3a5db48992eac3fb8adaf98e1ed4db186d76ab33
blob + 7e8dbd9bd21f2d3646d6c9fdafb0f5c289174f1f
--- sys/kern/vfs_syscalls.c
+++ sys/kern/vfs_syscalls.c
@@ -862,7 +862,7 @@ sys___realpath(struct proc *p, void *v, register_t *re
syscallarg(const char *) pathname;
syscallarg(char *) resolved;
} */ *uap = v;
-   char *pathname, *c;
+   char *pathname;
char *rpbuf;
struct nameidata nd;
size_t pathlen;
@@ -916,11 +916,6 @@ sys___realpath(struct proc *p, void *v, register_t *re
free(cwdbuf, M_TEMP, cwdlen);
}
 
-   /* find root "/" or "//" */
-   for (c = pathname; *c != '\0'; c++) {
-   if (*c != '/')
-   break;
-   }
NDINIT(&nd, LOOKUP, FOLLOW | SAVENAME | REALPATH, UIO_SYSSPACE,
pathname, p);
 



Re: rtfree(): "rt->rt_refcnt > 0" assertion

2021-10-02 Thread Martin Pieuchot
On 15/09/21(Wed) 01:23, Vitaliy Makkoveev wrote:
> We have weird `rt_refcnt' check in rtfree():

> 
> 497 rtfree(struct rtentry *rt)
> 498 {
>   ...
> 504 refcnt = (int)atomic_dec_int_nv(&rt->rt_refcnt);
> 505 if (refcnt <= 0) {
> 506 KASSERT(!ISSET(rt->rt_flags, RTF_UP));
> 507 KASSERT(!RT_ROOT(rt));
> 508 atomic_dec_int(&rttrash);
> 509 if (refcnt < 0) {
> 510 printf("rtfree: %p not freed (neg refs)\n", rt);
> 511 return;
> 512 }  
> 
> We underflow `rt_refcnt' when we failed to get a reference to this `rt'
> or did an extra release. This is a bug which should be exposed. But
> according to the current code this `rt' is just leaked. Also it's easy to
> miss this error condition because we only print an error message.

Yes, all of this is intentional.  If you screw up reference counting, a
leak is better than a crash that you can't debug.  At least people can
report that there is a leak.

> I propose to put a "rt->rt_refcnt > 0" assertion before we decrement
> `rt_refcnt'. This makes reference counting errors more noticeable when
> they occur. Also I changed the `rt_refcnt' definition to unsigned integer.

Such an assert isn't safe because the value can be modified by another
thread.  You can only assert on the value of a private variable or if
you're in a critical section.
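A pattern that stays within that rule is to assert only on the private value returned by the atomic operation itself. A sketch in portable C11 (atomic_fetch_sub() returns the previous value, whereas the kernel's atomic_dec_int_nv() returns the new one; the idea is the same):

```c
#include <assert.h>
#include <stdatomic.h>

/* Release one reference, asserting only on the private value the
 * atomic op returns: atomic_fetch_sub() yields the value *before*
 * the decrement, so underflow (a release with zero references held)
 * is caught without a second, racy read of the shared counter.
 * Returns nonzero when the last reference was dropped. */
static int
refcnt_release(atomic_uint *refcnt)
{
	unsigned int old = atomic_fetch_sub(refcnt, 1);

	assert(old > 0);	/* private copy: no other thread can race this check */
	return old == 1;	/* 1 -> 0 transition: caller may free the object */
}
```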

> I didn't find any "rtfree: ... not freed (neg refs)" report, so it
> looks like we can't hit this assertion, but I'd like to commit this
> diff after release. But nothing stops you from testing it and providing feedback :)

We did hit it many many times in the past.  Please do not change this.

> Index: sys/net/route.c
> ===
> RCS file: /cvs/src/sys/net/route.c,v
> retrieving revision 1.399
> diff -u -p -r1.399 route.c
> --- sys/net/route.c   25 May 2021 22:45:09 -  1.399
> +++ sys/net/route.c   14 Sep 2021 21:47:11 -
> @@ -496,20 +496,15 @@ rtref(struct rtentry *rt)
>  void
>  rtfree(struct rtentry *rt)
>  {
> - int  refcnt;
> -
>   if (rt == NULL)
>   return;
>  
> - refcnt = (int)atomic_dec_int_nv(&rt->rt_refcnt);
> - if (refcnt <= 0) {
> + KASSERT(rt->rt_refcnt > 0);
> +
> + if (atomic_dec_int_nv(&rt->rt_refcnt) == 0) {
>   KASSERT(!ISSET(rt->rt_flags, RTF_UP));
>   KASSERT(!RT_ROOT(rt));
>   atomic_dec_int(&rttrash);
> - if (refcnt < 0) {
> - printf("rtfree: %p not freed (neg refs)\n", rt);
> - return;
> - }
>  
>   KERNEL_LOCK();
>   rt_timer_remove_all(rt);
> Index: sys/net/route.h
> ===
> RCS file: /cvs/src/sys/net/route.h,v
> retrieving revision 1.185
> diff -u -p -r1.185 route.h
> --- sys/net/route.h   17 Mar 2021 09:05:42 -  1.185
> +++ sys/net/route.h   14 Sep 2021 21:47:11 -
> @@ -113,7 +113,7 @@ struct rtentry {
>   struct rt_kmetrics rt_rmx;  /* metrics used by rx'ing protocols */
>   unsigned int rt_ifidx;  /* the answer: interface to use */
>   unsigned int rt_flags;  /* up/down?, host/net */
> - int  rt_refcnt; /* # held references */
> + unsigned int rt_refcnt; /* # held references */
>   int  rt_plen;   /* prefix length */
>   uint16_t rt_labelid;/* route label ID */
>   uint8_t  rt_priority;   /* routing priority to use */
> 



patch: vfs: merge *_badop to vop_generic_badop

2021-10-02 Thread Sebastien Marie
Hi,

The following diff was suggested by mpi@ some months ago.

It replaces spec_badop, fifo_badop, dead_badop and mfs_badop, which
are all just calls to panic(9), with one unique function,
vop_generic_badop().

No intended behaviour changes (apart from the panic message, which
isn't the same).
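For illustration, the shape of such a unified bad-op handler — every forbidden vnode operation funnelling into a single panic — might look like the sketch below, with panic(9) stubbed out so it can run in user space; this is not the actual kernel function:

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's panic(9); the real one never returns.
 * The message is recorded so the sketch can be exercised. */
static char panicbuf[128];

static void
panic(const char *fmt, const char *arg)
{
	snprintf(panicbuf, sizeof(panicbuf), fmt, arg);
}

/* One shared handler for every vnode operation a filesystem forbids,
 * replacing the per-filesystem spec_badop/fifo_badop/dead_badop/
 * mfs_badop copies. */
int
vop_generic_badop(void *v)
{
	(void)v;
	panic("%s: bad vnode operation", "vop_generic_badop");
	return 0;	/* NOTREACHED in the kernel */
}
```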

Comments or OK ?
-- 
Sebastien Marie


diff 5dc4728f64d299884a31fd6bf0344fc428f17017 /home/semarie/repos/openbsd/sys
blob - 06ab7114aa889fb2ceca3a3a7105acf017fa4396
file + isofs/cd9660/cd9660_vnops.c
--- isofs/cd9660/cd9660_vnops.c
+++ isofs/cd9660/cd9660_vnops.c
@@ -865,8 +865,8 @@ const struct vops cd9660_specvops = {
 
/* XXX: Keep in sync with spec_vops. */
.vop_lookup = vop_generic_lookup,
-   .vop_create = spec_badop,
-   .vop_mknod  = spec_badop,
+   .vop_create = vop_generic_badop,
+   .vop_mknod  = vop_generic_badop,
.vop_open   = spec_open,
.vop_close  = spec_close,
.vop_read   = spec_read,
@@ -876,15 +876,15 @@ const struct vops cd9660_specvops = {
.vop_kqfilter   = spec_kqfilter,
.vop_revoke = vop_generic_revoke,
.vop_fsync  = spec_fsync,
-   .vop_remove = spec_badop,
-   .vop_link   = spec_badop,
-   .vop_rename = spec_badop,
-   .vop_mkdir  = spec_badop,
-   .vop_rmdir  = spec_badop,
-   .vop_symlink= spec_badop,
-   .vop_readdir= spec_badop,
-   .vop_readlink   = spec_badop,
-   .vop_abortop= spec_badop,
+   .vop_remove = vop_generic_badop,
+   .vop_link   = vop_generic_badop,
+   .vop_rename = vop_generic_badop,
+   .vop_mkdir  = vop_generic_badop,
+   .vop_rmdir  = vop_generic_badop,
+   .vop_symlink= vop_generic_badop,
+   .vop_readdir= vop_generic_badop,
+   .vop_readlink   = vop_generic_badop,
+   .vop_abortop= vop_generic_badop,
.vop_bmap   = vop_generic_bmap,
.vop_strategy   = spec_strategy,
.vop_pathconf   = spec_pathconf,
@@ -907,8 +907,8 @@ const struct vops cd9660_fifovops = {
 
/* XXX: Keep in sync with fifo_vops. */
.vop_lookup = vop_generic_lookup,
-   .vop_create = fifo_badop,
-   .vop_mknod  = fifo_badop,
+   .vop_create = vop_generic_badop,
+   .vop_mknod  = vop_generic_badop,
.vop_open   = fifo_open,
.vop_close  = fifo_close,
.vop_read   = fifo_read,
@@ -918,17 +918,17 @@ const struct vops cd9660_fifovops = {
.vop_kqfilter   = fifo_kqfilter,
.vop_revoke = vop_generic_revoke,
.vop_fsync  = nullop,
-   .vop_remove = fifo_badop,
-   .vop_link   = fifo_badop,
-   .vop_rename = fifo_badop,
-   .vop_mkdir  = fifo_badop,
-   .vop_rmdir  = fifo_badop,
-   .vop_symlink= fifo_badop,
-   .vop_readdir= fifo_badop,
-   .vop_readlink   = fifo_badop,
-   .vop_abortop= fifo_badop,
+   .vop_remove = vop_generic_badop,
+   .vop_link   = vop_generic_badop,
+   .vop_rename = vop_generic_badop,
+   .vop_mkdir  = vop_generic_badop,
+   .vop_rmdir  = vop_generic_badop,
+   .vop_symlink= vop_generic_badop,
+   .vop_readdir= vop_generic_badop,
+   .vop_readlink   = vop_generic_badop,
+   .vop_abortop= vop_generic_badop,
.vop_bmap   = vop_generic_bmap,
-   .vop_strategy   = fifo_badop,
+   .vop_strategy   = vop_generic_badop,
.vop_pathconf   = fifo_pathconf,
.vop_advlock= fifo_advlock,
 };
blob - 0e0071ce7194dc5ebf5a59b0c6f14224a6b2b55d
file + kern/spec_vnops.c
--- kern/spec_vnops.c
+++ kern/spec_vnops.c
@@ -64,8 +64,8 @@ struct vnodechain speclisth[SPECHSZ];
 
 const struct vops spec_vops = {
.vop_lookup = vop_generic_lookup,
-   .vop_create = spec_badop,
-   .vop_mknod  = spec_badop,
+   .vop_create = vop_generic_badop,
+   .vop_mknod  = vop_generic_badop,
.vop_open   = spec_open,
.vop_close  = spec_close,
.vop_access = spec_access,
@@ -78,15 +78,15 @@ const struct vops spec_vops = {
.vop_kqfilter   = spec_kqfilter,
.vop_revoke = vop_generic_revoke,
.vop_fsync  = spec_fsync,
-   .vop_remove = spec_badop,
-   .vop_link   = spec_badop,
-   .vop_rename = spec_badop,
-   .vop_mkdir  = spec_badop,
-   .vop_rmdir  = spec_badop,
-   .vop_symlink= spec_badop,
-   .vop_readdir= spec_badop,
-   .vop_readlink   = spec_badop,
-   .vop_abortop= spec_badop,
+   .vop_remove = vop_generic_badop,
+   .vop_link   = vop_generic_badop,
+   .vop_rename = vop_generic_badop,
+   .vop_mkdir  = vop_generic_badop,
+   .vop_rmdir  = vop_generic_badop,
+   .vop_symlink= vop_generic_badop,
+   .vop_readdir= vop_generic_badop,
+   .vop_readli

Switch to kqueue based select(2)

2021-10-02 Thread Martin Pieuchot
As discussed during k2k21 I'd like to switch to the new select(2)
implementation early during this release cycle.  I'd like to first
make sure there's no regression for select(2) and poll(2) then work
towards improving the latency and removing the contention on those
syscalls. 

This has been largely tested but I'd be happy to have more info; at least
espie@ volunteered to put it in through a bulk build ;)

ok?

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.135
diff -u -p -r1.135 sys_generic.c
--- kern/sys_generic.c  8 Jan 2021 09:29:04 -   1.135
+++ kern/sys_generic.c  1 Oct 2021 20:01:44 -
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef KTRACE
 #include 
 #endif
@@ -66,8 +67,21 @@
 
 #include 
 
-int selscan(struct proc *, fd_set *, fd_set *, int, int, register_t *);
-void pollscan(struct proc *, struct pollfd *, u_int, register_t *);
+/*
+ * Debug values:
+ *  1 - print implementation errors, things that should not happen.
+ *  2 - print ppoll(2) information, somewhat verbose
+ *  3 - print pselect(2) and ppoll(2) information, very verbose
+ */
+int kqpoll_debug = 0;
+#define DPRINTFN(v, x...) if (kqpoll_debug > v) {  \
+   printf("%s(%d): ", curproc->p_p->ps_comm, curproc->p_tid);  \
+   printf(x);  \
+}
+
+int pselregister(struct proc *, fd_set *[], int, int *);
+int pselcollect(struct proc *, struct kevent *, fd_set *[], int *);
+
 int pollout(struct pollfd *, struct pollfd *, u_int);
 int dopselect(struct proc *, int, fd_set *, fd_set *, fd_set *,
 struct timespec *, const sigset_t *, register_t *);
@@ -584,11 +598,10 @@ int
 dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex,
 struct timespec *timeout, const sigset_t *sigmask, register_t *retval)
 {
+   struct kqueue_scan_state scan;
fd_mask bits[6];
fd_set *pibits[3], *pobits[3];
-   struct timespec elapsed, start, stop;
-   uint64_t nsecs;
-   int s, ncoll, error = 0;
+   int error, ncollected = 0, nevents = 0;
u_int ni;
 
if (nd < 0)
@@ -618,6 +631,8 @@ dopselect(struct proc *p, int nd, fd_set
pobits[2] = (fd_set *)&bits[5];
}
 
+   kqpoll_init();
+
 #definegetbits(name, x) \
if (name && (error = copyin(name, pibits[x], ni))) \
goto done;
@@ -636,43 +651,61 @@ dopselect(struct proc *p, int nd, fd_set
if (sigmask)
dosigsuspend(p, *sigmask &~ sigcantmask);
 
-retry:
-   ncoll = nselcoll;
-   atomic_setbits_int(&p->p_flag, P_SELECT);
-   error = selscan(p, pibits[0], pobits[0], nd, ni, retval);
-   if (error || *retval)
+   /* Register kqueue events */
+   error = pselregister(p, pibits, nd, &nevents);
+   if (error != 0)
goto done;
-   if (timeout == NULL || timespecisset(timeout)) {
-   if (timeout != NULL) {
-   getnanouptime(&start);
-   nsecs = MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP);
-   } else
-   nsecs = INFSLP;
-   s = splhigh();
-   if ((p->p_flag & P_SELECT) == 0 || nselcoll != ncoll) {
-   splx(s);
-   goto retry;
-   }
-   atomic_clearbits_int(&p->p_flag, P_SELECT);
-   error = tsleep_nsec(&selwait, PSOCK | PCATCH, "select", nsecs);
-   splx(s);
+
+   /*
+* The poll/select family of syscalls has been designed to
+* block when file descriptors are not available, even if
+* there's nothing to wait for.
+*/
+   if (nevents == 0) {
+   uint64_t nsecs = INFSLP;
+
if (timeout != NULL) {
-   getnanouptime(&stop);
-   timespecsub(&stop, &start, &elapsed);
-   timespecsub(timeout, &elapsed, timeout);
-   if (timeout->tv_sec < 0)
-   timespecclear(timeout);
+   if (!timespecisset(timeout))
+   goto done;
+   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
}
-   if (error == 0 || error == EWOULDBLOCK)
-   goto retry;
+   error = tsleep_nsec(&p->p_kq, PSOCK | PCATCH, "kqsel", nsecs);
+   /* select is not restarted after signals... */
+   if (error == ERESTART)
+   error = EINTR;
+   if (error == EWOULDBLOCK)
+   error = 0;
+   goto done;
}
-done:
-   atomic_clearbits_int(&p->p_flag, P_SELECT);
-   /* select is not restarted after signals... */
-   if (error == ERESTART)
-   error = EINTR;
-   if (erro