Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-02 Thread Mark Kettenis
> Date: Sat, 1 Jun 2019 18:55:20 -0300
> From: Martin Pieuchot 
> 
> Diff below exists mainly for documentation and test purposes.  If
> you're not interested in how to break the scheduler internals in
> pieces, don't read further and go straight to testing!

Still digesting this, but...

> - First change is to stop calling tsleep(9) at PUSER.  That makes
>   it clear that all "sleeping priorities" are smaller than PUSER.
>   That's important to understand for the diff below.  `p_priority'
>   is currently a placeholder for the "sleeping priority" and the
>   "runnqueue priority".  Both fields are separated by this diff.

Separating out the fields is a good idea.  The current way priorities
are recorded is just confusing.  The use of PUSER vs. PWAIT seems to
be fairly arbitrary, so that is probably not a big issue.  Except
maybe for the single-threaded signal stuff.  Would be good to get
guenther@'s thoughts on this bit.

The PUSER -> PWAIT change isn't really necessary, is it?  It just makes
it easier for you to understand what's going on when looking at the
queues.

> - When a thread goes to sleep, the priority argument of tsleep(9) is
>   now recorded in `p_slpprio'.  This argument can be considered as part
>   of the sleep queue.  Its purpose is to place the thread into a higher
>   runqueue when awoken.

Great!
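
So if I read this right, the flow is roughly this (sketch only, names
taken from your mail rather than from the diff itself):

	/* going to sleep: remember the tsleep(9) priority argument */
	p->p_slpprio = prio & PRIMASK;

	/* wakeup: pick the runqueue from it, no p_priority involved */
	setrunqueue(p->p_cpu, p, p->p_slpprio);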

> - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
>   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
>   in the preferred queue without having to use `p_priority'.  Note that
>   `p_usrpri' is still recalculated *after* having called setrunqueue().
>   This is currently fine because setrunnable() is called with SCHED_LOCK() 
>   but it will become racy once we split it.
> 
> - A new field, `p_runprio' has been introduced.  It should be considered
>   as part of the per-CPU runqueues.  It indicates where a current thread
>   is placed.

You made this a uint8_t, whereas the other priority fields are all
u_char.  Different names for the same thing, but it probably is a good
idea to keep this consistent for now.
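
In other words, something like this in struct proc (just a sketch, the
comments are mine):

	u_char	p_usrpri;	/* user priority, from p_estcpu and nice */
	u_char	p_slpprio;	/* priority handed to tsleep(9) */
	u_char	p_runprio;	/* priority the thread is queued at */

with `p_priority' gone.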

> - `spc_curpriority' is now updated at every context-switch.  That means
>   need_resched() won't be called after comparing an out-of-date value.
>   At the same time, `p_usrpri' is initialized to the highest possible
>   value for idle threads.
> 
> - resched_proc() was calling need_resched() in the following conditions:
>    - If the SONPROC thread has a higher priority than the current
>      running thread (itself).
>    - Twice in setrunnable() when we know that p_priority <= p_usrpri.
>    - If schedcpu() considered that a thread, after updating its prio,
>      should preempt the one running on the CPU pointed by `p_cpu'.
> 
>   The diff below simplifies all of that by calling need_resched() when:
>    - A thread is inserted in a CPU runqueue at a higher priority than
>      the one SONPROC.
>    - schedcpu() decides that a thread in SRUN state should preempt the
>      one SONPROC.
> 
> - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
>   of a thread, are now updated while holding a per-thread mutex.  As a
>   result schedclock() and donice() no longer take the SCHED_LOCK(),
>   and schedcpu() almost never takes it.

Need to look closer at how this works.
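
If I understand the intent, the shape is something like this (sketch;
`p_mtx' is just a placeholder name for the per-thread mutex):

	/* schedclock(): charge the running thread, no SCHED_LOCK() */
	mtx_enter(&p->p_mtx);
	p->p_estcpu = ESTCPULIM(p->p_estcpu + 1);
	/* recompute p_usrpri from p_estcpu and the nice value */
	mtx_leave(&p->p_mtx);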

> - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
>   when displaying priorities.  This is helpful to understand what's
>   happening:

Do you intend to remove that bit before committing this?

> load averages:  0.99,  0.56,  0.25   two.lab.grenadille.net   23:42:10
> 70 threads: 68 idle, 2 on processor  up  0:09
> CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
> 
>   PID  TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> 81000   145101  7200K 1664K sleep/1   bored 1:15 36.96% softnet
> 47133   244097  730 2984K 4408K sleep/1   netio 1:06 35.06% cvs 
> 64749   522184  660  176K  148K onproc/1  - 0:55 28.81% nfsd
> 21615   602473 12700K 1664K sleep/0   - 7:22  0.00% idle0  
> 12413   606242 12700K 1664K sleep/1   - 7:08  0.00% idle1
> 85778   338258  500 4936K 7308K idle  select0:10  0.00% ssh  
> 22771   575513  500  176K  148K sleep/0   nfsd  0:02  0.00% nfsd 
> 
> 
> 
> - The removal of `p_priority' and the change that makes mi_switch()
>   always update `spc_curpriority' might introduce some changes in
>   behavior, especially with kernel threads that were not going through
>   tsleep(9).  We currently have some situations where the priority of
>   the running thread isn't correctly reflected.  This diff changes that
>   which means we should be able to better understand where the problems
>   are.

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-02 Thread Martin Pieuchot
On 01/06/19(Sat) 18:55, Martin Pieuchot wrote:
> Diff below exists mainly for documentation and test purposes.  If
> you're not interested in how to break the scheduler internals in
> pieces, don't read further and go straight to testing!
> 
> - First change is to stop calling tsleep(9) at PUSER.  That makes
>   it clear that all "sleeping priorities" are smaller than PUSER.
>   That's important to understand for the diff below.  `p_priority'
>   is currently a placeholder for the "sleeping priority" and the
>   "runnqueue priority".  Both fields are separated by this diff.
> 
> - When a thread goes to sleep, the priority argument of tsleep(9) is
>   now recorded in `p_slpprio'.  This argument can be considered as part
>   of the sleep queue.  Its purpose is to place the thread into a higher
>   runqueue when awoken.
> 
> - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
>   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
>   in the preferred queue without having to use `p_priority'.  Note that
>   `p_usrpri' is still recalculated *after* having called setrunqueue().
>   This is currently fine because setrunnable() is called with SCHED_LOCK() 
>   but it will become racy once we split it.
> 
> - A new field, `p_runprio' has been introduced.  It should be considered
>   as part of the per-CPU runqueues.  It indicates where a current thread
>   is placed.
> 
> - `spc_curpriority' is now updated at every context-switch.  That means
>   need_resched() won't be called after comparing an out-of-date value.
>   At the same time, `p_usrpri' is initialized to the highest possible
>   value for idle threads.
> 
> - resched_proc() was calling need_resched() in the following conditions:
>    - If the SONPROC thread has a higher priority than the current
>      running thread (itself).
>    - Twice in setrunnable() when we know that p_priority <= p_usrpri.
>    - If schedcpu() considered that a thread, after updating its prio,
>      should preempt the one running on the CPU pointed by `p_cpu'.
> 
>   The diff below simplifies all of that by calling need_resched() when:
>    - A thread is inserted in a CPU runqueue at a higher priority than
>      the one SONPROC.
>    - schedcpu() decides that a thread in SRUN state should preempt the
>      one SONPROC.
> 
> - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
>   of a thread, are now updated while holding a per-thread mutex.  As a
>   result schedclock() and donice() no longer take the SCHED_LOCK(),
>   and schedcpu() almost never takes it.
> 
> - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
>   when displaying priorities.  This is helpful to understand what's
>   happening:
> 
> load averages:  0.99,  0.56,  0.25   two.lab.grenadille.net   23:42:10
> 70 threads: 68 idle, 2 on processor  up  0:09
> CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
> 
>   PID  TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> 81000   145101  7200K 1664K sleep/1   bored 1:15 36.96% softnet
> 47133   244097  730 2984K 4408K sleep/1   netio 1:06 35.06% cvs 
> 64749   522184  660  176K  148K onproc/1  - 0:55 28.81% nfsd
> 21615   602473 12700K 1664K sleep/0   - 7:22  0.00% idle0  
> 12413   606242 12700K 1664K sleep/1   - 7:08  0.00% idle1
> 85778   338258  500 4936K 7308K idle  select0:10  0.00% ssh  
> 22771   575513  500  176K  148K sleep/0   nfsd  0:02  0.00% nfsd 
> 
> 
> 
> - The removal of `p_priority' and the change that makes mi_switch()
>   always update `spc_curpriority' might introduce some changes in
>   behavior, especially with kernel threads that were not going through
>   tsleep(9).  We currently have some situations where the priority of
>   the running thread isn't correctly reflected.  This diff changes that
>   which means we should be able to better understand where the problems
>   are.
> 
> I'd be interested in comments/tests/reviews before continuing in this
> direction.  Note that at least part of this diff are required to split
> the accounting apart from the SCHED_LOCK() as well.
> 
> I'll also work on exporting scheduler statistics unless somebody wants
> to beat me :)

Updated diff to use IPL_SCHED and rebased to apply on top of -current :) 
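
For reference, the new preemption rule described above boils down to
something like this (illustrative sketch, not the exact code in the
diff below):

	/* inserting `p' into ci's runqueue at priority `prio' */
	p->p_runprio = prio;
	TAILQ_INSERT_TAIL(&spc->spc_qs[prio >> 2], p, p_runq);
	if (prio < spc->spc_curpriority)	/* smaller value = higher prio */
		need_resched(ci);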

Index: arch/amd64/amd64/genassym.cf
===
RCS file: /cvs/src/sys/arch/amd64/amd64/genassym.cf,v
retrieving revision 1.40
diff -u -p -r1.40 genassym.cf
--- arch/amd64/amd64/genassym.cf17 May 2019 19:07:15 -  1.40
+++ arch/amd64/amd64/genassym.cf1 Jun 2019 16:27:46 -
@@ -32,7 +32,6 @@ export

Re: mtx_enter_try(9) & recursion

2019-06-02 Thread Mark Kettenis
> Date: Sun, 2 Jun 2019 15:39:59 -0300
> From: Martin Pieuchot 
> Cc: tech@openbsd.org
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
> 
> On 02/06/19(Sun) 15:50, Visa Hankala wrote:
> > On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote:
> > > On 02/06/19(Sun) 04:30, Visa Hankala wrote:
> > > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote:
> > > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote:
> > > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300
> > > > > > > From: Martin Pieuchot 
> > > > > > > 
> > > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already
> > > > > > > holding the mutex.  That means it isn't safe to call that function
> > > > > > > in hardclock(9), like with `windup_mtx'.  That's why the mutex 
> > > > > > > needs
> > > > > > > to be initialized as IPL_CLOCK.
> > > > > > > 
> > > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9).
> > > > > > > That leads me to wonder if I should initialize all mutexes to 
> > > > > > > IPL_SCHED,
> > > > > > > possibly blocking clock interrupts, or if we should change the 
> > > > > > > mutex API
> > > > > > > to allow mtx_enter_try(9) to deal with recursion.
> > > > > > > 
> > > > > > > The diff below removes the recursion check for mtx_enter_try(9).
> > > > > > > 
> > > > > > > Comments?  Oks?
> > > > > > 
> > > > > > My initial reaction is that if you're trying to lock when you 
> > > > > > already
> > > > > > have the lock, there is something wrong with your locking strategy 
> > > > > > and
> > > > > > that this is something we don't want.
> > > > > 
> > > > > Could you elaborate?  Are you saying that preventing hardclock(9) to 
> > > > > run
> > > > > is the way to move forward to unlock its internals?  Why isn't that
> > > > > strategy wrong?
> > > > > 
> > > > > In the `windup_mtx' case, does it matter if the mutex is taken by
> > > > > another CPU or by myself?  What's the problem when CPU0 is one holding
> > > > > the lock?
> > > > 
> > > > mutex(9) is not and should not become recursive. Recursive locking
> > > > works when it is voluntary. If recursion was allowed with interrupts,
> > > > the CPU could re-enter the critical section at any moment, possibly
> > > > seeing inconsistent state or breaking assumptions made by the original
> > > > entry.
> > > 
> > > I don't understand how your answer is related to my diff :(
> > > 
> > > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU
> > > executing the code already owns the mutex is wrong?
> > > 
> > > Is that what you call recursion?  Not entering the critical section in
> > > the interrupt instead of blocking the interrupt?
> > > 
> > > How is that different from not entering the critical section because
> > > another CPU is holding the mutex?
> > 
> > To me, it looks very suspicious if a CPU tries to lock a mutex that it
> > already holds because it indicates that there is a potential recursion.
> > My stance is that such a locking behaviour is a design error in general
> > and the current way of checking recursion in mtx_enter_try() is sound.
> 
> But why?  Nobody is answering my question :o)  You're saying recursion
> is bad because there's recursion.

I think recursion makes it harder to reason about lock ordering...

> How is that different from using mtx_enter_try(9) to protect from
> another CPU being in the critical section?  Why do you call that a
> design error?

To me, the use of mtx_enter_try(9) is suspicious to start with.  It
either means there is some magic going on to avoid a lock ordering
problem, or the code is opportunistically trying to do some work that
doesn't necessarily need to be done at this point.  Ideally, I'd like
to see every mtx_enter_try(9) call annotated with an explanation why
it isn't a simple mtx_enter(9).

> I don't mind putting all mutexes at IPL_SCHED.  I would like to
> understand why it matters *who* is holding the mutex.  Since
> mtx_enter_try(9) is here to not enter the critical section for $reasons,
> setting an ipl of IPL_CLOCK sounds just like a special corner case.

I wonder what $reasons is in the case of tc_ticktock().  I think it is
opportunistically trying to avoid spinning on the lock, which is fine
since presumably some other thread is already busy doing the work.
But I'd argue that the code would still be correct if tc_ticktock()
would simply call mtx_enter().  At which point the IPL_CLOCK would be
mandatory.
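
In other words something like this (simplified sketch, not a verbatim
copy of kern_tc.c):

	void
	tc_ticktock(void)
	{
		if (!mtx_enter_try(&windup_mtx))
			return;		/* someone else is winding up the clock */
		/* ... wind up the timecounter ... */
		mtx_leave(&windup_mtx);
	}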

> You might understand why I'm asking.  We are trying to get more stuff
> done in parallel.  That implies separating parts of the kernel logically.
> Blocking IPL_CLOCK is a big hammer, it blocks a lot of stuff.  So yes it
> is easy to use, but my feeling is that this simplicity comes at a cost,
> latency wise.

I'm not so sure this really matters.  In principle you're just
delaying the work while you're busy doing some other work.  So you're
not really doing less work in parallel.  You're right about latency

request for testing: bootstrapping time

2019-06-02 Thread Otto Moerbeek
Hi,

Bootstrapping time is a bit of a hard problem.

The hardest scenario is: resolver running in DNSSEC mode on the local
machine, on a machine that does not have a real-time clock with battery
backup.  Its time will be wrong, especially on a cold boot.

1. Wrong time, so dnssec validation fails -> resolving is not possible
and we do not have the IP addresses of ntp servers or constraint sites
to contact.  To fix, switch to CD DNS mode (Checking Disabled): now we
have IPs.  Once time is synced we force a re-resolve to get DNSSEC
validated IP addresses.  I committed these changes recently.

2. Try to get constraints (validation using https is needed since the
ntp protocol itself is not secure) -> https certificate validation
will fail since time is wrong. So use time in the HTTP header to
validate the certificate.  This change (by jsing@) was committed a few
months ago.

So we're now at the point to do the last step: automatic setting of
time on boot.

This diff enables automatic mode if we are booting (securelevel
is 0) and at least one constraint source is defined.  We really want
to do constraint validation (see above).  If -s is given as an
argument, we do whatever that mode did before, so that behaviour has
not changed.

We set the time once a couple of constraint-validated ntp replies have
been received.  We refuse to set the time backwards (this should not
happen, since even without a proper RTC the last mount time of the root
dir is used to initialize the clock), or with only a small (< 60s)
offset: in that case regular ntpd operation will correct the time.

To test: make sure you are current, the diff below depends on recent
commits.  Apply the diff.  Set up ntpd to run without flags and with
your favorite ntp servers and/or pools, specify at least one constraint
url and see what happens when booting.  To make it more challenging,
use unbound on the local machine while dnssec is enabled, use a machine
without an RTC or set the time wrong (in the past, since setting the
time backwards in automatic mode is not supported!) and do a cold boot.

Test reports very much appreciated!

-Otto

Index: client.c
===
RCS file: /cvs/src/usr.sbin/ntpd/client.c,v
retrieving revision 1.106
diff -u -p -r1.106 client.c
--- client.c29 May 2019 18:48:33 -  1.106
+++ client.c1 Jun 2019 17:20:42 -
@@ -29,6 +29,8 @@
 #include "ntpd.h"
 
 int    client_update(struct ntp_peer *);
+int    auto_cmp(const void *, const void *);
+void   handle_auto(double);
 void   set_deadline(struct ntp_peer *, time_t);
 
 void
@@ -213,7 +215,47 @@ client_query(struct ntp_peer *p)
 }
 
 int
-client_dispatch(struct ntp_peer *p, u_int8_t settime)
+auto_cmp(const void *a, const void *b)
+{
+   double at = *(const double *)a;
+   double bt = *(const double *)b;
+   return at < bt ? -1 : (at > bt ? 1 : 0);
+}
+
+void
+handle_auto(double offset)
+{
+   static int count;
+   static double v[AUTO_REPLIES];
+
+   /*
+    * It happens the (constraint) resolves initially fail, don't give up
+    * but see if we get validated replies later.
+    */
+   if (conf->constraint_median == 0)
+   return;
+
+   if (offset < AUTO_THRESHOLD) {
+   /* don't bother */
+   priv_settime(0);
+   return;
+   }
+   /* collect some more */
+   v[count++] = offset;
+   if (count < AUTO_REPLIES)
+   return;
+   
+   /* we have enough */
+   qsort(v, count, sizeof(double), auto_cmp);
+   if (AUTO_REPLIES % 2 == 0)
+   offset = (v[AUTO_REPLIES / 2 - 1] + v[AUTO_REPLIES / 2]) / 2;
+   else
+   offset = v[AUTO_REPLIES / 2];
+   priv_settime(offset);
+}
+
+int
+client_dispatch(struct ntp_peer *p, u_int8_t settime, u_int8_t automatic)
 {
struct ntp_msg   msg;
struct msghdrsomsg;
@@ -385,7 +427,9 @@ client_dispatch(struct ntp_peer *p, u_in
if (p->trustlevel < TRUSTLEVEL_PATHETIC)
interval = scale_interval(INTERVAL_QUERY_PATHETIC);
else if (p->trustlevel < TRUSTLEVEL_AGGRESSIVE)
-   interval = scale_interval(INTERVAL_QUERY_AGGRESSIVE);
+   interval = (conf->settime && conf->automatic) ?
+   INTERVAL_QUERY_ULTRA_VIOLENCE :
+   scale_interval(INTERVAL_QUERY_AGGRESSIVE);
else
interval = scale_interval(INTERVAL_QUERY_NORMAL);
 
@@ -408,8 +452,12 @@ client_dispatch(struct ntp_peer *p, u_in
(long long)interval);
 
client_update(p);
-   if (settime)
-   priv_settime(p->reply[p->shift].offset);
+   if (settime) {
+   if (automatic)
+   handle_auto(p->reply[p->shift].offset);
+   else
+   priv_settime(p->reply[p->shift].offset);
+   }
 
if (++p->shift >= OFFSET_ARRAY_SIZE)
p->shift = 

Re: mtx_enter_try(9) & recursion

2019-06-02 Thread Martin Pieuchot
On 02/06/19(Sun) 15:50, Visa Hankala wrote:
> On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote:
> > On 02/06/19(Sun) 04:30, Visa Hankala wrote:
> > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote:
> > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote:
> > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300
> > > > > > From: Martin Pieuchot 
> > > > > > 
> > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already
> > > > > > holding the mutex.  That means it isn't safe to call that function
> > > > > > in hardclock(9), like with `windup_mtx'.  That's why the mutex needs
> > > > > > to be initialized as IPL_CLOCK.
> > > > > > 
> > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9).
> > > > > > That leads me to wonder if I should initialize all mutexes to 
> > > > > > IPL_SCHED,
> > > > > > possibly blocking clock interrupts, or if we should change the 
> > > > > > mutex API
> > > > > > to allow mtx_enter_try(9) to deal with recursion.
> > > > > > 
> > > > > > The diff below removes the recursion check for mtx_enter_try(9).
> > > > > > 
> > > > > > Comments?  Oks?
> > > > > 
> > > > > My initial reaction is that if you're trying to lock when you already
> > > > > have the lock, there is something wrong with your locking strategy and
> > > > > that this is something we don't want.
> > > > 
> > > > Could you elaborate?  Are you saying that preventing hardclock(9) to run
> > > > is the way to move forward to unlock its internals?  Why isn't that
> > > > strategy wrong?
> > > > 
> > > > In the `windup_mtx' case, does it matter if the mutex is taken by
> > > > another CPU or by myself?  What's the problem when CPU0 is one holding
> > > > the lock?
> > > 
> > > mutex(9) is not and should not become recursive. Recursive locking
> > > works when it is voluntary. If recursion was allowed with interrupts,
> > > the CPU could re-enter the critical section at any moment, possibly
> > > seeing inconsistent state or breaking assumptions made by the original
> > > entry.
> > 
> > I don't understand how your answer is related to my diff :(
> > 
> > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU
> > executing the code already owns the mutex is wrong?
> > 
> > Is that what you call recursion?  Not entering the critical section in
> > the interrupt instead of blocking the interrupt?
> > 
> > How is that different from not entering the critical section because
> > another CPU is holding the mutex?
> 
> To me, it looks very suspicious if a CPU tries to lock a mutex that it
> already holds because it indicates that there is a potential recursion.
> My stance is that such a locking behaviour is a design error in general
> and the current way of checking recursion in mtx_enter_try() is sound.

But why?  Nobody is answering my question :o)  You're saying recursion
is bad because there's recursion.

How is that different from using mtx_enter_try(9) to protect from
another CPU being in the critical section?  Why do you call that a
design error?

I don't mind putting all mutexes at IPL_SCHED.  I would like to
understand why it matters *who* is holding the mutex.  Since
mtx_enter_try(9) is here to not enter the critical section for $reasons,
setting an ipl of IPL_CLOCK sounds just like a special corner case.

You might understand why I'm asking.  We are trying to get more stuff
done in parallel.  That implies separating parts of the kernel logically.
Blocking IPL_CLOCK is a big hammer, it blocks a lot of stuff.  So yes it
is easy to use, but my feeling is that this simplicity comes at a cost,
latency wise.

Anyway I'll drop this diff.  I'll try to think about your point :o)



Re: mtx_enter_try(9) & recursion

2019-06-02 Thread Mark Kettenis
> Date: Sun, 2 Jun 2019 15:50:35 +
> From: Visa Hankala 
> 
> On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote:
> > On 02/06/19(Sun) 04:30, Visa Hankala wrote:
> > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote:
> > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote:
> > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300
> > > > > > From: Martin Pieuchot 
> > > > > > 
> > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already
> > > > > > holding the mutex.  That means it isn't safe to call that function
> > > > > > in hardclock(9), like with `windup_mtx'.  That's why the mutex needs
> > > > > > to be initialized as IPL_CLOCK.
> > > > > > 
> > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9).
> > > > > > That leads me to wonder if I should initialize all mutexes to 
> > > > > > IPL_SCHED,
> > > > > > possibly blocking clock interrupts, or if we should change the 
> > > > > > mutex API
> > > > > > to allow mtx_enter_try(9) to deal with recursion.
> > > > > > 
> > > > > > The diff below removes the recursion check for mtx_enter_try(9).
> > > > > > 
> > > > > > Comments?  Oks?
> > > > > 
> > > > > My initial reaction is that if you're trying to lock when you already
> > > > > have the lock, there is something wrong with your locking strategy and
> > > > > that this is something we don't want.
> > > > 
> > > > Could you elaborate?  Are you saying that preventing hardclock(9) to run
> > > > is the way to move forward to unlock its internals?  Why isn't that
> > > > strategy wrong?
> > > > 
> > > > In the `windup_mtx' case, does it matter if the mutex is taken by
> > > > another CPU or by myself?  What's the problem when CPU0 is one holding
> > > > the lock?
> > > 
> > > mutex(9) is not and should not become recursive. Recursive locking
> > > works when it is voluntary. If recursion was allowed with interrupts,
> > > the CPU could re-enter the critical section at any moment, possibly
> > > seeing inconsistent state or breaking assumptions made by the original
> > > entry.
> > 
> > I don't understand how your answer is related to my diff :(
> > 
> > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU
> > executing the code already owns the mutex is wrong?
> > 
> > Is that what you call recursion?  Not entering the critical section in
> > the interrupt instead of blocking the interrupt?
> > 
> > How is that different from not entering the critical section because
> > another CPU is holding the mutex?
> 
> To me, it looks very suspicious if a CPU tries to lock a mutex that it
> already holds because it indicates that there is a potential recursion.
> My stance is that such a locking behaviour is a design error in general
> and the current way of checking recursion in mtx_enter_try() is sound.
> 
> If a mutex can be taken on multiple interrupt priority levels, the lock
> has to be configured to block the highest level, and mutexes used
> in hardclock() are no exception.

Agreed.  I think relaxing the rules here is unwise.  It is much easier
to audit the code for use of mutexes in interrupt code with the simple
rule that visa has written above.

So yes, I think you should have the mutexes block
IPL_CLOCK/IPL_STATCLOCK/IPL_SCHED, or whatever else is appropriate.  I
don't think that will be a problem as SCHED_LOCK already does a
splsched().
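
Concretely that just means initializing such a mutex with the highest
level it can be taken from, e.g.

	mtx_init(&windup_mtx, IPL_CLOCK);

or the static MUTEX_INITIALIZER(IPL_CLOCK) equivalent.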



Re: mtx_enter_try(9) & recursion

2019-06-02 Thread Visa Hankala
On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote:
> On 02/06/19(Sun) 04:30, Visa Hankala wrote:
> > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote:
> > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote:
> > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300
> > > > > From: Martin Pieuchot 
> > > > > 
> > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already
> > > > > holding the mutex.  That means it isn't safe to call that function
> > > > > in hardclock(9), like with `windup_mtx'.  That's why the mutex needs
> > > > > to be initialized as IPL_CLOCK.
> > > > > 
> > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9).
> > > > > That leads me to wonder if I should initialize all mutexes to 
> > > > > IPL_SCHED,
> > > > > possibly blocking clock interrupts, or if we should change the mutex 
> > > > > API
> > > > > to allow mtx_enter_try(9) to deal with recursion.
> > > > > 
> > > > > The diff below removes the recursion check for mtx_enter_try(9).
> > > > > 
> > > > > Comments?  Oks?
> > > > 
> > > > My initial reaction is that if you're trying to lock when you already
> > > > have the lock, there is something wrong with your locking strategy and
> > > > that this is something we don't want.
> > > 
> > > Could you elaborate?  Are you saying that preventing hardclock(9) to run
> > > is the way to move forward to unlock its internals?  Why isn't that
> > > strategy wrong?
> > > 
> > > In the `windup_mtx' case, does it matter if the mutex is taken by
> > > another CPU or by myself?  What's the problem when CPU0 is one holding
> > > the lock?
> > 
> > mutex(9) is not and should not become recursive. Recursive locking
> > works when it is voluntary. If recursion was allowed with interrupts,
> > the CPU could re-enter the critical section at any moment, possibly
> > seeing inconsistent state or breaking assumptions made by the original
> > entry.
> 
> I don't understand how your answer is related to my diff :(
> 
> Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU
> executing the code already owns the mutex is wrong?
> 
> Is that what you call recursion?  Not entering the critical section in
> the interrupt instead of blocking the interrupt?
> 
> How is that different from not entering the critical section because
> another CPU is holding the mutex?

To me, it looks very suspicious if a CPU tries to lock a mutex that it
already holds because it indicates that there is a potential recursion.
My stance is that such a locking behaviour is a design error in general
and the current way of checking recursion in mtx_enter_try() is sound.

If a mutex can be taken on multiple interrupt priority levels, the lock
has to be configured to block the highest level, and mutexes used
in hardclock() are no exception.



Re: mtx_enter_try(9) & recursion

2019-06-02 Thread Martin Pieuchot
On 02/06/19(Sun) 04:30, Visa Hankala wrote:
> On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote:
> > On 01/06/19(Sat) 23:22, Mark Kettenis wrote:
> > > > Date: Sat, 1 Jun 2019 17:32:52 -0300
> > > > From: Martin Pieuchot 
> > > > 
> > > > Currently it isn't safe to call mtx_enter_try(9) if you're already
> > > > holding the mutex.  That means it isn't safe to call that function
> > > > in hardclock(9), like with `windup_mtx'.  That's why the mutex needs
> > > > to be initialized as IPL_CLOCK.
> > > > 
> > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9).
> > > > That leads me to wonder if I should initialize all mutexes to IPL_SCHED,
> > > > possibly blocking clock interrupts, or if we should change the mutex API
> > > > to allow mtx_enter_try(9) to deal with recursion.
> > > > 
> > > > The diff below removes the recursion check for mtx_enter_try(9).
> > > > 
> > > > Comments?  Oks?
> > > 
> > > My initial reaction is that if you're trying to lock when you already
> > > have the lock, there is something wrong with your locking strategy and
> > > that this is something we don't want.
> > 
> > Could you elaborate?  Are you saying that preventing hardclock(9) to run
> > is the way to move forward to unlock its internals?  Why isn't that
> > strategy wrong?
> > 
> > In the `windup_mtx' case, does it matter if the mutex is taken by
> > another CPU or by myself?  What's the problem when CPU0 is one holding
> > the lock?
> 
> mutex(9) is not and should not become recursive. Recursive locking
> works when it is voluntary. If recursion was allowed with interrupts,
> the CPU could re-enter the critical section at any moment, possibly
> seeing inconsistent state or breaking assumptions made by the original
> entry.

I don't understand how your answer is related to my diff :(

Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU
executing the code already owns the mutex is wrong?

Is that what you call recursion?  Not entering the critical section in
the interrupt instead of blocking the interrupt?

How is that different from not entering the critical section because
another CPU is holding the mutex?



Re: vmd(8) i8042 device implementation questions

2019-06-02 Thread Jasper Lievisse Adriaanse
On Sat, Jun 01, 2019 at 06:12:16PM -0500, Katherine Rohl wrote:
> Couple questions:
> 
> > This means no interrupt will be injected. I'm not sure if that's what you 
> > want.
> > See vm.c: vcpu_exit_inout(..). It looks like you may have manually asserted 
> > the
> > IRQ in this file, which is a bit different than what we do in other 
> > devices. That
> > may be okay, though.
> 
> The device can assert zero, one, or two IRQs depending on the state of the 
> input ports. Are we capable of asserting two IRQs at once through 
> vcpu_exit_i8042?
> 
> > For this IRQ, if it's edge triggered, please assert then deassert the line.
> > The i8259 code should handle that properly. What you have here is a level
> > triggered interrupt (eg, the line will stay asserted until someone
> > does a 1 -> 0 transition below). Same goes for the next few cases.
> 
> Would asserting the IRQs through the exit function handle this for me if 
> that's possible?
> 
> > Also, please bump the revision in the vcpu struct for send/receive
> > as we will be sending a new struct layout now.
> 
> Where exactly? The file revision?
That would be VM_DUMP_VERSION in vmd.h I reckon.

-- 
jasper



[patch] rsync: fix another double close socket descriptor

2019-06-02 Thread Hiltjo Posthuma
Hi,

I noticed that when using openrsync with a remote and an ssh_prog set (-e
option), the socket is also closed twice.

Reproducible with:

ktrace openrsync -e echo -av rsync://127.0.0.1/ftp/ a

kdump:

  70262 openrsync CALL  close(3)
  70262 openrsync RET   close 0
> 70262 openrsync CALL  close(3)
> 70262 openrsync RET   close -1 errno 9 Bad file descriptor
  70262 openrsync CALL  kbind(0x7f7eb390,24,0x26f671b79e307e66)
  70262 openrsync RET   kbind 0
  70262 openrsync CALL  wait4(15180,0x7f7eb468,0<>,0)
  70262 openrsync RET   wait4 15180/0x3b4c


The patch below fixes this, but it has a behaviour change: when the return
value rc is 0, the socket is no longer closed here directly.
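
For illustration, the failure in the kdump above is the classic double
close; a stand-alone example (not openrsync code) shows the same EBADF:

	#include <err.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		int fd = open("/dev/null", O_RDONLY);

		if (fd == -1)
			err(1, "open");
		close(fd);		/* callee already closed it ... */
		if (close(fd) == -1)	/* ... so the caller's close fails */
			warn("close");	/* prints: close: Bad file descriptor */
		return 0;
	}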

I included 2 tiny fixes:
- In blocks.c have_md was initialized to zero twice.
- flist.c has a small typo in a comment.


diff --git usr.bin/rsync/blocks.c usr.bin/rsync/blocks.c
index f6a3575caab..09f09b44e0e 100644
--- usr.bin/rsync/blocks.c
+++ usr.bin/rsync/blocks.c
@@ -55,7 +55,6 @@ blk_find(struct sess *sess, const void *buf, off_t size, 
off_t offs,
assert(remain);
osz = remain < (off_t)blks->len ? remain : (off_t)blks->len;
fhash = hash_fast(buf + offs, (size_t)osz);
-   have_md = 0;
 
/*
 * Start with our match hint.
diff --git usr.bin/rsync/flist.c usr.bin/rsync/flist.c
index f8a6f00a149..447907735af 100644
--- usr.bin/rsync/flist.c
+++ usr.bin/rsync/flist.c
@@ -53,7 +53,7 @@
 #define FLIST_TIME_SAME  0x0080 /* time is repeat */
 
 /*
- * Requied way to sort a filename list.
+ * Required way to sort a filename list.
  */
 static int
 flist_cmp(const void *p1, const void *p2)
diff --git usr.bin/rsync/socket.c usr.bin/rsync/socket.c
index 1a0844f1313..b2cc6b6ac0e 100644
--- usr.bin/rsync/socket.c
+++ usr.bin/rsync/socket.c
@@ -455,6 +455,5 @@ rsync_socket(const struct opts *opts, int sd, const struct 
fargs *f)
rc = 0;
 out:
free(args);
-   close(sd);
return rc;
 }

-- 
Kind regards,
Hiltjo



MSI-X suspend/resume

2019-06-02 Thread Stefan Fritsch
On Fri, May 31, 2019 at 07:35:41AM +1000, Jonathan Matthew wrote:
> I had wondered about the MAU32 define when I was looking at MSI-X
> suspend/resume.

I have also had a go at MSI-X suspend/resume in the last few weeks.
The diff below is what I have come up with. It worked with nvme on a
laptop and on qemu (but I have not tested it since rebasing on kettenis'
32bit fix).

I have also seen that arm64 got MSI-X in the meantime. That needs to be
adjusted, too.

Cheers,
Stefan


index 8a456f72b09..0b99fe8660d 100644
--- a/sys/arch/amd64/include/pci_machdep.h
+++ b/sys/arch/amd64/include/pci_machdep.h
@@ -98,6 +98,8 @@ void  pci_set_powerstate_md(pci_chipset_tag_t, 
pcitag_t, int, int);
 void   pci_mcfg_init(bus_space_tag_t, bus_addr_t, int, int, int);
 pci_chipset_tag_t pci_lookup_segment(int);
 
+#define __HAVE_PCI_MSIX
+
 /*
  * ALL OF THE FOLLOWING ARE MACHINE-DEPENDENT, AND SHOULD NOT BE USED
  * BY PORTABLE CODE.
diff --git a/sys/arch/amd64/pci/pci_machdep.c b/sys/arch/amd64/pci/pci_machdep.c
index 976ef2d3b6b..4dd95aedd21 100644
--- a/sys/arch/amd64/pci/pci_machdep.c
+++ b/sys/arch/amd64/pci/pci_machdep.c
@@ -446,81 +446,85 @@ msix_hwunmask(struct pic *pic, int pin)
 {
 }
 
+int
+pci_msix_table_map(struct pci_msix_table_info *ti)
+{
+   int bir;
+
+   if (!pci_get_capability(ti->pc, ti->tag, PCI_CAP_MSIX, &ti->cap_offset,
+   &ti->cap_reg)) {
+   return 0;
+   }
+   ti->memt = X86_BUS_SPACE_MEM; /* XXX */
+   ti->table = pci_conf_read(ti->pc, ti->tag,
+   ti->cap_offset + PCI_MSIX_TABLE);
+   bir = (ti->table & PCI_MSIX_TABLE_BIR);
+
+   ti->table_offset = (ti->table & PCI_MSIX_TABLE_OFF);
+   ti->table_size = PCI_MSIX_MC_TBLSZ(ti->cap_reg) + 1;
+
+   bir = PCI_MAPREG_START + bir * 4;
+   if (pci_mem_find(ti->pc, ti->tag, bir, &ti->base, NULL, NULL) ||
+   _bus_space_map(ti->memt, ti->base + ti->table_offset,
+   ti->table_size * 16, 0, &ti->memh)) {
+   panic("%s: cannot map registers", __func__);
+   }
+   return 1;
+}
+
+void
+pci_msix_table_unmap(struct pci_msix_table_info *ti)
+{
+   _bus_space_unmap(ti->memt, ti->memh, ti->table_size * 16, NULL);
+}
+
 void
 msix_addroute(struct pic *pic, struct cpu_info *ci, int pin, int vec, int type)
 {
-   pci_chipset_tag_t pc = NULL; /* XXX */
-   bus_space_tag_t memt = X86_BUS_SPACE_MEM; /* XXX */
-   bus_space_handle_t memh;
-   bus_addr_t base;
-   pcitag_t tag = PCI_MSIX_TAG(pin);
+   struct pci_msix_table_info ti = {
+   .pc = NULL, /* XXX */
+   .tag = PCI_MSIX_TAG(pin),
+   };
int entry = PCI_MSIX_VEC(pin);
-   pcireg_t reg, addr, table;
+   pcireg_t addr;
uint32_t ctrl;
-   int bir, offset;
-   int off, tblsz;
 
-   if (pci_get_capability(pc, tag, PCI_CAP_MSIX, &off, &reg) == 0)
+   if (!pci_msix_table_map(&ti))
panic("%s: no msix capability", __func__);
 
 addr = 0xfee00000UL | (ci->ci_apicid << 12);
 
-   table = pci_conf_read(pc, tag, off + PCI_MSIX_TABLE);
-   bir = (table & PCI_MSIX_TABLE_BIR);
-   offset = (table & PCI_MSIX_TABLE_OFF);
-   tblsz = PCI_MSIX_MC_TBLSZ(reg) + 1;
-
-   bir = PCI_MAPREG_START + bir * 4;
-   if (pci_mem_find(pc, tag, bir, &base, NULL, NULL) ||
-   _bus_space_map(memt, base + offset, tblsz * 16, 0, &memh))
-   panic("%s: cannot map registers", __func__);
-
-   bus_space_write_4(memt, memh, PCI_MSIX_MA(entry), addr);
-   bus_space_write_4(memt, memh, PCI_MSIX_MAU32(entry), 0);
-   bus_space_write_4(memt, memh, PCI_MSIX_MD(entry), vec);
-   bus_space_barrier(memt, memh, PCI_MSIX_MA(entry), 16,
+   bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_MA(entry), addr);
+   bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_MAU32(entry), 0);
+   bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_MD(entry), vec);
+   bus_space_barrier(ti.memt, ti.memh, PCI_MSIX_MA(entry), 16,
BUS_SPACE_BARRIER_WRITE);
-   ctrl = bus_space_read_4(memt, memh, PCI_MSIX_VC(entry));
-   bus_space_write_4(memt, memh, PCI_MSIX_VC(entry),
+   ctrl = bus_space_read_4(ti.memt, ti.memh, PCI_MSIX_VC(entry));
+   bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_VC(entry),
ctrl & ~PCI_MSIX_VC_MASK);
-
-   _bus_space_unmap(memt, memh, tblsz * 16, NULL);
-
-   pci_conf_write(pc, tag, off, reg | PCI_MSIX_MC_MSIXE);
+   pci_conf_write(ti.pc, ti.tag, ti.cap_offset,
+   ti.cap_reg | PCI_MSIX_MC_MSIXE);
+   pci_msix_table_unmap(&ti);
 }
 
 void
 msix_delroute(struct pic *pic, struct cpu_info *ci, int pin, int vec, int type)
 {
-   pci_chipset_tag_t pc = NULL; /* XXX */
-   bus_space_tag_t memt = X86_BUS_SPACE_MEM; /* XXX */
-   bus_space_handle_t memh;
-   bus_addr_t base;
-   pcitag_t tag = PCI_MSIX_TAG(pin);
+   struct pci_msix_table_info ti = {
+   .pc = NULL, /* XXX */
+   .tag = PCI_MSIX_TAG(pin),

Re: bfd: Fixup state and flags bitmask

2019-06-02 Thread Claudio Jeker
On Sun, Jun 02, 2019 at 07:12:35PM +1000, Mitchell Krome wrote:
> I copied the defines from the sys/net bfd code to use in tcpdump and
> noticed that I wasn't seeing any poll bits from the remote side, even
> though tcpdump on the remote side showed it was sending them. Turns out
> the bitmask for state and flags is a bit off. Diff below fixes them to
> match what's in the RFC (and the comment a bit higher in the file)
> 
> Mitchell
> 
> 
> diff --git sys/net/bfd.c sys/net/bfd.c
> index 0d2ec8af512..db8d571f35b 100644
> --- sys/net/bfd.c
> +++ sys/net/bfd.c
> @@ -90,8 +90,8 @@ struct bfd_auth_header {
>  #define BFD_VERSION  1   /* RFC 5880 Page 6 */
>  #define BFD_VER(x)   (((x) & 0xe0) >> 5)
>  #define BFD_DIAG(x)  ((x) & 0x1f)
> -#define BFD_STATE(x) (((x) & 0xf0) >> 6)
> -#define BFD_FLAGS(x) ((x) & 0x0f)
> +#define BFD_STATE(x) (((x) & 0xc0) >> 6)
> +#define BFD_FLAGS(x) ((x) & 0x3f)
>  #define BFD_HDRLEN   24  /* RFC 5880 Page 37 */
>  #define BFD_AUTH_SIMPLE_LEN  16 + 3  /* RFC 5880 Page 10 */
>  #define BFD_AUTH_MD5_LEN 24  /* RFC 5880 Page 11 */
> 

OK claudio@

Anyone else want to commit?
-- 
:wq Claudio



bfd: Fixup state and flags bitmask

2019-06-02 Thread Mitchell Krome
I copied the defines from the sys/net bfd code to use in tcpdump and
noticed that I wasn't seeing any poll bits from the remote side, even
though tcpdump on the remote side showed it was sending them. Turns out
the bitmask for state and flags is a bit off. Diff below fixes them to
match what's in the RFC (and the comment a bit higher in the file)

Mitchell


diff --git sys/net/bfd.c sys/net/bfd.c
index 0d2ec8af512..db8d571f35b 100644
--- sys/net/bfd.c
+++ sys/net/bfd.c
@@ -90,8 +90,8 @@ struct bfd_auth_header {
 #define BFD_VERSION  1   /* RFC 5880 Page 6 */
 #define BFD_VER(x) (((x) & 0xe0) >> 5)
 #define BFD_DIAG(x)  ((x) & 0x1f)
-#define BFD_STATE(x)   (((x) & 0xf0) >> 6)
-#define BFD_FLAGS(x)   ((x) & 0x0f)
+#define BFD_STATE(x)   (((x) & 0xc0) >> 6)
+#define BFD_FLAGS(x)   ((x) & 0x3f)
 #define BFD_HDRLEN 24  /* RFC 5880 Page 37 */
 #define BFD_AUTH_SIMPLE_LEN  16 + 3  /* RFC 5880 Page 10 */
 #define BFD_AUTH_MD5_LEN   24  /* RFC 5880 Page 11 */
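
As a quick sanity check (stand-alone program, not part of the diff):
RFC 5880 puts the Poll bit at 0x20 and Final at 0x10, so the old 0x0f
flags mask always dropped them, which matches what tcpdump showed.
With the corrected masks:

	#include <stdio.h>
	#include <stdint.h>

	#define BFD_STATE(x)	(((x) & 0xc0) >> 6)
	#define BFD_FLAGS(x)	((x) & 0x3f)

	int
	main(void)
	{
		uint8_t octet = (3 << 6) | 0x20;	/* state Up, Poll set */

		/* prints "state 3 flags 0x20" */
		printf("state %u flags %#x\n", (unsigned)BFD_STATE(octet),
		    (unsigned)BFD_FLAGS(octet));
		return 0;
	}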



Re: printf.1: fix incorrect conversion of apostrophe

2019-06-02 Thread Anthony J. Bentley
Stephen Gregoratto writes:
> In the escape sequences section of printf.1, the apostrophe
> character is represented using "\e\'".  In UTF-8 mode, mandoc converts
> this to an acute accent. To fix this I explicitly used "\(aq" as per the
> Accents section of mandoc_char(7), although using "\e'" works as well.

You're correct, thanks.
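
For the archives, the relevant mandoc_char(7) detail (from memory, so
double-check the page): in the page source

	\e\(aq

renders as the two output characters \' in every output mode, whereas
the old \e\' turned the second half into an acute accent under UTF-8.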