Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority
> Date: Sat, 1 Jun 2019 18:55:20 -0300
> From: Martin Pieuchot
>
> Diff below exists mainly for documentation and test purposes.  If
> you're not interested about how to break the scheduler internals in
> pieces, don't read further and go straight to testing!

Still digesting this, but.

> - First change is to stop calling tsleep(9) at PUSER.  That makes
>   it clear that all "sleeping priorities" are smaller than PUSER.
>   That's important to understand for the diff below.  `p_priority'
>   is currently a placeholder for the "sleeping priority" and the
>   "runqueue priority".  Both fields are separated by this diff.

Separating out the fields is a good idea.  The current way priorities
are recorded is just confusing.

The use of PUSER vs. PWAIT seems to be fairly arbitrary, so that is
probably not a big issue.  Except maybe for the single-threaded signal
stuff.  Would be good to get guenther@'s thoughts on this bit.

The PUSER -> PWAIT change isn't really necessary, is it?  It just makes
it easier for you to understand what's going on when looking at the
queues.

> - When a thread goes to sleep, the priority argument of tsleep(9) is
>   now recorded in `p_slpprio'.  This argument can be considered as part
>   of the sleep queue.  Its purpose is to place the thread into a higher
>   runqueue when awoken.

Great!

> - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
>   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
>   in the preferred queue without having to use `p_priority'.  Note that
>   `p_usrpri' is still recalculated *after* having called setrunqueue().
>   This is currently fine because setrunnable() is called with SCHED_LOCK()
>   but it will be racy when we'll split it.
>
> - A new field, `p_runprio', has been introduced.  It should be considered
>   as part of the per-CPU runqueues.  It indicates where a current thread
>   is placed.

You made this an uint8_t, whereas the other priority fields are all
u_char.
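To make the split concrete, here is a hedged sketch of the idea being
discussed.  The field names (`p_slpprio', `p_runprio', `p_usrpri') come
from the diff, but the struct and helper below are invented for
illustration and are not the kernel's code:

```c
#include <assert.h>

/*
 * Illustration only: the overloaded `p_priority' disappears, and each
 * remaining priority lives with the data structure it belongs to.
 * Lower numeric values are more urgent, as in the kernel.
 */
struct proc_prio {
	unsigned char p_slpprio;	/* tsleep(9) argument, part of the sleep queue */
	unsigned char p_runprio;	/* slot in the per-CPU runqueue */
	unsigned char p_usrpri;		/* user priority from p_estcpu/nice */
};

/*
 * When a sleeping thread is made runnable, pick the runqueue from the
 * sleeping priority if it is more urgent than the user priority,
 * instead of reading a shared `p_priority' field.
 */
unsigned char
wakeup_runprio(const struct proc_prio *p)
{
	return (p->p_slpprio < p->p_usrpri) ? p->p_slpprio : p->p_usrpri;
}
```

This is only meant to show why the two values can live in separate
fields with separate locking, which is the point of the diff.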
Different names for the same thing, but it probably is a good idea to
keep this consistent for now.

> - `spc_curpriority' is now updated at every context-switch.  That means
>   need_resched() won't be called after comparing an out-of-date value.
>   At the same time, `p_usrpri' is initialized to the highest possible
>   value for idle threads.
>
> - resched_proc() was calling need_resched() in the following conditions:
>   - If the SONPROC thread has a higher priority than the current
>     running thread (itself).
>   - Twice in setrunnable() when we know that p_priority <= p_usrpri.
>   - If schedcpu() considered that a thread, after updating its prio,
>     should preempt the one running on the CPU pointed by `p_cpu'.
>
>   The diff below simplifies all of that by calling need_resched() when:
>   - A thread is inserted in a CPU runqueue at a higher priority than
>     the one SONPROC.
>   - schedcpu() decides that a thread in SRUN state should preempt the
>     one SONPROC.
>
> - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
>   of a thread, are now updated while holding a per-thread mutex.  As a
>   result schedclock() and donice() no longer take the SCHED_LOCK(),
>   and schedcpu() almost never takes it.

Need to look closer at how this works.

> - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
>   when displaying priorities.  This is helpful to understand what's
>   happening:

Do you intend to remove that bit before committing this?
> load averages:  0.99,  0.56,  0.25    two.lab.grenadille.net 23:42:10
> 70 threads: 68 idle, 2 on processor                          up  0:09
> CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
>
>   PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> 81000   145101  72    0    0K 1664K sleep/1   bored      1:15 36.96% softnet
> 47133   244097  73    0 2984K 4408K sleep/1   netio      1:06 35.06% cvs
> 64749   522184  66    0  176K  148K onproc/1  -          0:55 28.81% nfsd
> 21615   602473 127    0    0K 1664K sleep/0   -          7:22  0.00% idle0
> 12413   606242 127    0    0K 1664K sleep/1   -          7:08  0.00% idle1
> 85778   338258  50    0 4936K 7308K idle      select     0:10  0.00% ssh
> 22771   575513  50    0  176K  148K sleep/0   nfsd       0:02  0.00% nfsd
>
> - The removal of `p_priority' and the change that makes mi_switch()
>   always update `spc_curpriority' might introduce some changes in
>   behavior, especially with kernel threads that were not going through
>   tsleep(9).  We currently have some situations where the priority of
>   the running thread isn't correctly reflected.  This diff changes that
>   which means we should be able to better understand where the problems
>   are.
Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority
On 01/06/19(Sat) 18:55, Martin Pieuchot wrote:
> Diff below exists mainly for documentation and test purposes.  If
> you're not interested about how to break the scheduler internals in
> pieces, don't read further and go straight to testing!
>
> - First change is to stop calling tsleep(9) at PUSER.  That makes
>   it clear that all "sleeping priorities" are smaller than PUSER.
>   That's important to understand for the diff below.  `p_priority'
>   is currently a placeholder for the "sleeping priority" and the
>   "runqueue priority".  Both fields are separated by this diff.
>
> - When a thread goes to sleep, the priority argument of tsleep(9) is
>   now recorded in `p_slpprio'.  This argument can be considered as part
>   of the sleep queue.  Its purpose is to place the thread into a higher
>   runqueue when awoken.
>
> - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
>   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
>   in the preferred queue without having to use `p_priority'.  Note that
>   `p_usrpri' is still recalculated *after* having called setrunqueue().
>   This is currently fine because setrunnable() is called with SCHED_LOCK()
>   but it will be racy when we'll split it.
>
> - A new field, `p_runprio', has been introduced.  It should be considered
>   as part of the per-CPU runqueues.  It indicates where a current thread
>   is placed.
>
> - `spc_curpriority' is now updated at every context-switch.  That means
>   need_resched() won't be called after comparing an out-of-date value.
>   At the same time, `p_usrpri' is initialized to the highest possible
>   value for idle threads.
>
> - resched_proc() was calling need_resched() in the following conditions:
>   - If the SONPROC thread has a higher priority than the current
>     running thread (itself).
>   - Twice in setrunnable() when we know that p_priority <= p_usrpri.
>   - If schedcpu() considered that a thread, after updating its prio,
>     should preempt the one running on the CPU pointed by `p_cpu'.
>
>   The diff below simplifies all of that by calling need_resched() when:
>   - A thread is inserted in a CPU runqueue at a higher priority than
>     the one SONPROC.
>   - schedcpu() decides that a thread in SRUN state should preempt the
>     one SONPROC.
>
> - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
>   of a thread, are now updated while holding a per-thread mutex.  As a
>   result schedclock() and donice() no longer take the SCHED_LOCK(),
>   and schedcpu() almost never takes it.
>
> - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
>   when displaying priorities.  This is helpful to understand what's
>   happening:
>
> load averages:  0.99,  0.56,  0.25    two.lab.grenadille.net 23:42:10
> 70 threads: 68 idle, 2 on processor                          up  0:09
> CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
>
>   PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> 81000   145101  72    0    0K 1664K sleep/1   bored      1:15 36.96% softnet
> 47133   244097  73    0 2984K 4408K sleep/1   netio      1:06 35.06% cvs
> 64749   522184  66    0  176K  148K onproc/1  -          0:55 28.81% nfsd
> 21615   602473 127    0    0K 1664K sleep/0   -          7:22  0.00% idle0
> 12413   606242 127    0    0K 1664K sleep/1   -          7:08  0.00% idle1
> 85778   338258  50    0 4936K 7308K idle      select     0:10  0.00% ssh
> 22771   575513  50    0  176K  148K sleep/0   nfsd       0:02  0.00% nfsd
>
> - The removal of `p_priority' and the change that makes mi_switch()
>   always update `spc_curpriority' might introduce some changes in
>   behavior, especially with kernel threads that were not going through
>   tsleep(9).  We currently have some situations where the priority of
>   the running thread isn't correctly reflected.  This diff changes that
>   which means we should be able to better understand where the problems
>   are.
>
> I'd be interested in comments/tests/reviews before continuing in this
> direction.
> Note that at least part of this diff are required to split
> the accounting apart from the SCHED_LOCK() as well.
>
> I'll also work on exporting scheduler statistics unless somebody wants
> to beat me :)

Updated diff to use IPL_SCHED and rebased to apply on top of -current :)

Index: arch/amd64/amd64/genassym.cf
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/genassym.cf,v
retrieving revision 1.40
diff -u -p -r1.40 genassym.cf
--- arch/amd64/amd64/genassym.cf	17 May 2019 19:07:15 -0000	1.40
+++ arch/amd64/amd64/genassym.cf	1 Jun 2019 16:27:46 -0000
@@ -32,7 +32,6 @@ export
Re: mtx_enter_try(9) & recursion
> Date: Sun, 2 Jun 2019 15:39:59 -0300 > From: Martin Pieuchot > Cc: tech@openbsd.org > Content-Type: text/plain; charset=utf-8 > Content-Disposition: inline > > On 02/06/19(Sun) 15:50, Visa Hankala wrote: > > On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote: > > > On 02/06/19(Sun) 04:30, Visa Hankala wrote: > > > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote: > > > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote: > > > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300 > > > > > > > From: Martin Pieuchot > > > > > > > > > > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already > > > > > > > holding the mutex. That means it isn't safe to call that function > > > > > > > in hardclock(9), like with `windup_mtx'. That's why the mutex > > > > > > > needs > > > > > > > to be initialized as IPL_CLOCK. > > > > > > > > > > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9). > > > > > > > That leads me to wonder if I should initialize all mutexes to > > > > > > > IPL_SCHED, > > > > > > > possibly blocking clock interrupts, or if we should change the > > > > > > > mutex API > > > > > > > to allow mtx_enter_try(9) to deal with recursion. > > > > > > > > > > > > > > The diff below removes the recursion check for mtx_enter_try(9). > > > > > > > > > > > > > > Comments? Oks? > > > > > > > > > > > > My initial reaction is that if you're trying to lock when you > > > > > > already > > > > > > have the lock, there is something wrong with your locking strategy > > > > > > and > > > > > > that this is something we don't want. > > > > > > > > > > Could you elaborate? Are you saying that preventing hardclock(9) to > > > > > run > > > > > is the way to move forward to unlock its internals? Why isn't that > > > > > strategy wrong? > > > > > > > > > > In the `windup_mtx' case, does it matter if the mutex is taken by > > > > > another CPU or by myself? What's the problem when CPU0 is one holding > > > > > the lock? 
> > > > > > > > mutex(9) is not and should not become recursive. Recursive locking > > > > works when it is voluntary. If recursion was allowed with interrupts, > > > > the CPU could re-enter the critical section at any moment, possibly > > > > seeing inconsistent state or breaking assumptions made by the original > > > > entry. > > > > > > I don't understand how your answer is related to my diff :( > > > > > > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU > > > executing the code already owns the mutex is wrong? > > > > > > Is that what you call recursion? Not entering the critical section in > > > the interrupt instead of blocking the interrupt? > > > > > > How is that different from not entering the critical section because > > > another CPU is holding the mutex? > > > > To me, it looks very suspicious if a CPU tries to lock a mutex that it > > already holds because it indicates that there is a potential recursion. > > My stance is that such a locking behaviour is a design error in general > > and the current way of checking recursion in mtx_enter_try() is sound. > > But why? Nobody is answering my question :o) You're saying recursion > is bad because there's recursion. I think recursion makes it harder to reason about lock ordering... > How is that different from using mtx_enter_try(9) to protect from > another CPU being in the critical section? Why do you call that a > design error? To me, the use of mtx_enter_try(9) is suspicious to start with. It either means there is some magic going on to avoid a lock ordering problem. Or the code is opportunistically trying to do some work that doesn't necessarily need to be done at this point. Ideally, I'd like to see every mtx_enter_try(9) call annotated with an explanation why it isn't a simple mtx_enter(9). > I don't mind putting all mutexes at IPL_SCHED. I would like to > understand why it matters *who* is holding the mutex. 
Since > mtx_enter_try(9) is here to not enter the critical section for $reasons, > setting an ipl of IPL_CLOCK sounds just like a special corner case. I wonder what $reasons is in the case of tc_ticktock(). I think it is opportunistically trying to avoid spinning on the lock, which is fine since presumably some other thread is already busy doing the work. But I'd argue that the code would still be correct if tc_ticktock() would simply call mtx_enter(). At which point the IPL_CLOCK would be mandatory. > You might understand why I'm asking. We are trying to get more stuff > done in parallel. That implies separating parts of the kernel logically. > Blocking IPL_CLOCK is a big hammer, it block a lot of stuff. So yes it > is easy to use, but my feeling is that this simplicity comes at a cost, > latency wise. I'm not so sure this really matters. In principle you're just delaying the work while you're busy doing some other work. So you're not really doing less work in parallel. You're right about latency
request for testing: bootstrapping time
Hi,

Bootstrapping time is a bit of a hard problem.  The hardest scenario is:
a resolver running in DNSSEC mode on the local machine, on a machine
that does not have a real time clock with battery backup.  Its time will
be wrong, especially on a cold boot.

1. Wrong time, so DNSSEC validation fails -> resolving is not possible
and we do not have IP addresses of ntp servers or constraint sites to
contact.  To fix, switch to CD DNS mode (Checking Disabled): now we have
IPs.  Once time is synced we force a re-resolve to get DNSSEC validated
IP addresses.  I committed these changes recently.

2. Try to get constraints (validation using https is needed since the
ntp protocol itself is not secure) -> https certificate validation will
fail since time is wrong.  So use the time in the HTTP header to
validate the certificate.  This change (by jsing@) was committed a few
months ago.

So we're now at the point to do the last step: automatic setting of time
on boot.  This diff enables automatic mode if we are booting (securelevel
is 0) and at least one constraint source is defined.  We really want to
do constraint validation (see above).  If -s is given as argument, we do
whatever that mode did before, so that behaviour did not change.

We set the time once a couple of constraint validated ntp replies have
been received.  We refuse to set the time backwards (should not happen
since even without a proper RTC the last mount time of the root dir is
used to initialize the clock), or with only a small (< 60s) offset: for
that case regular ntpd operation will correct the time.

To test: make sure you are current, the diff below depends on recent
commits.  Apply diff.  Setup ntpd to run without flags and with your
favorite ntp servers and/or pools, specify at least one constraint url
and see what happens when booting.
To make it more challenging, use unbound on the local machine while
dnssec is enabled, use a machine without RTC clock or set the time wrong
(in the past, since setting the time backwards in automatic mode is not
supported!) and do a cold boot.

Test reports very much appreciated!

	-Otto

Index: client.c
===================================================================
RCS file: /cvs/src/usr.sbin/ntpd/client.c,v
retrieving revision 1.106
diff -u -p -r1.106 client.c
--- client.c	29 May 2019 18:48:33 -0000	1.106
+++ client.c	1 Jun 2019 17:20:42 -0000
@@ -29,6 +29,8 @@
 #include "ntpd.h"
 
 int	client_update(struct ntp_peer *);
+int	auto_cmp(const void *, const void *);
+void	handle_auto(double);
 void	set_deadline(struct ntp_peer *, time_t);
 
 void
@@ -213,7 +215,47 @@ client_query(struct ntp_peer *p)
 }
 
 int
-client_dispatch(struct ntp_peer *p, u_int8_t settime)
+auto_cmp(const void *a, const void *b)
+{
+	double at = *(const double *)a;
+	double bt = *(const double *)b;
+	return at < bt ? -1 : (at > bt ? 1 : 0);
+}
+
+void
+handle_auto(double offset)
+{
+	static int count;
+	static double v[AUTO_REPLIES];
+
+	/*
+	 * It happens the (constraint) resolves initially fail, don't give up
+	 * but see if we get validatd replies later.
+	 */
+	if (conf->constraint_median == 0)
+		return;
+
+	if (offset < AUTO_THRESHOLD) {
+		/* don't bother */
+		priv_settime(0);
+		return;
+	}
+	/* collect some more */
+	v[count++] = offset;
+	if (count < AUTO_REPLIES)
+		return;
+
+	/* we have enough */
+	qsort(v, count, sizeof(double), auto_cmp);
+	if (AUTO_REPLIES % 2 == 0)
+		offset = (v[AUTO_REPLIES / 2 - 1] + v[AUTO_REPLIES / 2]) / 2;
+	else
+		offset = v[AUTO_REPLIES / 2];
+	priv_settime(offset);
+}
+
+int
+client_dispatch(struct ntp_peer *p, u_int8_t settime, u_int8_t automatic)
 {
 	struct ntp_msg msg;
 	struct msghdr	somsg;
@@ -385,7 +427,9 @@ client_dispatch(struct ntp_peer *p, u_in
 	if (p->trustlevel < TRUSTLEVEL_PATHETIC)
 		interval = scale_interval(INTERVAL_QUERY_PATHETIC);
 	else if (p->trustlevel < TRUSTLEVEL_AGGRESSIVE)
-		interval = scale_interval(INTERVAL_QUERY_AGGRESSIVE);
+		interval = (conf->settime && conf->automatic) ?
+		    INTERVAL_QUERY_ULTRA_VIOLENCE :
+		    scale_interval(INTERVAL_QUERY_AGGRESSIVE);
 	else
 		interval = scale_interval(INTERVAL_QUERY_NORMAL);
 
@@ -408,8 +452,12 @@ client_dispatch(struct ntp_peer *p, u_in
 	    (long long)interval);
 
 	client_update(p);
-	if (settime)
-		priv_settime(p->reply[p->shift].offset);
+	if (settime) {
+		if (automatic)
+			handle_auto(p->reply[p->shift].offset);
+		else
+			priv_settime(p->reply[p->shift].offset);
+	}
 
 	if (++p->shift >= OFFSET_ARRAY_SIZE)
 		p->shift =
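For readers skimming the patch, the heart of handle_auto() is simply a
median over the collected offsets: sort, then take the middle element,
or the mean of the two middle elements when the count is even.  A
standalone sketch of just that logic (the function names here are
invented; the real code uses auto_cmp() and the AUTO_REPLIES constant):

```c
#include <assert.h>
#include <stdlib.h>

static int
offset_cmp(const void *a, const void *b)
{
	double at = *(const double *)a, bt = *(const double *)b;
	return at < bt ? -1 : (at > bt ? 1 : 0);
}

/*
 * Median of n collected offsets, as in the diff's handle_auto().
 * Sorts the array in place before picking the middle.
 */
double
offset_median(double *v, int n)
{
	qsort(v, n, sizeof(double), offset_cmp);
	if (n % 2 == 0)
		return (v[n / 2 - 1] + v[n / 2]) / 2;
	return v[n / 2];
}
```

Using the median rather than the mean keeps one wildly wrong reply (for
example from a broken server) from skewing the time that gets set.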
Re: mtx_enter_try(9) & recursion
On 02/06/19(Sun) 15:50, Visa Hankala wrote:
> On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote:
> > On 02/06/19(Sun) 04:30, Visa Hankala wrote:
> > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote:
> > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote:
> > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300
> > > > > > From: Martin Pieuchot
> > > > > >
> > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already
> > > > > > holding the mutex.  That means it isn't safe to call that function
> > > > > > in hardclock(9), like with `windup_mtx'.  That's why the mutex needs
> > > > > > to be initialized as IPL_CLOCK.
> > > > > >
> > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9).
> > > > > > That leads me to wonder if I should initialize all mutexes to
> > > > > > IPL_SCHED, possibly blocking clock interrupts, or if we should
> > > > > > change the mutex API to allow mtx_enter_try(9) to deal with
> > > > > > recursion.
> > > > > >
> > > > > > The diff below removes the recursion check for mtx_enter_try(9).
> > > > > >
> > > > > > Comments?  Oks?
> > > > >
> > > > > My initial reaction is that if you're trying to lock when you already
> > > > > have the lock, there is something wrong with your locking strategy and
> > > > > that this is something we don't want.
> > > >
> > > > Could you elaborate?  Are you saying that preventing hardclock(9) to run
> > > > is the way to move forward to unlock its internals?  Why isn't that
> > > > strategy wrong?
> > > >
> > > > In the `windup_mtx' case, does it matter if the mutex is taken by
> > > > another CPU or by myself?  What's the problem when CPU0 is one holding
> > > > the lock?
> > >
> > > mutex(9) is not and should not become recursive.  Recursive locking
> > > works when it is voluntary.  If recursion was allowed with interrupts,
> > > the CPU could re-enter the critical section at any moment, possibly
> > > seeing inconsistent state or breaking assumptions made by the original
> > > entry.
> >
> > I don't understand how your answer is related to my diff :(
> >
> > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU
> > executing the code already owns the mutex is wrong?
> >
> > Is that what you call recursion?  Not entering the critical section in
> > the interrupt instead of blocking the interrupt?
> >
> > How is that different from not entering the critical section because
> > another CPU is holding the mutex?
>
> To me, it looks very suspicious if a CPU tries to lock a mutex that it
> already holds because it indicates that there is a potential recursion.
> My stance is that such a locking behaviour is a design error in general
> and the current way of checking recursion in mtx_enter_try() is sound.

But why?  Nobody is answering my question :o)  You're saying recursion
is bad because there's recursion.

How is that different from using mtx_enter_try(9) to protect from
another CPU being in the critical section?  Why do you call that a
design error?

I don't mind putting all mutexes at IPL_SCHED.  I would like to
understand why it matters *who* is holding the mutex.  Since
mtx_enter_try(9) is here to not enter the critical section for $reasons,
setting an ipl of IPL_CLOCK sounds just like a special corner case.

You might understand why I'm asking.  We are trying to get more stuff
done in parallel.  That implies separating parts of the kernel
logically.  Blocking IPL_CLOCK is a big hammer, it blocks a lot of
stuff.  So yes it is easy to use, but my feeling is that this simplicity
comes at a cost, latency wise.

Anyway I'll drop this diff.  I'll try to think about your point :o)
Re: mtx_enter_try(9) & recursion
> Date: Sun, 2 Jun 2019 15:50:35 + > From: Visa Hankala > > On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote: > > On 02/06/19(Sun) 04:30, Visa Hankala wrote: > > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote: > > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote: > > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300 > > > > > > From: Martin Pieuchot > > > > > > > > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already > > > > > > holding the mutex. That means it isn't safe to call that function > > > > > > in hardclock(9), like with `windup_mtx'. That's why the mutex needs > > > > > > to be initialized as IPL_CLOCK. > > > > > > > > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9). > > > > > > That leads me to wonder if I should initialize all mutexes to > > > > > > IPL_SCHED, > > > > > > possibly blocking clock interrupts, or if we should change the > > > > > > mutex API > > > > > > to allow mtx_enter_try(9) to deal with recursion. > > > > > > > > > > > > The diff below removes the recursion check for mtx_enter_try(9). > > > > > > > > > > > > Comments? Oks? > > > > > > > > > > My initial reaction is that if you're trying to lock when you already > > > > > have the lock, there is something wrong with your locking strategy and > > > > > that this is something we don't want. > > > > > > > > Could you elaborate? Are you saying that preventing hardclock(9) to run > > > > is the way to move forward to unlock its internals? Why isn't that > > > > strategy wrong? > > > > > > > > In the `windup_mtx' case, does it matter if the mutex is taken by > > > > another CPU or by myself? What's the problem when CPU0 is one holding > > > > the lock? > > > > > > mutex(9) is not and should not become recursive. Recursive locking > > > works when it is voluntary. 
If recursion was allowed with interrupts, > > > the CPU could re-enter the critical section at any moment, possibly > > > seeing inconsistent state or breaking assumptions made by the original > > > entry. > > > > I don't understand how your answer is related to my diff :( > > > > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU > > executing the code already owns the mutex is wrong? > > > > Is that what you call recursion? Not entering the critical section in > > the interrupt instead of blocking the interrupt? > > > > How is that different from not entering the critical section because > > another CPU is holding the mutex? > > To me, it looks very suspicious if a CPU tries to lock a mutex that it > already holds because it indicates that there is a potential recursion. > My stance is that such a locking behaviour is a design error in general > and the current way of checking recursion in mtx_enter_try() is sound. > > If a mutex can be taken on multiple interrupt priority levels, the lock > has to be configured to block the highest level, and mutexes used > in hardclock() are no exception. Agreed. I think relaxing the rules here is unwise. It is much easier to audit the code for use of mutexes in interrupt code with the simple rule that visa has written above. So yes, I think you should have the mutexes block IPL_CLOCK/IPL_STATCLOCK/IPL_SCHED, or whatever else is appropriate. I don't think that will be a problem as SCHED_LOCK already does a splsched().
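The rule being agreed on above can be illustrated with a toy model: a
mutex that may be taken from an interrupt handler must raise the IPL to
at least that handler's level before entering, so the handler cannot
fire on top of its own critical section.  Everything below is a
simplified stand-in (invented names and numeric levels), not the
kernel's splraise()/mtx_enter() implementation:

```c
#include <assert.h>

#define IPL_NONE	0
#define IPL_CLOCK	10	/* stand-in numeric level, not the kernel's */

struct toy_cpu {
	int ipl;		/* current blocking level */
	int pending_clock;	/* clock interrupt waiting to run */
};

struct toy_ipl_mtx {
	int mtx_ipl;		/* level raised to while the mutex is held */
	int saved_ipl;
};

/* Entering raises the IPL first, so a clock interrupt arriving while
 * the mutex is held stays pending instead of re-entering the code. */
void
toy_mtx_enter(struct toy_cpu *c, struct toy_ipl_mtx *m)
{
	m->saved_ipl = c->ipl;
	if (m->mtx_ipl > c->ipl)
		c->ipl = m->mtx_ipl;
}

/* Returns 1 if the interrupt runs now, 0 if it is blocked. */
int
toy_clock_intr(struct toy_cpu *c)
{
	if (c->ipl >= IPL_CLOCK) {
		c->pending_clock = 1;	/* delivered later, at splx() time */
		return 0;
	}
	return 1;
}

void
toy_mtx_leave(struct toy_cpu *c, struct toy_ipl_mtx *m)
{
	c->ipl = m->saved_ipl;
}
```

With this model, a mutex initialized below IPL_CLOCK but taken inside
hardclock() would let the interrupt re-enter its own critical section,
which is exactly the recursion case the thread is arguing about.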
Re: mtx_enter_try(9) & recursion
On Sun, Jun 02, 2019 at 10:29:01AM -0300, Martin Pieuchot wrote: > On 02/06/19(Sun) 04:30, Visa Hankala wrote: > > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote: > > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote: > > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300 > > > > > From: Martin Pieuchot > > > > > > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already > > > > > holding the mutex. That means it isn't safe to call that function > > > > > in hardclock(9), like with `windup_mtx'. That's why the mutex needs > > > > > to be initialized as IPL_CLOCK. > > > > > > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9). > > > > > That leads me to wonder if I should initialize all mutexes to > > > > > IPL_SCHED, > > > > > possibly blocking clock interrupts, or if we should change the mutex > > > > > API > > > > > to allow mtx_enter_try(9) to deal with recursion. > > > > > > > > > > The diff below removes the recursion check for mtx_enter_try(9). > > > > > > > > > > Comments? Oks? > > > > > > > > My initial reaction is that if you're trying to lock when you already > > > > have the lock, there is something wrong with your locking strategy and > > > > that this is something we don't want. > > > > > > Could you elaborate? Are you saying that preventing hardclock(9) to run > > > is the way to move forward to unlock its internals? Why isn't that > > > strategy wrong? > > > > > > In the `windup_mtx' case, does it matter if the mutex is taken by > > > another CPU or by myself? What's the problem when CPU0 is one holding > > > the lock? > > > > mutex(9) is not and should not become recursive. Recursive locking > > works when it is voluntary. If recursion was allowed with interrupts, > > the CPU could re-enter the critical section at any moment, possibly > > seeing inconsistent state or breaking assumptions made by the original > > entry. 
> > I don't understand how your answer is related to my diff :( > > Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU > executing the code already owns the mutex is wrong? > > Is that what you call recursion? Not entering the critical section in > the interrupt instead of blocking the interrupt? > > How is that different from not entering the critical section because > another CPU is holding the mutex? To me, it looks very suspicious if a CPU tries to lock a mutex that it already holds because it indicates that there is a potential recursion. My stance is that such a locking behaviour is a design error in general and the current way of checking recursion in mtx_enter_try() is sound. If a mutex can be taken on multiple interrupt priority levels, the lock has to be configured to block the highest level, and mutexes used in hardclock() are no exception.
Re: mtx_enter_try(9) & recursion
On 02/06/19(Sun) 04:30, Visa Hankala wrote: > On Sat, Jun 01, 2019 at 07:04:23PM -0300, Martin Pieuchot wrote: > > On 01/06/19(Sat) 23:22, Mark Kettenis wrote: > > > > Date: Sat, 1 Jun 2019 17:32:52 -0300 > > > > From: Martin Pieuchot > > > > > > > > Currently it isn't safe to call mtx_enter_try(9) if you're already > > > > holding the mutex. That means it isn't safe to call that function > > > > in hardclock(9), like with `windup_mtx'. That's why the mutex needs > > > > to be initialized as IPL_CLOCK. > > > > > > > > I'm working on removing the SCHED_LOCK() from inside hardclock(9). > > > > That leads me to wonder if I should initialize all mutexes to IPL_SCHED, > > > > possibly blocking clock interrupts, or if we should change the mutex API > > > > to allow mtx_enter_try(9) to deal with recursion. > > > > > > > > The diff below removes the recursion check for mtx_enter_try(9). > > > > > > > > Comments? Oks? > > > > > > My initial reaction is that if you're trying to lock when you already > > > have the lock, there is something wrong with your locking strategy and > > > that this is something we don't want. > > > > Could you elaborate? Are you saying that preventing hardclock(9) to run > > is the way to move forward to unlock its internals? Why isn't that > > strategy wrong? > > > > In the `windup_mtx' case, does it matter if the mutex is taken by > > another CPU or by myself? What's the problem when CPU0 is one holding > > the lock? > > mutex(9) is not and should not become recursive. Recursive locking > works when it is voluntary. If recursion was allowed with interrupts, > the CPU could re-enter the critical section at any moment, possibly > seeing inconsistent state or breaking assumptions made by the original > entry. I don't understand how your answer is related to my diff :( Why allowing mtx_enter_try(9) to fail and not to panic(9) when the CPU executing the code already owns the mutex is wrong? Is that what you call recursion? 
Not entering the critical section in the interrupt instead of blocking the interrupt? How is that different from not entering the critical section because another CPU is holding the mutex?
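The distinction the thread keeps circling can be made concrete with a
toy model of the try-lock.  This is an illustration only (simplified
structure and invented names, not the kernel's mtx code): the question
is whether the "owner is myself" case should be treated like the
"owner is another CPU" case (just fail) or as a bug (panic):

```c
#include <assert.h>
#include <stddef.h>

struct toy_mtx {
	void *mtx_owner;	/* NULL when unlocked */
};

enum try_result { TRY_ACQUIRED, TRY_BUSY, TRY_RECURSED };

/*
 * Current kernel behaviour distinguishes the two failure cases:
 * another CPU holding the mutex is a normal try-lock failure, while
 * the calling CPU already holding it is treated as a design error
 * (the kernel panics where this toy returns TRY_RECURSED).
 */
enum try_result
toy_mtx_enter_try(struct toy_mtx *m, void *self)
{
	if (m->mtx_owner == self)
		return TRY_RECURSED;
	if (m->mtx_owner != NULL)
		return TRY_BUSY;
	m->mtx_owner = self;
	return TRY_ACQUIRED;
}
```

The diff under discussion would collapse TRY_RECURSED into TRY_BUSY;
the objection is that keeping them distinct makes accidental recursion
from interrupt context detectable instead of silently tolerated.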
Re: vmd(8) i8042 device implementation questions
On Sat, Jun 01, 2019 at 06:12:16PM -0500, Katherine Rohl wrote:
> Couple questions:
>
> > This means no interrupt will be injected.  I'm not sure if that's what
> > you want.  See vm.c: vcpu_exit_inout(..).  It looks like you may have
> > manually asserted the IRQ in this file, which is a bit different than
> > what we do in other devices.  That may be okay, though.
>
> The device can assert zero, one, or two IRQs depending on the state of
> the input ports.  Are we capable of asserting two IRQs at once through
> vcpu_exit_i8042?
>
> > For this IRQ, if it's edge triggered, please assert then deassert the
> > line.  The i8259 code should handle that properly.  What you have here
> > is a level triggered interrupt (eg, the line will stay asserted until
> > someone does a 1 -> 0 transition below).  Same goes for the next few
> > cases.
>
> Would asserting the IRQs through the exit function handle this for me
> if that's possible?
>
> > Also, please bump the revision in the vcpu struct for send/receive
> > as we will be sending a new struct layout now.
>
> Where exactly?  The file revision?

That would be VM_DUMP_VERSION in vmd.h I reckon.

-- 
jasper
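The assert-then-deassert advice quoted above boils down to the
difference between pulsing a line and parking it high.  Here is a toy
IRQ line to illustrate (invented names; the real vmd code drives its
i8259 model instead, and this sketch makes no claim about that API):

```c
#include <assert.h>

struct toy_irq_line {
	int level;	/* current line state */
	int edges;	/* rising edges seen (interrupts delivered) */
};

void
line_set(struct toy_irq_line *l, int v)
{
	if (v && !l->level)
		l->edges++;	/* 0 -> 1 transition delivers the interrupt */
	l->level = v;
}

/* Edge-triggered delivery: pulse the line, leaving it deasserted
 * so the next event produces a fresh edge. */
void
irq_pulse(struct toy_irq_line *l)
{
	line_set(l, 1);
	line_set(l, 0);
}
```

Asserting without deasserting (level behaviour) means a second event
produces no new edge until something performs the 1 -> 0 transition,
which is exactly the problem the review comment points at.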
[patch] rsync: fix another double close socket descriptor
Hi,

I noticed when using openrsync with a remote and a ssh_prog set (-e
option) the socket is closed twice also.

Reproducible with:

	ktrace openrsync -e echo -av rsync://127.0.0.1/ftp/ a

kdump:

   70262 openrsync CALL  close(3)
   70262 openrsync RET   close 0
>  70262 openrsync CALL  close(3)
>  70262 openrsync RET   close -1 errno 9 Bad file descriptor
   70262 openrsync CALL  kbind(0x7f7eb390,24,0x26f671b79e307e66)
   70262 openrsync RET   kbind 0
   70262 openrsync CALL  wait4(15180,0x7f7eb468,0<>,0)
   70262 openrsync RET   wait4 15180/0x3b4c

The patch below fixes this, but has a behaviour change: when the return
value rc = 0 the socket is not closed anymore directly.

I included 2 tiny fixes:

- In blocks.c have_md was initialized to zero twice.
- flist.c has a small typo in a comment.

diff --git usr.bin/rsync/blocks.c usr.bin/rsync/blocks.c
index f6a3575caab..09f09b44e0e 100644
--- usr.bin/rsync/blocks.c
+++ usr.bin/rsync/blocks.c
@@ -55,7 +55,6 @@ blk_find(struct sess *sess, const void *buf, off_t size, off_t offs,
 	assert(remain);
 	osz = remain < (off_t)blks->len ? remain : (off_t)blks->len;
 	fhash = hash_fast(buf + offs, (size_t)osz);
-	have_md = 0;
 
 	/*
 	 * Start with our match hint.
diff --git usr.bin/rsync/flist.c usr.bin/rsync/flist.c
index f8a6f00a149..447907735af 100644
--- usr.bin/rsync/flist.c
+++ usr.bin/rsync/flist.c
@@ -53,7 +53,7 @@
 #define FLIST_TIME_SAME	 0x0080	/* time is repeat */
 
 /*
- * Requied way to sort a filename list.
+ * Required way to sort a filename list.
  */
 static int
 flist_cmp(const void *p1, const void *p2)
diff --git usr.bin/rsync/socket.c usr.bin/rsync/socket.c
index 1a0844f1313..b2cc6b6ac0e 100644
--- usr.bin/rsync/socket.c
+++ usr.bin/rsync/socket.c
@@ -455,6 +455,5 @@ rsync_socket(const struct opts *opts, int sd, const struct fargs *f)
 	rc = 0;
 out:
 	free(args);
-	close(sd);
 	return rc;
 }

-- 
Kind regards,
Hiltjo
MSI-X suspend/resume
On Fri, May 31, 2019 at 07:35:41AM +1000, Jonathan Matthew wrote:
> I had wondered about the MAU32 define when I was looking at MSI-X
> suspend/resume.

I have also taken a try at MSI-X suspend/resume in the last few weeks.
The diff below is what I have come up with. It worked with nvme on a
laptop and on qemu (but I have not tested it since rebasing on kettenis'
32bit fix).

I have also seen that arm64 got MSI-X in the meantime. That needs to be
adjusted, too.

Cheers,
Stefan

diff --git a/sys/arch/amd64/include/pci_machdep.h b/sys/arch/amd64/include/pci_machdep.h
index 8a456f72b09..0b99fe8660d 100644
--- a/sys/arch/amd64/include/pci_machdep.h
+++ b/sys/arch/amd64/include/pci_machdep.h
@@ -98,6 +98,8 @@ void pci_set_powerstate_md(pci_chipset_tag_t, pcitag_t, int, int);
 void pci_mcfg_init(bus_space_tag_t, bus_addr_t, int, int, int);
 pci_chipset_tag_t pci_lookup_segment(int);
 
+#define __HAVE_PCI_MSIX
+
 /*
  * ALL OF THE FOLLOWING ARE MACHINE-DEPENDENT, AND SHOULD NOT BE USED
  * BY PORTABLE CODE.
diff --git a/sys/arch/amd64/pci/pci_machdep.c b/sys/arch/amd64/pci/pci_machdep.c
index 976ef2d3b6b..4dd95aedd21 100644
--- a/sys/arch/amd64/pci/pci_machdep.c
+++ b/sys/arch/amd64/pci/pci_machdep.c
@@ -446,81 +446,85 @@ msix_hwunmask(struct pic *pic, int pin)
 {
 }
 
+int
+pci_msix_table_map(struct pci_msix_table_info *ti)
+{
+	int bir;
+
+	if (!pci_get_capability(ti->pc, ti->tag, PCI_CAP_MSIX, &ti->cap_offset,
+	    &ti->cap_reg)) {
+		return 0;
+	}
+	ti->memt = X86_BUS_SPACE_MEM;	/* XXX */
+	ti->table = pci_conf_read(ti->pc, ti->tag,
+	    ti->cap_offset + PCI_MSIX_TABLE);
+	bir = (ti->table & PCI_MSIX_TABLE_BIR);
+
+	ti->table_offset = (ti->table & PCI_MSIX_TABLE_OFF);
+	ti->table_size = PCI_MSIX_MC_TBLSZ(ti->cap_reg) + 1;
+
+	bir = PCI_MAPREG_START + bir * 4;
+	if (pci_mem_find(ti->pc, ti->tag, bir, &ti->base, NULL, NULL) ||
+	    _bus_space_map(ti->memt, ti->base + ti->table_offset,
+	    ti->table_size * 16, 0, &ti->memh)) {
+		panic("%s: cannot map registers", __func__);
+	}
+	return 1;
+}
+
+void
+pci_msix_table_unmap(struct pci_msix_table_info *ti)
+{
+	_bus_space_unmap(ti->memt, ti->memh, ti->table_size * 16, NULL);
+}
+
 void
 msix_addroute(struct pic *pic, struct cpu_info *ci, int pin, int vec,
     int type)
 {
-	pci_chipset_tag_t pc = NULL; /* XXX */
-	bus_space_tag_t memt = X86_BUS_SPACE_MEM; /* XXX */
-	bus_space_handle_t memh;
-	bus_addr_t base;
-	pcitag_t tag = PCI_MSIX_TAG(pin);
+	struct pci_msix_table_info ti = {
+		.pc = NULL,	/* XXX */
+		.tag = PCI_MSIX_TAG(pin),
+	};
 	int entry = PCI_MSIX_VEC(pin);
-	pcireg_t reg, addr, table;
+	pcireg_t addr;
 	uint32_t ctrl;
-	int bir, offset;
-	int off, tblsz;
 
-	if (pci_get_capability(pc, tag, PCI_CAP_MSIX, &off, &reg) == 0)
+	if (!pci_msix_table_map(&ti))
 		panic("%s: no msix capability", __func__);
 
 	addr = 0xfee00000UL | (ci->ci_apicid << 12);
 
-	table = pci_conf_read(pc, tag, off + PCI_MSIX_TABLE);
-	bir = (table & PCI_MSIX_TABLE_BIR);
-	offset = (table & PCI_MSIX_TABLE_OFF);
-	tblsz = PCI_MSIX_MC_TBLSZ(reg) + 1;
-
-	bir = PCI_MAPREG_START + bir * 4;
-	if (pci_mem_find(pc, tag, bir, &base, NULL, NULL) ||
-	    _bus_space_map(memt, base + offset, tblsz * 16, 0, &memh))
-		panic("%s: cannot map registers", __func__);
-
-	bus_space_write_4(memt, memh, PCI_MSIX_MA(entry), addr);
-	bus_space_write_4(memt, memh, PCI_MSIX_MAU32(entry), 0);
-	bus_space_write_4(memt, memh, PCI_MSIX_MD(entry), vec);
-	bus_space_barrier(memt, memh, PCI_MSIX_MA(entry), 16,
+	bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_MA(entry), addr);
+	bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_MAU32(entry), 0);
+	bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_MD(entry), vec);
+	bus_space_barrier(ti.memt, ti.memh, PCI_MSIX_MA(entry), 16,
 	    BUS_SPACE_BARRIER_WRITE);
-	ctrl = bus_space_read_4(memt, memh, PCI_MSIX_VC(entry));
-	bus_space_write_4(memt, memh, PCI_MSIX_VC(entry),
+	ctrl = bus_space_read_4(ti.memt, ti.memh, PCI_MSIX_VC(entry));
+	bus_space_write_4(ti.memt, ti.memh, PCI_MSIX_VC(entry),
 	    ctrl & ~PCI_MSIX_VC_MASK);
-
-	_bus_space_unmap(memt, memh, tblsz * 16, NULL);
-
-	pci_conf_write(pc, tag, off, reg | PCI_MSIX_MC_MSIXE);
+	pci_conf_write(ti.pc, ti.tag, ti.cap_offset,
+	    ti.cap_reg | PCI_MSIX_MC_MSIXE);
+	pci_msix_table_unmap(&ti);
 }
 
 void
 msix_delroute(struct pic *pic, struct cpu_info *ci, int pin, int vec,
     int type)
 {
-	pci_chipset_tag_t pc = NULL; /* XXX */
-	bus_space_tag_t memt = X86_BUS_SPACE_MEM; /* XXX */
-	bus_space_handle_t memh;
-	bus_addr_t base;
-	pcitag_t tag = PCI_MSIX_TAG(pin);
+	struct pci_msix_table_info ti = {
+		.pc = NULL,	/* XXX */
+		.tag = PCI_MSIX_TAG(pin),
Re: bfd: Fixup state and flags bitmask
On Sun, Jun 02, 2019 at 07:12:35PM +1000, Mitchell Krome wrote:
> I copied the defines from the sys/net bfd code to use in tcpdump and
> noticed that I wasn't seeing any poll bits from the remote side, even
> though tcpdump on the remote side showed it was sending them. Turns out
> the bitmask for state and flags is a bit off. Diff below fixes them to
> match what's in the RFC (and the comment a bit higher in the file)
> 
> Mitchell
> 
> 
> diff --git sys/net/bfd.c sys/net/bfd.c
> index 0d2ec8af512..db8d571f35b 100644
> --- sys/net/bfd.c
> +++ sys/net/bfd.c
> @@ -90,8 +90,8 @@ struct bfd_auth_header {
>  #define BFD_VERSION		1	/* RFC 5880 Page 6 */
>  #define BFD_VER(x)		(((x) & 0xe0) >> 5)
>  #define BFD_DIAG(x)		((x) & 0x1f)
> -#define BFD_STATE(x)		(((x) & 0xf0) >> 6)
> -#define BFD_FLAGS(x)		((x) & 0x0f)
> +#define BFD_STATE(x)		(((x) & 0xc0) >> 6)
> +#define BFD_FLAGS(x)		((x) & 0x3f)
>  #define BFD_HDRLEN		24	/* RFC 5880 Page 37 */
>  #define BFD_AUTH_SIMPLE_LEN	16 + 3	/* RFC 5880 Page 10 */
>  #define BFD_AUTH_MD5_LEN	24	/* RFC 5880 Page 11 */
> 

OK claudio@
Anyone else wants to commit?

-- 
:wq Claudio
bfd: Fixup state and flags bitmask
I copied the defines from the sys/net bfd code to use in tcpdump and
noticed that I wasn't seeing any poll bits from the remote side, even
though tcpdump on the remote side showed it was sending them. Turns out
the bitmask for state and flags is a bit off. Diff below fixes them to
match what's in the RFC (and the comment a bit higher in the file)

Mitchell


diff --git sys/net/bfd.c sys/net/bfd.c
index 0d2ec8af512..db8d571f35b 100644
--- sys/net/bfd.c
+++ sys/net/bfd.c
@@ -90,8 +90,8 @@ struct bfd_auth_header {
 #define BFD_VERSION		1	/* RFC 5880 Page 6 */
 #define BFD_VER(x)		(((x) & 0xe0) >> 5)
 #define BFD_DIAG(x)		((x) & 0x1f)
-#define BFD_STATE(x)		(((x) & 0xf0) >> 6)
-#define BFD_FLAGS(x)		((x) & 0x0f)
+#define BFD_STATE(x)		(((x) & 0xc0) >> 6)
+#define BFD_FLAGS(x)		((x) & 0x3f)
 #define BFD_HDRLEN		24	/* RFC 5880 Page 37 */
 #define BFD_AUTH_SIMPLE_LEN	16 + 3	/* RFC 5880 Page 10 */
 #define BFD_AUTH_MD5_LEN	24	/* RFC 5880 Page 11 */
Re: printf.1: fix incorrect conversion of apostrophe
Stephen Gregoratto writes:
> In the escape sequences section of printf.1, the apostrophe
> character is represented using "\e\'". In UTF-8 mode, mandoc converts
> this to an acute accent. To fix this I explicitly used "\(aq" as per the
> Accents section of mandoc_char(7), although using "\e'" works as well.

You're correct, thanks.