Re: [PATCH v3] sched/cputime: Fix using smp_processor_id() in preemptible
On Wed, Jul 19, 2017 at 4:38 PM, Paul McKenney wrote: > On Thu, Jul 13, 2017 at 11:49 PM, Wanpeng Li wrote: >> >> Ping for the merge window. :) >> 2017-07-09 15:40 GMT+08:00 Wanpeng Li : >> > From: Wanpeng Li >> > >> > BUG: using smp_processor_id() in preemptible [] code: >> > 99-trinity/181 >> > caller is debug_smp_processor_id+0x17/0x19 >> > CPU: 0 PID: 181 Comm: 99-trinity Not tainted 4.12.0-01059-g2a42eb9 #1 >> > Call Trace: >> > dump_stack+0x82/0xb8 >> > check_preemption_disabled+0xd1/0xe3 >> > debug_smp_processor_id+0x17/0x19 >> > vtime_delta+0xd/0x2c >> > task_cputime+0x89/0xdb >> > thread_group_cputime+0x11b/0x1ed >> > thread_group_cputime_adjusted+0x1f/0x47 >> > wait_consider_task+0x2a9/0xaf9 >> > ? lock_acquire+0x97/0xa4 >> > do_wait+0xdf/0x1f4 >> > SYSC_wait4+0x8e/0xb5 >> > ? list_add+0x34/0x34 >> > SyS_wait4+0x9/0xb >> > do_syscall_64+0x70/0x82 >> > entry_SYSCALL64_slow_path+0x25/0x25 >> > >> > As Frederic pointed out: >> > >> > | Although those sched_clock_cpu() things seem to only matter when the >> > | sched_clock() is unstable. And that stability is a condition for >> > nohz_full >> > | to work anyway. So probably sched_clock() alone would be enough. >> > >> > This patch fixes it by replacing sched_clock_cpu() by sched_clock() to >> > avoid to call smp_processor_id() in preemptible context. > > > I am hitting this with rcutorture, so have kicked off an overnight > run with this patch. Which passed, so: Tested-by: Paul E. 
McKenney Thanx, Paul >> > Reported-by: Xiaolong Ye >> > Cc: Thomas Gleixner >> > Cc: Luiz Capitulino >> > Cc: Frederic Weisbecker >> > Cc: Rik van Riel >> > Cc: Peter Zijlstra >> > Cc: Ingo Molnar >> > Signed-off-by: Wanpeng Li >> > --- >> > kernel/sched/cputime.c | 6 +++--- >> > 1 file changed, 3 insertions(+), 3 deletions(-) >> > >> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c >> > index 6e3ea4a..14d2dbf 100644 >> > --- a/kernel/sched/cputime.c >> > +++ b/kernel/sched/cputime.c >> > @@ -683,7 +683,7 @@ static u64 vtime_delta(struct vtime *vtime) >> > { >> > unsigned long long clock; >> > >> > - clock = sched_clock_cpu(smp_processor_id()); >> > + clock = sched_clock(); >> > if (clock < vtime->starttime) >> > return 0; >> > >> > @@ -814,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct >> > *prev) >> > >> > write_seqcount_begin(&vtime->seqcount); >> > vtime->state = VTIME_SYS; >> > - vtime->starttime = sched_clock_cpu(smp_processor_id()); >> > + vtime->starttime = sched_clock(); >> > write_seqcount_end(&vtime->seqcount); >> > } >> > >> > @@ -826,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu) >> > local_irq_save(flags); >> > write_seqcount_begin(&vtime->seqcount); >> > vtime->state = VTIME_SYS; >> > - vtime->starttime = sched_clock_cpu(cpu); >> > + vtime->starttime = sched_clock(); >> > write_seqcount_end(&vtime->seqcount); >> > local_irq_restore(flags); >> > } >> > -- >> > 2.7.4 >> > > >
Re: [PATCH 07/10] rcu: Separate the RCU synchronization types and APIs into
On Wed, Feb 8, 2017 at 10:34 AM, Ingo Molnar wrote: > So rcupdate.h is a pretty complex header, in particular it includes > which includes - creating a > dependency that includes in , > which prevents the isolation of from the derived > header. > > Solve part of the problem by decoupling rcupdate.h from completions: > this can be done by separating out the rcu_synchronize types and APIs, > and updating their usage sites. > > Since this is a mostly RCU-internal types this will not just simplify > 's dependencies, but will make all the hundreds of > .c files that include rcupdate.h but not completions or wait.h build > faster. Indeed, rcupdate.h is overdue for a more sweeping overhaul. > ( For rcutiny this means that two dependent APIs have to be uninlined, > but that shouldn't be much of a problem as they are rare variants. ) Do people still care about Tiny kernel? If so, I can queue a patch that leaves rcu_barrier_bh() and rcu_barrier_sched() inline, but creates an rcu_barrier_generic() or some such, so that there is space taken up by only one EXPORT_SYMBOL() instead of two. (0day Test Robot yells at me every time I add one...) Other than that, I don't see any problems with this. I will do some testing. Thanx, Paul > Cc: Linus Torvalds > Cc: Mike Galbraith > Cc: Paul E. 
McKenney > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: linux-kernel@vger.kernel.org > Signed-off-by: Ingo Molnar > --- > fs/autofs4/autofs_i.h | 1 + > include/linux/dcache.h| 1 + > include/linux/rcupdate.h | 40 > include/linux/rcupdate_wait.h | 50 > ++ > include/linux/rcutiny.h | 11 ++- > kernel/rcu/srcu.c | 2 +- > kernel/rcu/tiny.c | 14 +- > kernel/rcu/tree.c | 2 +- > kernel/rcu/update.c | 1 + > kernel/sched/core.c | 1 + > 10 files changed, 71 insertions(+), 52 deletions(-) > > diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h > index c885daae68c8..beef981aa54f 100644 > --- a/fs/autofs4/autofs_i.h > +++ b/fs/autofs4/autofs_i.h > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > > /* This is the range of ioctl() numbers we claim as ours */ > #define AUTOFS_IOC_FIRST AUTOFS_IOC_READY > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index c965e4469499..16948defb568 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > > struct path; > struct vfsmount; > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 01f71e1d2e94..1f476b63c596 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -40,7 +40,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -226,45 +225,6 @@ void call_rcu_sched(struct rcu_head *head, > > void synchronize_sched(void); > > -/* > - * Structure allowing asynchronous waiting on RCU. > - */ > -struct rcu_synchronize { > - struct rcu_head head; > - struct completion completion; > -}; > -void wakeme_after_rcu(struct rcu_head *head);> - > -void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array, > - struct rcu_synchronize *rs_array); > - > -#define _wait_rcu_gp(checktiny, ...) 
\ > -do { \ > - call_rcu_func_t __crcu_array[] = { __VA_ARGS__ }; \ > - struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];\ > - __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array), \ > - __crcu_array, __rs_array); \ > -} while (0) > - > -#define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__) > - > -/** > - * synchronize_rcu_mult - Wait concurrently for multiple grace periods > - * @...: List of call_rcu() functions for the flavors to wait on. > - * > - * This macro waits concurrently for multiple flavors of RCU grace periods. > - * For example, synchronize_rcu_mult(call_rcu, call_rcu_bh) would wait > - * on concurrent RCU and RCU-bh grace periods. Waiting on a give SRCU > - * domain requires you to write a wrapper function for that SRCU domain's > - * call_srcu() function, supplying the corresponding srcu_struct. > - * > - * If Tiny RCU, tell _wait_rcu_gp() not to bother waiting for RCU > - * or RCU-bh, given that anywhere synchronize_rcu_mult() can be called > - * is automatically a grace period. > - */ > -#define synchronize_rcu_mult(...) \ > - _wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__) > - > /** > * call_rcu_tasks() - Queue an RCU for invocation task-based grace period > * @head: structure to be used for queueing the RCU updates. > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_w
Re: linux-next: build failure after merge of the rcu tree
On Wed, Jan 18, 2017 at 7:34 PM, Stephen Rothwell wrote: > Hi Paul, > > After merging the rcu tree, today's linux-next build (x86_64 allmodconfig) > failed like this: > > net/smc/af_smc.c:102:16: error: 'SLAB_DESTROY_BY_RCU' undeclared here (not in > a function) > .slab_flags = SLAB_DESTROY_BY_RCU, > ^ > > Caused by commit > > c7a545924ca1 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU") > > interacting with commit > > ac7138746e14 ("smc: establish new socket family") > > from the net-next tree. > > I have applied the following merge fix patch (someone will need to > remember to mention this to Linus): Thank you, Stephen! I expect that there might be a bit more bikeshedding on the name, but here is hoping... :-/ Thanx, Paul > From: Stephen Rothwell > Date: Thu, 19 Jan 2017 14:29:12 +1100 > Subject: [PATCH] smc: merge fix for "mm: Rename SLAB_DESTROY_BY_RCU to > SLAB_TYPESAFE_BY_RCU" > > Signed-off-by: Stephen Rothwell > --- > net/smc/af_smc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c > index 4875e65f0c4a..a48260f9ebb7 100644 > --- a/net/smc/af_smc.c > +++ b/net/smc/af_smc.c > @@ -99,7 +99,7 @@ struct proto smc_proto = { > .unhash = smc_unhash_sk, > .obj_size = sizeof(struct smc_sock), > .h.smc_hash = &smc_v4_hashinfo, > - .slab_flags = SLAB_DESTROY_BY_RCU, > + .slab_flags = SLAB_TYPESAFE_BY_RCU, > }; > EXPORT_SYMBOL_GPL(smc_proto); > > -- > 2.10.2 > > -- > Cheers, > Stephen Rothwell
Re: kvm: use-after-free in process_srcu
(Trouble with VPN, so replying from gmail.) On Thu, Jan 19, 2017 at 1:27 AM, Paolo Bonzini wrote: > > > On 18/01/2017 23:15, Paul E. McKenney wrote: >> On Wed, Jan 18, 2017 at 09:53:19AM +0100, Paolo Bonzini wrote: >>> >>> >>> On 17/01/2017 21:34, Paul E. McKenney wrote: Do any of your callback functions invoke call_srcu()? (Hey, I have to ask!) >>> >>> No, we only use synchronize_srcu and synchronize_srcu_expedited, so our >>> only callback comes from there. >> >> OK, so the next question is whether your code makes sure that all of its >> synchronize_srcu() and synchronize_srcu_expedited() calls return before >> the call to cleanup_srcu_struct(). > > It certainly should! Or at least that would be our bug. > >> You should only need srcu_barrier() if there were calls to call_srcu(). >> Given that you only have synchronize_srcu() and synchronize_srcu_expedited(), >> you -don't- need srcu_barrier(). What you need instead is to make sure >> that all synchronize_srcu() and synchronize_srcu_expedited() have >> returned before the call to cleanup_srcu_struct(). > > Ok, good. > >>> If this is incorrect, then one flush_delayed_work is enough. If it is >>> correct, the possible alternatives are: >>> >>> * srcu_barrier in the caller, flush_delayed_work+WARN_ON(sp->running) in >>> cleanup_srcu_struct. I strongly dislike this one---because we don't use >>> call_srcu at all, there should be no reason to use srcu_barrier in KVM >>> code. Plus I think all other users have the same issue. >>> >>> * srcu_barrier+flush_delayed_work+WARN_ON(sp->running) in >>> cleanup_srcu_struct >>> >>> * flush_delayed_work+flush_delayed_work+WARN_ON(sp->running) in >>> cleanup_srcu_struct >>> >>> * while(flush_delayed_work) in cleanup_srcu_struct >>> >>> * "while(sp->running) flush_delayed_work" in cleanup_srcu_struct >> >> My current thought is flush_delayed_work() followed by a warning if >> there are any callbacks still posted, and also as you say sp->running. 
> > Yes, that would work for KVM and anyone else who doesn't use call_srcu > (and order synchronize_srcu correctly against destruction). > > On the other hand, users of call_srcu, such as rcutorture, _do_ need to > place an srcu_barrier before cleanup_srcu_struct, or they need two > flush_delayed_work() calls back to back in cleanup_srcu_struct. Do you > agree? The reason I am resisting the notion of placing an srcu_barrier() in the cleanup_srcu_struct path is that most users don't use call_srcu(), and I don't feel that we should be inflicting the srcu_barrier() performance penalty on them. So I agree with the user invoking call_srcu() after preventing future calls to call_srcu(), and with there being a flush_delayed_work() in cleanup_srcu_struct(). As in the untested (and probably whitespace-mangled) patch below. Thoughts? Thanx, Paul diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c index f2abfbae258c..5813ed848821 100644 --- a/kernel/rcu/srcu.c +++ b/kernel/rcu/srcu.c @@ -65,6 +65,17 @@ static inline bool rcu_batch_empty(struct rcu_batch *b) } /* + * Are all batches empty for the specified srcu_struct? + */ +static inline bool rcu_all_batches_empty(struct srcu_struct *sp) +{ + return rcu_batch_empty(&sp->batch_done) && + rcu_batch_empty(&sp->batch_check1) && + rcu_batch_empty(&sp->batch_check0) && + rcu_batch_empty(&sp->batch_queue); +} + +/* * Remove the callback at the head of the specified rcu_batch structure * and return a pointer to it, or return NULL if the structure is empty. */ @@ -251,6 +262,11 @@ void cleanup_srcu_struct(struct srcu_struct *sp) { if (WARN_ON(srcu_readers_active(sp))) return; /* Leakage unless caller handles error. */ + if (WARN_ON(!rcu_all_batches_empty(sp))) + return; /* Leakage unless caller handles error. */ + flush_delayed_work(&sp->work); + if (WARN_ON(sp->running)) + return; /* Caller forgot to stop doing call_srcu()? 
*/ free_percpu(sp->per_cpu_ref); sp->per_cpu_ref = NULL; } @@ -610,15 +626,9 @@ static void srcu_reschedule(struct srcu_struct *sp) { bool pending = true; - if (rcu_batch_empty(&sp->batch_done) && -rcu_batch_empty(&sp->batch_check1) && -rcu_batch_empty(&sp->batch_check0) && -rcu_batch_empty(&sp->batch_queue)) { + if (rcu_all_batches_empty(sp)) { spin_lock_irq(&sp->queue_lock); - if (rcu_batch_empty(&sp->batch_done) && -rcu_batch_empty(&sp->batch_check1) && -rcu_batch_empty(&sp->batch_check0) && -rcu_batch_empty(&sp->batch_queue)) { + if (rcu_all_batches_empty(sp)) { sp->running = false; pending = false; }
Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
On Wed, Jan 18, 2017 at 3:00 AM, Peter Zijlstra wrote: > On Tue, Jan 17, 2017 at 12:53:21PM -0800, Paul E. McKenney wrote: >> On Tue, Jan 17, 2017 at 04:55:22AM +0100, Frederic Weisbecker wrote: >> >> [ . . . ] >> >> > In fact due to the complexity involved, I have to ask first if we >> > really need this feature. Typically nohz_full workloads don't want to >> > be disturbed at all, so do we have real and significant usecases of CPU >> > isolation workloads that want to be concerned by this membarrier so much >> > that they can tolerate some random IRQ? >> >> I believe that we need to explore the options for implementing it and >> to -at- -least- have a patch ready, even if that patch doesn't go >> upstream immediately. > > I tend to agree with Frederic here in that the design requirements seem > mutually exclusive. > > NOHZ_FULL users do _not_ want interruptions of any sort, in fact some > want to make that a hard fail of the task. > > OTOH sys_membarrier(CMD_SHARED) promises to serialize against anything > observable. > > The only logical solution is to error the sys_membarrier(CMD_SHARED) > call when a NOHZ_FULL task shares memory with the caller. Now > determining this is somewhat tricky of course :/ > > > I really don't see how there is another possible solution that makes > sense here. If there is shared memory between a NOHZ_FULL task and > others, a urcu implementation used by those must not rely on > sys_membarrier() but instead use a more expensive one, for instance one > where rcu_read_{,un}lock() do explicit counting and have memory barriers > in. Actually, agreed. After all, if we knew of a solution that made sense, we would have already implemented it, so some out-of-tree experimentation makes sense. 
One possibility is to have a flag that user processes could set that aborted (or splatted or whatever) on any interruption, so that tasks using signal URCU would avoid setting that flag, and, as you say, tasks setting the flag using URCU would need to use one of the memory-barrier variants. Thanx, Paul
Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00
> got this early-bootup crash on an SMP box: > > BUG: Unable to handle kernel NULL pointer dereference at virtual address > > printing eip: > c0131aec > *pde = > Oops: 0002 [#1] > PREEMPT SMP > Modules linked in: > CPU:1 > EIP:0060:[]Not tainted VLI > EFLAGS: 00010293 (2.6.12-rc1-RT-V0.7.41-00) > EIP is at rcu_advance_callbacks+0x3c/0x80 > eax: ebx: c050f280 ecx: c12191e0 edx: > esi: cfd2e560 edi: cfd2e4e0 ebp: cfd31dd0 esp: cfd31dc8 > ds: 007b es: 007b ss: 0068 preempt: 0003 > Process khelper (pid: 60, threadinfo=cfd3 task=cfd106a0) > Stack: 0001 c12191e0 cfd31de4 c0131b67 0001 cfd2e4d8 c13004d8 > cfd31e00 >c017e449 cfd2e4d8 c04d6e80 cfd32006 fffe cfd31e54 cfd31e70 > c01749cc >cfd2e4d8 cfd31e50 cfd31e4c 0001 cfd32001 cfd2e4d8 c03dd41f > c04cf920 > Call Trace: > [] show_stack+0x7f/0xa0 (28) > [] show_registers+0x16a/0x1e0 (56) > [] die+0x101/0x190 (64) > [] do_page_fault+0x442/0x680 (216) > [] error_code+0x2b/0x30 (68) > [] call_rcu+0x37/0x70 (20) > [] dput+0x139/0x210 (28) > [] __link_path_walk+0x9fc/0xf80 (112) > [] link_path_walk+0x4a/0x130 (100) > [] path_lookup+0x9e/0x1c0 (32) > [] open_exec+0x28/0x100 (100) > [] do_execve+0x44/0x220 (36) > [] sys_execve+0x42/0xa0 (36) > [] syscall_call+0x7/0xb (-8096) > --- > | preempt count: 0004 ] > | 4-level deep critical section nesting: > > .. [] call_rcu+0x1f/0x70 > .[] .. ( <= dput+0x139/0x210) > .. [] rcu_advance_callbacks+0x13/0x80 > .[] .. ( <= call_rcu+0x37/0x70) > .. [] _raw_spin_lock_irqsave+0x1a/0xa0 > .[] .. ( <= die+0x3f/0x190) > .. [] print_traces+0x16/0x50 > .[] .. ( <= show_stack+0x7f/0xa0) > Code: 00 00 e8 78 2d 0a 00 8b 0c 85 20 20 51 c0 bb 80 f2 50 c0 01 d9 f0 83 44 > 24 00 00 a1 88 19 52 c0 39 41 40 74 23 8b 41 44 8b 51 50 <89> 02 8b 41 48 c7 > 41 44 00 00 00 00 89 41 50 8d 41 44 89 41 48 > <6>note: khelper[60] exited with preempt_count 2 > > (gdb) list *0xc0131aec > 0xc0131aec is in rcu_advance_callbacks (kernel/rcupdate.c:558). 
> > 553 struct rcu_data *rdp; > 554 > 555 rdp = &get_cpu_var(rcu_data); > 556 smp_mb(); /* prevent sampling batch # before list > removal. */ > 557 if (rdp->batch != rcu_ctrlblk.batch) { > 558 *rdp->donetail = rdp->waitlist; > 559 rdp->donetail = rdp->waittail; > 560 rdp->waitlist = NULL; > 561 rdp->waittail = &rdp->waitlist; > 562 rdp->batch = rcu_ctrlblk.batch; > (gdb) Does the following help? Thanx, Paul diff -urpN -X dontdiff linux-2.6.11.fixes/kernel/rcupdate.c linux-2.6.11.fixes2/kernel/rcupdate.c --- linux-2.6.11.fixes/kernel/rcupdate.cMon Mar 21 08:14:47 2005 +++ linux-2.6.11.fixes2/kernel/rcupdate.c Mon Mar 21 08:17:00 2005 @@ -620,7 +620,7 @@ static void rcu_process_callbacks(void) return; } rdp->donelist = NULL; - rdp->donetail = &rdp->waitlist; + rdp->donetail = &rdp->donelist; put_cpu_var(rcu_data); while (list) { next = list->next; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
> > But if you are suppressing preemption in all read-side critical sections, > > then wouldn't any already-preempted tasks be guaranteed to -not- be in > > a read-side critical section, and therefore be guaranteed to be unaffected > > by the update (in other words, wouldn't such tasks not need to be waited > > for)? > > Ah, if you want to inc and dec all the time, yes. But even if the > performance isn't hurt, it's unneccessary, and something else people > have to remember to do. I must admit that free is a very good price. > Simplicity is very nice. And in the case of module unload, gives us > the ability to avoid the distinction between "am I calling into a > module?" and "is this fixed in the kernel?" at runtime. A very good > thing 8) Is it also desirable to avoid the distinction between "the currently executing code is in a module" and "the currently executing code is fixed in the kernel"? > Rusty. Thanx, Paul
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
> On Tue, 10 Apr 2001, Paul McKenney wrote: > > The algorithms we have been looking at need to have absolute guarantees > > that earlier activity has completed. The most straightforward way to > > guarantee this is to have the critical-section activity run with preemption > > disabled. Most of these code segments either take out locks or run > > with interrupts disabled anyway, so there is little or no degradation of > > latency in this case. In fact, in many cases, latency would actually be > > improved due to removal of explicit locking primitives. > > > > I believe that one of the issues that pushes in this direction is the > > discovery that "synchronize_kernel()" could not be a nop in a UP kernel > > unless the read-side critical sections disable preemption (either in > > the natural course of events, or artificially if need be). Andi or > > Rusty can correct me if I missed something in the previous exchange... > > > > The read-side code segments are almost always quite short, and, again, > > they would almost always otherwise need to be protected by a lock of > > some sort, which would disable preemption in any event. > > > > Thoughts? > > Disabling preemption is a possible solution if the critical section is short > - less than 100us - otherwise preemption latencies become a problem. Seems like a reasonable restriction. Of course, this same limit applies to locks and interrupt disabling, right? > The implementation of synchronize_kernel() that Rusty and I discussed > earlier in this thread would work in other cases, such as module > unloading, where there was a concern that it was not practical to have > any sort of lock in the read-side code path and the write side was not > time critical. True, but only if the synchronize_kernel() implementation is applied to UP kernels, also. Thanx, Paul > Nigel Gamble[EMAIL PROTECTED] > Mountain View, CA, USA. 
http://www.nrg.org/ > > MontaVista Software [EMAIL PROTECTED]
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
> > As you've observed, with the approach of waiting for all pre-empted tasks > > to synchronize, the possibility of a task staying pre-empted for a long > > time could affect the latency of an update/synchonize (though its hard for > > me to judge how likely that is). > > It's very unlikely on a system that doesn't already have problems with > CPU starvation because of runaway real-time tasks or interrupt handlers. Agreed! > First, preemption is a comparitively rare event with a mostly > timesharing load, typically from 1% to 10% of all context switches. Again, agreed! > Second, the scheduler should not penalize the preempted task for being > preempted, so that it should usually get to continue running as soon as > the preempting task is descheduled, which is at most one timeslice for > timesharing tasks. The algorithms we have been looking at need to have absolute guarantees that earlier activity has completed. The most straightforward way to guarantee this is to have the critical-section activity run with preemption disabled. Most of these code segments either take out locks or run with interrupts disabled anyway, so there is little or no degradation of latency in this case. In fact, in many cases, latency would actually be improved due to removal of explicit locking primitives. I believe that one of the issues that pushes in this direction is the discovery that "synchronize_kernel()" could not be a nop in a UP kernel unless the read-side critical sections disable preemption (either in the natural course of events, or artificially if need be). Andi or Rusty can correct me if I missed something in the previous exchange... The read-side code segments are almost always quite short, and, again, they would almost always otherwise need to be protected by a lock of some sort, which would disable preemption in any event. Thoughts? 
Thanx, Paul
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
> > > > 2. Isn't it possible to get in trouble even on a UP if a task > > > > is preempted in a critical region? For example, suppose the > > > > preempting task does a synchronize_kernel()? > > > > > > Ugly. I guess one way to solve it would be to readd the 2.2 scheduler > > > taskqueue, and just queue a scheduler callback in this case. > > > > Another approach would be to define a "really low" priority that noone > > other than synchronize_kernel() was allowed to use. Then the UP > > implementation of synchronize_kernel() could drop its priority to > > this level, yield the CPU, and know that all preempted tasks must > > have obtained and voluntarily yielded the CPU before synchronize_kernel () > > gets it back again. > > That just would allow nasty starvation, e.g. when someone runs a cpu intensive > screensaver or a seti-at-home. Good point! I hereby withdraw my suggested use of ultra-low priorities for UP implementations of synchronize_kernel(). ;-) > > I still prefer suppressing preemption on the read side, though I > > suppose one could claim that this is only because I am -really- > > used to it. ;-) > > For a lot of reader cases non-preemption by threads is guaranteed anyways -- > e.g. anything that runs in interrupts, timers, tasklets and network softirq. > I think that already covers a lot of interesting cases. Good point again! For example, this does cover most of the TCP/IP cases, right? Thanx, Paul
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
> > I see your point here, but need to think about it. One question: > > isn't it the case that the alternative to using synchronize_kernel() > > is to protect the read side with explicit locks, which will themselves > > suppress preemption? If so, why not just suppress preemption on the read > > side in preemptible kernels, and thus gain the simpler implementation > > of synchronize_kernel()? You are not losing any preemption latency > > compared to a kernel that uses traditional locks, in fact, you should > > improve latency a bit since the lock operations are more expensive than > > are simple increments and decrements. As usual, what am I missing > > here? ;-) > > Already preempted tasks. But if you are suppressing preemption in all read-side critical sections, then wouldn't any already-preempted tasks be guaranteed to -not- be in a read-side critical section, and therefore be guaranteed to be unaffected by the update (in other words, wouldn't such tasks not need to be waited for)? > > Another approach would be to define a "really low" priority that noone > > other than synchronize_kernel() was allowed to use. Then the UP > > implementation of synchronize_kernel() could drop its priority to > > this level, yield the CPU, and know that all preempted tasks must > > have obtained and voluntarily yielded the CPU before synchronize_kernel () > > gets it back again. > > Or "never", because I'm running RC5 etc. 8(. U... Good point! Never mind use of low priorities in UP kernels for synchronize_kernel()... Thanx, Paul
Re: [PATCH for 2.5] preemptible kernel
Andi, thank you for the background! More comments interspersed... > On Fri, Apr 06, 2001 at 04:52:25PM -0700, Paul McKenney wrote: > > 1. On a busy system, isn't it possible for a preempted task > > to stay preempted for a -long- time, especially if there are > > lots of real-time tasks in the mix? > > The problem you're describing is probably considered too hard to > solve properly (bad answer, but that is how it is currently) > > Yes there is. You can also force a normal (non preemptive) kernel > into complete livelock by just giving it enough network traffic > to do, so that it always works in the high priority network > softirq or doing the same with some other interrupt. > > Just when this happens a lot of basic things will stop working (like > page cleaning, IO flushing etc.), so your callbacks or module unloads > not running are probably the least of your worries. > > The same problem applies to a smaller scale to real time processes; > kernel services normally do not run real-time, so they can be starved. > > Priority inversion is not handled in Linux kernel ATM BTW, there > are already situations where a realtime task can cause a deadlock > with some lower priority system thread (I believe there is at least > one case of this known with realtime ntpd on 2.4) I see your point here, but need to think about it. One question: isn't it the case that the alternative to using synchronize_kernel() is to protect the read side with explicit locks, which will themselves suppress preemption? If so, why not just suppress preemption on the read side in preemptible kernels, and thus gain the simpler implementation of synchronize_kernel()? You are not losing any preemption latency compared to a kernel that uses traditional locks, in fact, you should improve latency a bit since the lock operations are more expensive than are simple increments and decrements. As usual, what am I missing here? ;-) > > 2. 
Isn't it possible to get in trouble even on a UP if a task > > is preempted in a critical region? For example, suppose the > > preempting task does a synchronize_kernel()? > > Ugly. I guess one way to solve it would be to readd the 2.2 scheduler > taskqueue, and just queue a scheduler callback in this case. Another approach would be to define a "really low" priority that noone other than synchronize_kernel() was allowed to use. Then the UP implementation of synchronize_kernel() could drop its priority to this level, yield the CPU, and know that all preempted tasks must have obtained and voluntarily yielded the CPU before synchronize_kernel() gets it back again. I still prefer suppressing preemption on the read side, though I suppose one could claim that this is only because I am -really- used to it. ;-) Thanx, Paul
Re: [PATCH for 2.5] preemptible kernel
Please accept my apologies if I am missing something basic, but... 1. On a busy system, isn't it possible for a preempted task to stay preempted for a -long- time, especially if there are lots of real-time tasks in the mix? 2. Isn't it possible to get in trouble even on a UP if a task is preempted in a critical region? For example, suppose the preempting task does a synchronize_kernel()? If I understand the preemptible kernel patch, suppressing preemption is quite cheap -- just increment preempt_count before and decrement it after, with no locks, atomic operations, or disabling of interrupts required. I understand that disabling preemption for any long time is a Bad Thing, however, Dipankar's patch is able to detect situations where preemption has been suppressed for too long. In addition, Rusty's original synchronize_kernel() patch is much simpler than the two that wait for preempted tasks to reach a voluntary context switch. What am I missing here? Thanx, Paul > > Here is an attempt at a possible version of synchronize_kernel() that > > should work on a preemptible kernel. I haven't tested it yet. > > It's close, but... > > Those who suggest that we don't do preemtion on SMP make this much > easier (synchronize_kernel() is a NOP on UP), and I'm starting to > agree with them. Anyway: > > > if (p->state == TASK_RUNNING || > > (p->state == (TASK_RUNNING|TASK_PREEMPTED))) { > > p->flags |= PF_SYNCING; > > Setting a running task's flags brings races, AFAICT, and checking > p->state is NOT sufficient, consider wait_event(): you need p->has_cpu > here I think. You could do it for TASK_PREEMPTED only, but you'd have > to do the "unreal priority" part of synchronize_kernel() with some > method to say "don't preempt anyone", but it will hurt latency. > Hmmm... > > The only way I can see is to have a new element in "struct > task_struct" saying "syncing now", which is protected by the runqueue > lock. 
This looks like (and I prefer wait queues, they have such nice > helpers): > > static DECLARE_WAIT_QUEUE_HEAD(syncing_task); > static DECLARE_MUTEX(synchronize_kernel_mtx); > static int sync_count = 0; > > schedule(): > if (!(prev->state & TASK_PREEMPTED) && prev->syncing) > if (--sync_count == 0) wake_up(&syncing_task); > > synchronize_kernel(): > { > struct list_head *tmp; > struct task_struct *p; > > /* Guard against multiple calls to this function */ > down(&synchronize_kernel_mtx); > > /* Everyone running now or currently preempted must >voluntarily schedule before we know we are safe. */ > spin_lock_irq(&runqueue_lock); > list_for_each(tmp, &runqueue_head) { > p = list_entry(tmp, struct task_struct, run_list); > if (p->has_cpu || p->state == (TASK_RUNNING|TASK_PREEMPTED)) { > p->syncing = 1; > sync_count++; > } > } > spin_unlock_irq(&runqueue_lock); > > /* Wait for them all */ > wait_event(syncing_task, sync_count == 0); > up(&synchronize_kernel_mtx); > } > > Also untested 8), > Rusty.
Re: [Lse-tech] Re: a quest for a better scheduler
> Just a quick comment. Andrea, unless your machine has some hardware > that imply pernode runqueues will help (nodelevel caches etc), I fail > to understand how this is helping you ... here's a simple theory though. > If your system is lightly loaded, your pernode queues are actually > implementing some sort of affinity, making sure processes stick to > cpus on nodes where they have allocated most of their memory on. I am > not sure what the situation will be under huge loads though. Exactly. If a given task has run on a particular node for a while, its memory will tend to be allocated on that node. So preferentially running it on another CPU on that same node should get you better memory latency than would running it on some other node's CPUs. In addition, continuing to run the task on a particular node means that more of that task's memory is from that node, which again means good memory latency. In contrast, if you move a task back and forth between nodes, it can end up with its memory spread over many nodes, which means that it does not get good memory latency no matter where you run it. Thanx, Paul > As I have mentioned to some people before, percpu/pernode/percpuset/global > runqueues probably all have their advantages and disadvantages, and their > own sweet spots. Wouldn't it be really neat if a system administrator > or performance expert could pick and choose what scheduler behavior he > wants, based on how the system is going to be used? > > Kanoj