Re: [PATCH v3] sched/cputime: Fix using smp_processor_id() in preemptible

2017-07-20 Thread Paul McKenney
On Wed, Jul 19, 2017 at 4:38 PM, Paul McKenney  wrote:
> On Thu, Jul 13, 2017 at 11:49 PM, Wanpeng Li  wrote:
>>
>> Ping for the merge window. :)
>> 2017-07-09 15:40 GMT+08:00 Wanpeng Li :
>> > From: Wanpeng Li 
>> >
>> >  BUG: using smp_processor_id() in preemptible [] code: 99-trinity/181
>> >  caller is debug_smp_processor_id+0x17/0x19
>> >  CPU: 0 PID: 181 Comm: 99-trinity Not tainted 4.12.0-01059-g2a42eb9 #1
>> >  Call Trace:
>> >   dump_stack+0x82/0xb8
>> >   check_preemption_disabled+0xd1/0xe3
>> >   debug_smp_processor_id+0x17/0x19
>> >   vtime_delta+0xd/0x2c
>> >   task_cputime+0x89/0xdb
>> >   thread_group_cputime+0x11b/0x1ed
>> >   thread_group_cputime_adjusted+0x1f/0x47
>> >   wait_consider_task+0x2a9/0xaf9
>> >   ? lock_acquire+0x97/0xa4
>> >   do_wait+0xdf/0x1f4
>> >   SYSC_wait4+0x8e/0xb5
>> >   ? list_add+0x34/0x34
>> >   SyS_wait4+0x9/0xb
>> >   do_syscall_64+0x70/0x82
>> >   entry_SYSCALL64_slow_path+0x25/0x25
>> >
>> > As Frederic pointed out:
>> >
>> > | Although those sched_clock_cpu() things seem to only matter when the
>> > | sched_clock() is unstable. And that stability is a condition for nohz_full
>> > | to work anyway. So probably sched_clock() alone would be enough.
>> >
>> > This patch fixes it by replacing sched_clock_cpu() with sched_clock() to
>> > avoid calling smp_processor_id() in preemptible context.
>
>
> I am hitting this with rcutorture, so have kicked off an overnight
> run with this patch.

Which passed, so:

Tested-by: Paul E. McKenney 

 Thanx, Paul


>> > Reported-by: Xiaolong Ye 
>> > Cc: Thomas Gleixner 
>> > Cc: Luiz Capitulino 
>> > Cc: Frederic Weisbecker 
>> > Cc: Rik van Riel 
>> > Cc: Peter Zijlstra 
>> > Cc: Ingo Molnar 
>> > Signed-off-by: Wanpeng Li 
>> > ---
>> >  kernel/sched/cputime.c | 6 +++---
>> >  1 file changed, 3 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> > index 6e3ea4a..14d2dbf 100644
>> > --- a/kernel/sched/cputime.c
>> > +++ b/kernel/sched/cputime.c
>> > @@ -683,7 +683,7 @@ static u64 vtime_delta(struct vtime *vtime)
>> >  {
>> > unsigned long long clock;
>> >
>> > -   clock = sched_clock_cpu(smp_processor_id());
>> > +   clock = sched_clock();
>> > if (clock < vtime->starttime)
>> > return 0;
>> >
>> > @@ -814,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
>> >
>> > write_seqcount_begin(&vtime->seqcount);
>> > vtime->state = VTIME_SYS;
>> > -   vtime->starttime = sched_clock_cpu(smp_processor_id());
>> > +   vtime->starttime = sched_clock();
>> > write_seqcount_end(&vtime->seqcount);
>> >  }
>> >
>> > @@ -826,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
>> > local_irq_save(flags);
>> > write_seqcount_begin(&vtime->seqcount);
>> > vtime->state = VTIME_SYS;
>> > -   vtime->starttime = sched_clock_cpu(cpu);
>> > +   vtime->starttime = sched_clock();
>> > write_seqcount_end(&vtime->seqcount);
>> > local_irq_restore(flags);
>> >  }
>> > --
>> > 2.7.4
>> >
>
>
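For contrast, the usual way to keep the per-CPU clock while silencing this splat would be to pin the task around the call. A hypothetical sketch (vtime_delta_pinned() is an illustrative name, not something in the patch):

/*
 * Illustrative alternative, NOT what the patch above does: make the
 * smp_processor_id() user legal by disabling preemption around it.
 */
static u64 vtime_delta_pinned(struct vtime *vtime)
{
	unsigned long long clock;
	int cpu = get_cpu();		/* disables preemption */

	clock = sched_clock_cpu(cpu);
	put_cpu();			/* re-enables preemption */

	if (clock < vtime->starttime)
		return 0;
	return clock - vtime->starttime;
}

Since nohz_full already requires a stable sched_clock(), the patch instead drops the per-CPU variant entirely, which avoids adding a preempt_disable()/preempt_enable() pair to every accounting update.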


Re: [PATCH 07/10] rcu: Separate the RCU synchronization types and APIs into <linux/rcupdate_wait.h>

2017-02-11 Thread Paul McKenney
On Wed, Feb 8, 2017 at 10:34 AM, Ingo Molnar  wrote:
> So rcupdate.h is a pretty complex header, in particular it includes
>  which includes  - creating a
> dependency that includes  in ,
> which prevents the isolation of  from the derived
>  header.
>
> Solve part of the problem by decoupling rcupdate.h from completions:
> this can be done by separating out the rcu_synchronize types and APIs,
> and updating their usage sites.
>
> Since these are mostly RCU-internal types this will not just simplify
> 's dependencies, but will make all the hundreds of
> .c files that include rcupdate.h but not completions or wait.h build
> faster.

Indeed, rcupdate.h is overdue for a more sweeping overhaul.

> ( For rcutiny this means that two dependent APIs have to be uninlined,
>   but that shouldn't be much of a problem as they are rare variants. )

Do people still care about Tiny kernel?  If so, I can queue a patch
that leaves rcu_barrier_bh() and rcu_barrier_sched() inline, but
creates an rcu_barrier_generic() or some such, so that there is
space taken up by only one EXPORT_SYMBOL() instead of two.
(0day Test Robot yells at me every time I add one...)
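Roughly, what I have in mind is something like the following sketch (hypothetical code, not an actual patch; rcu_barrier_generic() and its argument are placeholders):

/* kernel/rcu/tiny.c: a single exported helper instead of two. */
void rcu_barrier_generic(call_rcu_func_t crf)
{
	wait_rcu_gp(crf);	/* queue a callback via crf, wait for it */
}
EXPORT_SYMBOL(rcu_barrier_generic);

/* include/linux/rcutiny.h: the two public APIs stay inline. */
static inline void rcu_barrier_bh(void)
{
	rcu_barrier_generic(call_rcu_bh);
}

static inline void rcu_barrier_sched(void)
{
	rcu_barrier_generic(call_rcu_sched);
}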

Other than that, I don't see any problems with this.  I will do some
testing.

   Thanx, Paul

> Cc: Linus Torvalds 
> Cc: Mike Galbraith 
> Cc: Paul E. McKenney 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ingo Molnar 
> ---
>  fs/autofs4/autofs_i.h |  1 +
>  include/linux/dcache.h|  1 +
>  include/linux/rcupdate.h  | 40 
>  include/linux/rcupdate_wait.h | 50 ++
>  include/linux/rcutiny.h   | 11 ++-
>  kernel/rcu/srcu.c |  2 +-
>  kernel/rcu/tiny.c | 14 +-
>  kernel/rcu/tree.c |  2 +-
>  kernel/rcu/update.c   |  1 +
>  kernel/sched/core.c   |  1 +
>  10 files changed, 71 insertions(+), 52 deletions(-)
>
> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
> index c885daae68c8..beef981aa54f 100644
> --- a/fs/autofs4/autofs_i.h
> +++ b/fs/autofs4/autofs_i.h
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /* This is the range of ioctl() numbers we claim as ours */
>  #define AUTOFS_IOC_FIRST AUTOFS_IOC_READY
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index c965e4469499..16948defb568 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  struct path;
>  struct vfsmount;
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 01f71e1d2e94..1f476b63c596 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -40,7 +40,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -226,45 +225,6 @@ void call_rcu_sched(struct rcu_head *head,
>
>  void synchronize_sched(void);
>
> -/*
> - * Structure allowing asynchronous waiting on RCU.
> - */
> -struct rcu_synchronize {
> -   struct rcu_head head;
> -   struct completion completion;
> -};

> -void wakeme_after_rcu(struct rcu_head *head);
> -
> -void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
> -  struct rcu_synchronize *rs_array);
> -
> -#define _wait_rcu_gp(checktiny, ...) \
> -do {   \
> -   call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };   \
> -   struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];\
> -   __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),  \
> -   __crcu_array, __rs_array);  \
> -} while (0)
> -
> -#define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)
> -
> -/**
> - * synchronize_rcu_mult - Wait concurrently for multiple grace periods
> - * @...: List of call_rcu() functions for the flavors to wait on.
> - *
> - * This macro waits concurrently for multiple flavors of RCU grace periods.
> - * For example, synchronize_rcu_mult(call_rcu, call_rcu_bh) would wait
> - * on concurrent RCU and RCU-bh grace periods.  Waiting on a give SRCU
> - * domain requires you to write a wrapper function for that SRCU domain's
> - * call_srcu() function, supplying the corresponding srcu_struct.
> - *
> - * If Tiny RCU, tell _wait_rcu_gp() not to bother waiting for RCU
> - * or RCU-bh, given that anywhere synchronize_rcu_mult() can be called
> - * is automatically a grace period.
> - */
> -#define synchronize_rcu_mult(...) \
> -   _wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__)
> -
>  /**
>   * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
>   * @head: structure to be used for queueing the RCU updates.
> diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_w

Re: linux-next: build failure after merge of the rcu tree

2017-01-19 Thread Paul McKenney
On Wed, Jan 18, 2017 at 7:34 PM, Stephen Rothwell  wrote:
> Hi Paul,
>
> After merging the rcu tree, today's linux-next build (x86_64 allmodconfig)
> failed like this:
>
> net/smc/af_smc.c:102:16: error: 'SLAB_DESTROY_BY_RCU' undeclared here (not in a function)
>   .slab_flags = SLAB_DESTROY_BY_RCU,
> ^
>
> Caused by commit
>
>   c7a545924ca1 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU")
>
> interacting with commit
>
>   ac7138746e14 ("smc: establish new socket family")
>
> from the net-next tree.
>
> I have applied the following merge fix patch (someone will need to
> remember to mention this to Linus):

Thank you, Stephen!  I expect that there might be a bit more
bikeshedding on the name, but here is hoping...  :-/

  Thanx, Paul

> From: Stephen Rothwell 
> Date: Thu, 19 Jan 2017 14:29:12 +1100
> Subject: [PATCH] smc: merge fix for "mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU"
>
> Signed-off-by: Stephen Rothwell 
> ---
>  net/smc/af_smc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 4875e65f0c4a..a48260f9ebb7 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -99,7 +99,7 @@ struct proto smc_proto = {
> .unhash = smc_unhash_sk,
> .obj_size   = sizeof(struct smc_sock),
> .h.smc_hash = &smc_v4_hashinfo,
> -   .slab_flags = SLAB_DESTROY_BY_RCU,
> +   .slab_flags = SLAB_TYPESAFE_BY_RCU,
>  };
>  EXPORT_SYMBOL_GPL(smc_proto);
>
> --
> 2.10.2
>
> --
> Cheers,
> Stephen Rothwell


Re: kvm: use-after-free in process_srcu

2017-01-19 Thread Paul McKenney
(Trouble with VPN, so replying from gmail.)

On Thu, Jan 19, 2017 at 1:27 AM, Paolo Bonzini  wrote:
>
>
> On 18/01/2017 23:15, Paul E. McKenney wrote:
>> On Wed, Jan 18, 2017 at 09:53:19AM +0100, Paolo Bonzini wrote:
>>>
>>>
>>> On 17/01/2017 21:34, Paul E. McKenney wrote:
>>>> Do any of your callback functions invoke call_srcu()?  (Hey, I have to ask!)
>>>
>>> No, we only use synchronize_srcu and synchronize_srcu_expedited, so our
>>> only callback comes from there.
>>
>> OK, so the next question is whether your code makes sure that all of its
>> synchronize_srcu() and synchronize_srcu_expedited() calls return before
>> the call to cleanup_srcu_struct().
>
> It certainly should!  Or at least that would be our bug.
>
>> You should only need srcu_barrier() if there were calls to call_srcu().
>> Given that you only have synchronize_srcu() and synchronize_srcu_expedited(),
>> you -don't- need srcu_barrier().  What you need instead is to make sure
>> that all synchronize_srcu() and synchronize_srcu_expedited() have
>> returned before the call to cleanup_srcu_struct().
>
> Ok, good.
>
>>> If this is incorrect, then one flush_delayed_work is enough.  If it is
>>> correct, the possible alternatives are:
>>>
>>> * srcu_barrier in the caller, flush_delayed_work+WARN_ON(sp->running) in
>>> cleanup_srcu_struct.  I strongly dislike this one---because we don't use
>>> call_srcu at all, there should be no reason to use srcu_barrier in KVM
>>> code.  Plus I think all other users have the same issue.
>>>
>>> * srcu_barrier+flush_delayed_work+WARN_ON(sp->running) in
>>> cleanup_srcu_struct
>>>
>>> * flush_delayed_work+flush_delayed_work+WARN_ON(sp->running) in
>>> cleanup_srcu_struct
>>>
>>> * while(flush_delayed_work) in cleanup_srcu_struct
>>>
>>> * "while(sp->running) flush_delayed_work" in cleanup_srcu_struct
>>
>> My current thought is flush_delayed_work() followed by a warning if
>> there are any callbacks still posted, and also as you say sp->running.
>
> Yes, that would work for KVM and anyone else who doesn't use call_srcu
> (and order synchronize_srcu correctly against destruction).
>
> On the other hand, users of call_srcu, such as rcutorture, _do_ need to
> place an srcu_barrier before cleanup_srcu_struct, or they need two
> flush_delayed_work() calls back to back in cleanup_srcu_struct.  Do you
> agree?

The reason I am resisting the notion of placing an srcu_barrier() in
the cleanup_srcu_struct path is that most users don't use call_srcu(),
and I don't feel that we should be inflicting the srcu_barrier()
performance penalty on them.

So I agree with the user invoking srcu_barrier() after preventing future
calls to call_srcu(), and with there being a flush_delayed_work() in
cleanup_srcu_struct().  As in the untested (and probably
whitespace-mangled) patch below.

Thoughts?

   Thanx, Paul

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index f2abfbae258c..5813ed848821 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -65,6 +65,17 @@ static inline bool rcu_batch_empty(struct rcu_batch *b)
 }
 
 /*
+ * Are all batches empty for the specified srcu_struct?
+ */
+static inline bool rcu_all_batches_empty(struct srcu_struct *sp)
+{
+	return rcu_batch_empty(&sp->batch_done) &&
+	       rcu_batch_empty(&sp->batch_check1) &&
+	       rcu_batch_empty(&sp->batch_check0) &&
+	       rcu_batch_empty(&sp->batch_queue);
+}
+
+/*
  * Remove the callback at the head of the specified rcu_batch structure
  * and return a pointer to it, or return NULL if the structure is empty.
  */
@@ -251,6 +262,11 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
 {
 	if (WARN_ON(srcu_readers_active(sp)))
 		return; /* Leakage unless caller handles error. */
+	if (WARN_ON(!rcu_all_batches_empty(sp)))
+		return; /* Leakage unless caller handles error. */
+	flush_delayed_work(&sp->work);
+	if (WARN_ON(sp->running))
+		return; /* Caller forgot to stop doing call_srcu()? */
 	free_percpu(sp->per_cpu_ref);
 	sp->per_cpu_ref = NULL;
 }
@@ -610,15 +626,9 @@ static void srcu_reschedule(struct srcu_struct *sp)
 {
 	bool pending = true;
 
-	if (rcu_batch_empty(&sp->batch_done) &&
-	    rcu_batch_empty(&sp->batch_check1) &&
-	    rcu_batch_empty(&sp->batch_check0) &&
-	    rcu_batch_empty(&sp->batch_queue)) {
+	if (rcu_all_batches_empty(sp)) {
 		spin_lock_irq(&sp->queue_lock);
-		if (rcu_batch_empty(&sp->batch_done) &&
-		    rcu_batch_empty(&sp->batch_check1) &&
-		    rcu_batch_empty(&sp->batch_check0) &&
-		    rcu_batch_empty(&sp->batch_queue)) {
+		if (rcu_all_batches_empty(sp)) {
 			sp->running = false;
 			pending = false;
 		}
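With that in place, a subsystem that does use call_srcu() would be expected to tear down roughly as follows (illustrative sketch; my_stop_new_callbacks() stands in for whatever mechanism the caller uses to stop queueing new callbacks):

/* Illustrative teardown ordering for a call_srcu() user. */
static void example_srcu_shutdown(struct srcu_struct *sp)
{
	my_stop_new_callbacks();	/* hypothetical: no call_srcu() after this */
	srcu_barrier(sp);		/* wait for all already-queued callbacks */
	cleanup_srcu_struct(sp);	/* batches empty, so no warning and no leak */
}

Callers like KVM that only ever use synchronize_srcu() and synchronize_srcu_expedited() would skip the srcu_barrier() and simply make sure those calls have returned before cleanup_srcu_struct().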


Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration

2017-01-19 Thread Paul McKenney
On Wed, Jan 18, 2017 at 3:00 AM, Peter Zijlstra  wrote:
> On Tue, Jan 17, 2017 at 12:53:21PM -0800, Paul E. McKenney wrote:
>> On Tue, Jan 17, 2017 at 04:55:22AM +0100, Frederic Weisbecker wrote:
>>
>> [ . . . ]
>>
>> > In fact due to the complexity involved, I have to ask first if we
>> > really need this feature.  Typically nohz_full workloads don't want to
>> > be disturbed at all, so do we have real and significant usecases of CPU
>> > isolation workloads that want to be concerned by this membarrier so much
>> > that they can tolerate some random IRQ?
>>
>> I believe that we need to explore the options for implementing it and
>> to -at- -least- have a patch ready, even if that patch doesn't go
>> upstream immediately.
>
> I tend to agree with Frederic here in that the design requirements seem
> mutually exclusive.
>
> NOHZ_FULL users do _not_ want interruptions of any sort, in fact some
> want to make that a hard fail of the task.
>
> OTOH sys_membarrier(CMD_SHARED) promises to serialize against anything
> observable.
>
> The only logical solution is to error the sys_membarrier(CMD_SHARED)
> call when a NOHZ_FULL task shares memory with the caller. Now
> determining this is somewhat tricky of course :/
>
>
> I really don't see how there is another possible solution that makes
> sense here. If there is shared memory between a NOHZ_FULL task and
> others, a urcu implementation used by those must not rely on
> sys_membarrier() but instead use a more expensive one, for instance one
> where rcu_read_{,un}lock() do explicit counting and have memory barriers
> in.

Actually, agreed.  After all, if we knew of a solution that made
sense, we would have already implemented it, so some out-of-tree
experimentation makes sense.  One possibility is to have a flag that
user processes could set that aborted (or splatted or whatever) on any
interruption, so that tasks using signal URCU would avoid setting that
flag, and, as you say, tasks setting the flag using URCU would need to
use one of the memory-barrier variants.
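For reference, the "more expensive" read side Peter describes would look roughly like the user-space sketch below. It is deliberately simplified: only the reader is shown, the counter would have to be registered somewhere the writer can scan it, and a real implementation such as liburcu's memory-barrier flavor tracks a grace-period phase rather than a bare nesting count.

#include <stdatomic.h>

/* Per-thread reader nesting count, scanned by writers to detect quiescence. */
static _Thread_local volatile unsigned long urcu_mb_nesting;

static inline void urcu_mb_read_lock(void)
{
	urcu_mb_nesting++;
	atomic_thread_fence(memory_order_seq_cst);	/* counter update before protected loads */
}

static inline void urcu_mb_read_unlock(void)
{
	atomic_thread_fence(memory_order_seq_cst);	/* protected loads before counter update */
	urcu_mb_nesting--;
}

The two full fences on every read-side critical section are exactly the cost that the sys_membarrier()-based flavor avoids, which is why it is attractive everywhere except where it would disturb nohz_full CPUs.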

   Thanx, Paul


Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

2005-03-21 Thread Paul McKenney
> got this early-bootup crash on an SMP box:
> 
> BUG: Unable to handle kernel NULL pointer dereference at virtual address 
> 
>  printing eip:
> c0131aec
> *pde = 
> Oops: 0002 [#1]
> PREEMPT SMP 
> Modules linked in:
> CPU:1
> EIP:0060:[]Not tainted VLI
> EFLAGS: 00010293   (2.6.12-rc1-RT-V0.7.41-00) 
> EIP is at rcu_advance_callbacks+0x3c/0x80
> eax:    ebx: c050f280   ecx: c12191e0   edx: 
> esi: cfd2e560   edi: cfd2e4e0   ebp: cfd31dd0   esp: cfd31dc8
> ds: 007b   es: 007b   ss: 0068   preempt: 0003
> Process khelper (pid: 60, threadinfo=cfd3 task=cfd106a0)
> Stack: 0001 c12191e0 cfd31de4 c0131b67 0001 cfd2e4d8 c13004d8 
> cfd31e00 
>c017e449 cfd2e4d8 c04d6e80 cfd32006 fffe cfd31e54 cfd31e70 
> c01749cc 
>cfd2e4d8 cfd31e50 cfd31e4c 0001 cfd32001 cfd2e4d8 c03dd41f 
> c04cf920 
> Call Trace:
>  [] show_stack+0x7f/0xa0 (28)
>  [] show_registers+0x16a/0x1e0 (56)
>  [] die+0x101/0x190 (64)
>  [] do_page_fault+0x442/0x680 (216)
>  [] error_code+0x2b/0x30 (68)
>  [] call_rcu+0x37/0x70 (20)
>  [] dput+0x139/0x210 (28)
>  [] __link_path_walk+0x9fc/0xf80 (112)
>  [] link_path_walk+0x4a/0x130 (100)
>  [] path_lookup+0x9e/0x1c0 (32)
>  [] open_exec+0x28/0x100 (100)
>  [] do_execve+0x44/0x220 (36)
>  [] sys_execve+0x42/0xa0 (36)
>  [] syscall_call+0x7/0xb (-8096)
> ---
> | preempt count: 0004 ]
> | 4-level deep critical section nesting:
> 
> .. []  call_rcu+0x1f/0x70
> .[] ..   ( <= dput+0x139/0x210)
> .. []  rcu_advance_callbacks+0x13/0x80
> .[] ..   ( <= call_rcu+0x37/0x70)
> .. []  _raw_spin_lock_irqsave+0x1a/0xa0
> .[] ..   ( <= die+0x3f/0x190)
> .. []  print_traces+0x16/0x50
> .[] ..   ( <= show_stack+0x7f/0xa0)
> Code: 00 00 e8 78 2d 0a 00 8b 0c 85 20 20 51 c0 bb 80 f2 50 c0 01 d9 f0 83 44 
> 24 00 00 a1 88 19 52 c0 39 41 40 74 23 8b 41 44 8b 51 50 <89> 02 8b 41 48 c7 
> 41 44 00 00 00 00 89 41 50 8d 41 44 89 41 48 
>  <6>note: khelper[60] exited with preempt_count 2
> 
> (gdb) list *0xc0131aec
> 0xc0131aec is in rcu_advance_callbacks (kernel/rcupdate.c:558).
> 
> 553 struct rcu_data *rdp;
> 554
> 555 rdp = &get_cpu_var(rcu_data);
> 556 smp_mb();   /* prevent sampling batch # before list removal. */
> 557 if (rdp->batch != rcu_ctrlblk.batch) {
> 558 *rdp->donetail = rdp->waitlist;
> 559 rdp->donetail = rdp->waittail;
> 560 rdp->waitlist = NULL;
> 561 rdp->waittail = &rdp->waitlist;
> 562 rdp->batch = rcu_ctrlblk.batch;
> (gdb)

Does the following help?

Thanx, Paul

diff -urpN -X dontdiff linux-2.6.11.fixes/kernel/rcupdate.c linux-2.6.11.fixes2/kernel/rcupdate.c
--- linux-2.6.11.fixes/kernel/rcupdate.c	Mon Mar 21 08:14:47 2005
+++ linux-2.6.11.fixes2/kernel/rcupdate.c	Mon Mar 21 08:17:00 2005
@@ -620,7 +620,7 @@ static void rcu_process_callbacks(void)
 		return;
 	}
 	rdp->donelist = NULL;
-	rdp->donetail = &rdp->waitlist;
+	rdp->donetail = &rdp->donelist;
 	put_cpu_var(rcu_data);
 	while (list) {
 		next = list->next;
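For anyone puzzling over the oops: the invariant being restored is that ->donetail always points at the NULL pointer that terminates ->donelist. A simplified sketch of that tail-pointer convention (the struct and function here are illustrative, modeled on the rcu_data fields in the listing above):

struct example_rcu_data {			/* illustrative, mirrors rcu_data */
	struct rcu_head *donelist;		/* head of the "done" callback list */
	struct rcu_head **donetail;		/* always points at the NULL ending donelist */
	struct rcu_head *waitlist;
	struct rcu_head **waittail;
};

/*
 * After handing the done callbacks off for invocation, the list is empty,
 * so its terminating NULL now lives in ->donelist itself and ->donetail
 * must point back at it.  Pointing it at ->waitlist instead (the bug fixed
 * above) leaves ->donetail out of sync with ->donelist, so later splices
 * through *donetail corrupt the lists.
 */
static void example_reset_done(struct example_rcu_data *rdp)
{
	rdp->donelist = NULL;
	rdp->donetail = &rdp->donelist;		/* the one-line fix */
}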


Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-22 Thread Paul McKenney


> > But if you are suppressing preemption in all read-side critical sections,
> > then wouldn't any already-preempted tasks be guaranteed to -not- be in
> > a read-side critical section, and therefore be guaranteed to be unaffected
> > by the update (in other words, wouldn't such tasks not need to be waited
> > for)?
>
> Ah, if you want to inc and dec all the time, yes.  But even if the
> performance isn't hurt, it's unnecessary, and something else people
> have to remember to do.

I must admit that free is a very good price.

> Simplicity is very nice.  And in the case of module unload, gives us
> the ability to avoid the distinction between "am I calling into a
> module?" and "is this fixed in the kernel?" at runtime.  A very good
> thing 8)

Is it also desirable to avoid the distinction between "the currently
executing code is in a module" and "the currently executing code is
fixed in the kernel"?

> Rusty.

  Thanx, Paul




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-10 Thread Paul McKenney


> On Tue, 10 Apr 2001, Paul McKenney wrote:
> > The algorithms we have been looking at need to have absolute guarantees
> > that earlier activity has completed.  The most straightforward way to
> > guarantee this is to have the critical-section activity run with preemption
> > disabled.  Most of these code segments either take out locks or run
> > with interrupts disabled anyway, so there is little or no degradation of
> > latency in this case.  In fact, in many cases, latency would actually be
> > improved due to removal of explicit locking primitives.
> >
> > I believe that one of the issues that pushes in this direction is the
> > discovery that "synchronize_kernel()" could not be a nop in a UP kernel
> > unless the read-side critical sections disable preemption (either in
> > the natural course of events, or artificially if need be).  Andi or
> > Rusty can correct me if I missed something in the previous exchange...
> >
> > The read-side code segments are almost always quite short, and, again,
> > they would almost always otherwise need to be protected by a lock of
> > some sort, which would disable preemption in any event.
> >
> > Thoughts?
>
> Disabling preemption is a possible solution if the critical section is short
> - less than 100us - otherwise preemption latencies become a problem.

Seems like a reasonable restriction.  Of course, this same limit applies
to locks and interrupt disabling, right?

> The implementation of synchronize_kernel() that Rusty and I discussed
> earlier in this thread would work in other cases, such as module
> unloading, where there was a concern that it was not practical to have
> any sort of lock in the read-side code path and the write side was not
> time critical.

True, but only if the synchronize_kernel() implementation is applied to UP
kernels, also.

Thanx, Paul

> Nigel Gamble[EMAIL PROTECTED]
> Mountain View, CA, USA. http://www.nrg.org/
>
> MontaVista Software [EMAIL PROTECTED]




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-10 Thread Paul McKenney


> > As you've observed, with the approach of waiting for all pre-empted tasks
> > to synchronize, the possibility of a task staying pre-empted for a long
> > time could affect the latency of an update/synchronize (though it's hard for
> > me to judge how likely that is).
>
> It's very unlikely on a system that doesn't already have problems with
> CPU starvation because of runaway real-time tasks or interrupt handlers.

Agreed!

> First, preemption is a comparatively rare event with a mostly
> timesharing load, typically from 1% to 10% of all context switches.

Again, agreed!

> Second, the scheduler should not penalize the preempted task for being
> preempted, so that it should usually get to continue running as soon as
> the preempting task is descheduled, which is at most one timeslice for
> timesharing tasks.

The algorithms we have been looking at need to have absolute guarantees
that earlier activity has completed.  The most straightforward way to
guarantee this is to have the critical-section activity run with preemption
disabled.  Most of these code segments either take out locks or run
with interrupts disabled anyway, so there is little or no degradation of
latency in this case.  In fact, in many cases, latency would actually be
improved due to removal of explicit locking primitives.

I believe that one of the issues that pushes in this direction is the
discovery that "synchronize_kernel()" could not be a nop in a UP kernel
unless the read-side critical sections disable preemption (either in
the natural course of events, or artificially if need be).  Andi or
Rusty can correct me if I missed something in the previous exchange...

The read-side code segments are almost always quite short, and, again,
they would almost always otherwise need to be protected by a lock of
some sort, which would disable preemption in any event.
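To make the comparison concrete, here is a minimal sketch of the kind of read-side critical section being discussed, using today's preempt_disable()/preempt_enable() names for clarity (example_reader() and the data it touches are illustrative, not from any posted patch):

struct example_data;				/* whatever the readers examine */
extern struct example_data *example_global_ptr;	/* shared, updated elsewhere */
extern void examine(struct example_data *p);

/*
 * Once every CPU has passed through a voluntary context switch, no reader
 * like this can still hold a reference to the old data, so the updater may
 * then free it.
 */
static void example_reader(void)
{
	struct example_data *p;

	preempt_disable();		/* cheap: just increments preempt_count */
	p = example_global_ptr;		/* snapshot the shared pointer */
	examine(p);			/* must not block while preemption is off */
	preempt_enable();		/* decrement; possibly reschedule */
}

On UP this is also what lets synchronize_kernel() collapse to almost nothing: if the updater is running at all, no reader can currently be inside such a section.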

Thoughts?

  Thanx, Paul




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-07 Thread Paul McKenney


> > > > 2.   Isn't it possible to get in trouble even on a UP if a task
> > > >  is preempted in a critical region?  For example, suppose the
> > > >  preempting task does a synchronize_kernel()?
> > >
> > > Ugly. I guess one way to solve it would be to readd the 2.2 scheduler
> > > taskqueue, and just queue a scheduler callback in this case.
> >
> > Another approach would be to define a "really low" priority that noone
> > other than synchronize_kernel() was allowed to use.  Then the UP
> > implementation of synchronize_kernel() could drop its priority to
> > this level, yield the CPU, and know that all preempted tasks must
> > have obtained and voluntarily yielded the CPU before synchronize_kernel()
> > gets it back again.
>
> That just would allow nasty starvation, e.g. when someone runs a cpu intensive
> screensaver or a seti-at-home.

Good point!  I hereby withdraw my suggested use of ultra-low priorities
for UP implementations of synchronize_kernel().  ;-)

> > I still prefer suppressing preemption on the read side, though I
> > suppose one could claim that this is only because I am -really-
> > used to it.  ;-)
>
> For a lot of reader cases non-preemption by threads is guaranteed anyways --
> e.g. anything that runs in interrupts, timers, tasklets and network softirq.
> I think that already covers a lot of interesting cases.

Good point again!  For example, this does cover most of the TCP/IP
cases, right?

  Thanx, Paul




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-07 Thread Paul McKenney


> > I see your point here, but need to think about it.  One question:
> > isn't it the case that the alternative to using synchronize_kernel()
> > is to protect the read side with explicit locks, which will themselves
> > suppress preemption?  If so, why not just suppress preemption on the read
> > side in preemptible kernels, and thus gain the simpler implementation
> > of synchronize_kernel()?  You are not losing any preemption latency
> > compared to a kernel that uses traditional locks, in fact, you should
> > improve latency a bit since the lock operations are more expensive than
> > are simple increments and decrements.  As usual, what am I missing
> > here?  ;-)
>
> Already preempted tasks.

But if you are suppressing preemption in all read-side critical sections,
then wouldn't any already-preempted tasks be guaranteed to -not- be in
a read-side critical section, and therefore be guaranteed to be unaffected
by the update (in other words, wouldn't such tasks not need to be waited
for)?

> > Another approach would be to define a "really low" priority that noone
> > other than synchronize_kernel() was allowed to use.  Then the UP
> > implementation of synchronize_kernel() could drop its priority to
> > this level, yield the CPU, and know that all preempted tasks must
> > have obtained and voluntarily yielded the CPU before synchronize_kernel()
> > gets it back again.
>
> Or "never", because I'm running RC5 etc. 8(.

U...  Good point!  Never mind use of low priorities in UP kernels
for synchronize_kernel()...

   Thanx, Paul




Re: [PATCH for 2.5] preemptible kernel

2001-04-07 Thread Paul McKenney


Andi, thank you for the background!  More comments interspersed...

> On Fri, Apr 06, 2001 at 04:52:25PM -0700, Paul McKenney wrote:
> > 1.   On a busy system, isn't it possible for a preempted task
> >  to stay preempted for a -long- time, especially if there are
> >  lots of real-time tasks in the mix?
>
> The problem you're describing is probably considered too hard to
> solve properly (bad answer, but that is how it is currently)
>
> Yes there is. You can also force a normal (non preemptive) kernel
> into complete livelock by just giving it enough network traffic
> to do, so that it always works in the high priority network
> softirq or doing the same with some other interrupt.
>
> Just when this happens a lot of basic things will stop working (like
> page cleaning, IO flushing etc.), so your callbacks or module unloads
> not running are probably the least of your worries.
>
> The same problem applies to a smaller scale to real time processes;
> kernel services normally do not run real-time, so they can be starved.
>
> Priority inversion is not handled in Linux kernel ATM BTW, there
> are already situations where a realtime task can cause a deadlock
> with some lower priority system thread (I believe there is at least
> one case of this known with realtime ntpd on 2.4)

I see your point here, but need to think about it.  One question:
isn't it the case that the alternative to using synchronize_kernel()
is to protect the read side with explicit locks, which will themselves
suppress preemption?  If so, why not just suppress preemption on the read
side in preemptible kernels, and thus gain the simpler implementation
of synchronize_kernel()?  You are not losing any preemption latency
compared to a kernel that uses traditional locks, in fact, you should
improve latency a bit since the lock operations are more expensive than
are simple increments and decrements.  As usual, what am I missing
here?  ;-)

> > 2.   Isn't it possible to get in trouble even on a UP if a task
> >  is preempted in a critical region?  For example, suppose the
> >  preempting task does a synchronize_kernel()?
>
> Ugly. I guess one way to solve it would be to readd the 2.2 scheduler
> taskqueue, and just queue a scheduler callback in this case.

Another approach would be to define a "really low" priority that noone
other than synchronize_kernel() was allowed to use.  Then the UP
implementation of synchronize_kernel() could drop its priority to
this level, yield the CPU, and know that all preempted tasks must
have obtained and voluntarily yielded the CPU before synchronize_kernel()
gets it back again.

I still prefer suppressing preemption on the read side, though I
suppose one could claim that this is only because I am -really-
used to it.  ;-)

  Thanx, Paul




Re: [PATCH for 2.5] preemptible kernel

2001-04-06 Thread Paul McKenney

Please accept my apologies if I am missing something basic, but...

1.   On a busy system, isn't it possible for a preempted task
 to stay preempted for a -long- time, especially if there are
 lots of real-time tasks in the mix?

2.   Isn't it possible to get in trouble even on a UP if a task
 is preempted in a critical region?  For example, suppose the
 preempting task does a synchronize_kernel()?

If I understand the preemptible kernel patch, suppressing preemption
is quite cheap -- just increment preempt_count before and decrement
it after, with no locks, atomic operations, or disabling of interrupts
required.  I understand that disabling preemption for any long time
is a Bad Thing, however, Dipankar's patch is able to detect situations
where preemption has been suppressed for too long.  In addition, Rusty's
original synchronize_kernel() patch is much simpler than the two that
wait for preempted tasks to reach a voluntary context switch.

What am I missing here?

   Thanx, Paul

> > Here is an attempt at a possible version of synchronize_kernel() that
> > should work on a preemptible kernel. I haven't tested it yet.
>
> It's close, but...
>
> Those who suggest that we don't do preemtion on SMP make this much
> easier (synchronize_kernel() is a NOP on UP), and I'm starting to
> agree with them. Anyway:
>
> > 	if (p->state == TASK_RUNNING ||
> > 	    (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> > 		p->flags |= PF_SYNCING;
>
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think. You could do it for TASK_PREEMPTED only, but you'd have
> to do the "unreal priority" part of synchronize_kernel() with some
> method to say "don't preempt anyone", but it will hurt latency.
> Hmmm...
>
> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock. This looks like (and I prefer wait queues, they have such nice
> helpers):
>
> static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
> static DECLARE_MUTEX(synchronize_kernel_mtx);
> static int sync_count = 0;
>
> schedule():
> 	if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> 		if (--sync_count == 0) wake_up(&syncing_task);
>
> synchronize_kernel():
> {
> 	struct list_head *tmp;
> 	struct task_struct *p;
>
> 	/* Guard against multiple calls to this function */
> 	down(&synchronize_kernel_mtx);
>
> 	/* Everyone running now or currently preempted must
> 	   voluntarily schedule before we know we are safe. */
> 	spin_lock_irq(&runqueue_lock);
> 	list_for_each(tmp, &runqueue_head) {
> 		p = list_entry(tmp, struct task_struct, run_list);
> 		if (p->has_cpu || p->state == (TASK_RUNNING|TASK_PREEMPTED)) {
> 			p->syncing = 1;
> 			sync_count++;
> 		}
> 	}
> 	spin_unlock_irq(&runqueue_lock);
>
> 	/* Wait for them all */
> 	wait_event(syncing_task, sync_count == 0);
> 	up(&synchronize_kernel_mtx);
> }
>
> Also untested 8),
> Rusty.




Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Paul McKenney


> Just a quick comment. Andrea, unless your machine has some hardware
> that imply pernode runqueues will help (nodelevel caches etc), I fail
> to understand how this is helping you ... here's a simple theory though.
> If your system is lightly loaded, your pernode queues are actually
> implementing some sort of affinity, making sure processes stick to
> cpus on nodes where they have allocated most of their memory on. I am
> not sure what the situation will be under huge loads though.

Exactly.  If a given task has run on a particular nodes for a while,
its memory will tend to be allocated on that node.  So preferentially
running it on another CPU on that same node should get you better
memory latency than would running it on some other node's CPUs.

In addition, continuing to run the task on a particular node means
that more of that task's memory is from that node, which again means
good memory latency.  In contrast, if you move a task back and forth
between nodes, it can end up with its memory spread over many nodes,
which means that it does not get good memory latency no matter where
you run it.

  Thanx, Paul

> As I have mentioned to some people before, percpu/pernode/percpuset/global
> runqueues probably all have their advantages and disadvantages, and their
> own sweet spots. Wouldn't it be really neat if a system administrator
> or performance expert could pick and choose what scheduler behavior he
> wants, based on how the system is going to be used?
>
> Kanoj
