Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-10-03 Thread Peter Zijlstra
On Thu, Oct 03, 2013 at 09:04:59AM +0200, Ingo Molnar wrote:
> 
* Peter Zijlstra wrote:
> 
> > 
> 
> Fully agreed! :-)

haha.. never realized I sent that email completely empty. It was
supposed to contain the patch I later sent as 2/3.


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-10-03 Thread Ingo Molnar

* Peter Zijlstra wrote:

> 

Fully agreed! :-)

Ingo


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-10-02 Thread Peter Zijlstra



Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Oleg Nesterov
On 09/30, Peter Zijlstra wrote:
>
> On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> > For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> > wouldn't we?
> >
> > Without that there's no guarantee the fast path readers will have a MB
> > to observe the write critical section, unless I'm completely missing
> > something obvious here.
>
> Duh.. we should be looking at gp_state like Paul said.

Yes, yes, that is why we have xxx_is_idle(). Its name is confusing
even ignoring "xxx".

OK, I'll try to come up with the naming (but I'd like to hear suggestions ;)
and send the patch. I am going to add "exclusive" and "rcu_domain/ops"
later; currently percpu_rw_semaphore needs ->rw_sem anyway.

Oleg.



Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Oleg Nesterov
On 09/30, Peter Zijlstra wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but let's ignore this for
> > now).
>
> If you think the percpu_rwsem users can benefit, sure.. So far it's good I
> didn't go the percpu_rwsem route, for it looks like we got something
> better at the end of it ;-)

I think you could simply improve percpu_rwsem instead. Once we add
task_struct->cpuhp_ctr, percpu_rwsem and get_online_cpus/hotplug_begin
become absolutely congruent.

OTOH, it would be simpler to change hotplug first, then copy-and-paste
the improvements into percpu_rwsem, then see if we can simply convert
cpu_hotplug_begin/end into percpu_down/up_write.

> Well, if we make percpu_rwsem the de facto container of the pattern and
> use that throughout, we'd have only a single implementation

Not sure. I think it can have other users. But even if not, please look
at "struct sb_writers". Yes, I believe it makes sense to use percpu_rwsem
here, but note that it is actually an array of semaphores. I do not think
each element needs its own xxx_struct.

> and don't
> need the abstraction.

And even if struct percpu_rw_semaphore ends up being the only container
of xxx_struct, I think the code looks better and is more understandable
this way, exactly because it adds a new abstraction layer. Performance-wise
this should be free.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > unsigned long flags;
> >
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.

Yes,

> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.

Exactly.

> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().

Oh, I disagree. Even ignoring the fact that I believe xxx_struct itself
can have more users (I can be wrong, of course), I do think that
percpu_down_write_nonexclusive() makes sense (except "exclusive"
should be the argument of percpu_init_rwsem). And in fact the
initial implementation I sent didn't even have the "exclusive" mode.

Please look at uprobes (currently the only user). We do not really
need the global write-lock; we can do per-uprobe locking. However,
every caller needs to block the percpu_down_read() callers (dup_mmap).
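
(For concreteness, a hedged sketch of what the non-exclusive write side
could look like on top of the xxx_ primitives in this thread; this API is
only being proposed here, so every name below is illustrative:)

	/*
	 * Writers exclude readers, but not one another: xxx_enter()
	 * already refcounts concurrent writers via ->gp_count.
	 */
	void percpu_down_write_nonexclusive(struct percpu_rw_semaphore *brw)
	{
		xxx_enter(&brw->xxx);	/* readers fall back to the slow path */
	}

	void percpu_up_write_nonexclusive(struct percpu_rw_semaphore *brw)
	{
		xxx_exit(&brw->xxx);	/* fast path restored after a GP */
	}

	/*
	 * Caller-side sketch (uprobes-style): a per-object mutex, not a
	 * global lock, provides the writer/writer exclusion actually needed.
	 * "struct my_obj" is a made-up stand-in for the per-uprobe object.
	 */
	void update_one_object(struct my_obj *obj)
	{
		mutex_lock(&obj->mutex);
		percpu_down_write_nonexclusive(&dup_mmap_sem);
		/* ... update, with dup_mmap() readers blocked ... */
		percpu_up_write_nonexclusive(&dup_mmap_sem);
		mutex_unlock(&obj->mutex);
	}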

> Other than that; yes this makes sense if you care about write side
> performance and I think it's solid.

Great ;)

Oleg.



Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Peter Zijlstra
On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> wouldn't we?
> 
> Without that there's no guarantee the fast path readers will have a MB
> to observe the write critical section, unless I'm completely missing
> > something obvious here.

Duh.. we should be looking at gp_state like Paul said.
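
(Spelling out why gp_state is enough - a hedged reading of Oleg's code,
not a statement made in the thread:)

	/*
	 * The reader fast path tests xxx_is_idle(), i.e. gp_state == GP_IDLE,
	 * not gp_count.  gp_state only goes back to GP_IDLE in cb_rcu_func(),
	 * which runs a full grace period after the call_rcu_sched() issued by
	 * the last xxx_exit().  So by the time any reader can take the fast
	 * path again, a whole GP separates it from the write-side critical
	 * section - which supplies the memory barrier asked about above.
	 */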


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Peter Zijlstra
On Mon, Sep 30, 2013 at 02:59:42PM +0200, Peter Zijlstra wrote:

> > 
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > unsigned long flags;
> > 
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> > 
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
> 
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.
> 
> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.
> 
> > } else {
> > xxx->cb_state = CB_IDLE;
> > xxx->gp_state = GP_IDLE;
> 
> Nothing fancy happened and we're done.
> 
> > }
> > spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> > }
> > 
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > spin_lock_irq(&xxx->xxx_lock);
> > if (!--xxx->gp_count) {
> > if (xxx->cb_state == CB_IDLE) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else if (xxx->cb_state == CB_PENDING) {
> > xxx->cb_state = CB_REPLAY;
> > }
> > }
> > spin_unlock_irq(&xxx->xxx_lock);
> > }
> 
> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().
> 
> Other than that; yes this makes sense if you care about write side
> performance and I think it's solid.

Hmm, wait. I don't see how this is equivalent to:

xxx_end()
{
synchronize_sched();
atomic_dec(&xxx->counter);
}

For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
wouldn't we?

Without that there's no guarantee the fast path readers will have a MB
to observe the write critical section, unless I'm completely missing
something obvious here.


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Oleg Nesterov
On 09/29, Steven Rostedt wrote:
>
> On Sun, 29 Sep 2013 20:36:34 +0200
> Oleg Nesterov wrote:
>
>
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but let's ignore this for
> > now). Or freeze_super() (which currently looks buggy), perhaps something
> > else. This pattern
> >
>
> Just so I'm clear on what you are trying to implement... This is to
> handle the case (as Paul said) to see changes to state by RCU and back
> again? That is, it isn't enough to see that the state changed to
> something (like SLOW MODE), but we also need a way to see it change
> back?

Suppose this code was applied as is. Then we can change percpu_rwsem;
see the "patch" below. (Please ignore _expedited in the current code.)

This immediately makes percpu_up_write() much faster: it no longer
blocks. And contending writers (or even the same writer taking the
lock again) can avoid the synchronize_sched() in percpu_down_write().

And as a reminder, we can add xxx_struct->exclusive (or add the argument
to xxx_enter/exit), and then (with some other changes) we can kill
percpu_rw_semaphore->rw_sem.

> With get_online_cpus(), we need to see the state where it changed to
> "performing hotplug" where holders need to go into the slow path, and
> then also see the state change to "no longer performing hotplug" and the
> holders now go back to fast path. Is this the rationale for this email?

The same. cpu_hotplug_begin/end (I mean the code written by Peter) can
be changed to use xxx_enter/exit.
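
(Purely illustrative - this is not Peter's actual patch, just how that
mapping might look; cpuhp_xxx is a made-up name:)

	static struct xxx_struct cpuhp_xxx;

	void cpu_hotplug_begin(void)
	{
		xxx_enter(&cpuhp_xxx);	/* get_online_cpus() takes the slow path */
		mutex_lock(&cpu_hotplug.lock);
	}

	void cpu_hotplug_done(void)
	{
		mutex_unlock(&cpu_hotplug.lock);
		xxx_exit(&cpuhp_xxx);	/* non-blocking; the rcu callback
					 * restores the fast path after a GP */
	}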

Oleg.

--- x/include/linux/percpu-rwsem.h
+++ x/include/linux/percpu-rwsem.h
@@ -8,8 +8,8 @@
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
+   struct xxx_struct   xxx;
unsigned int __percpu   *fast_read_ctr;
-   atomic_twrite_ctr;
struct rw_semaphore rw_sem;
atomic_tslow_read_ctr;
wait_queue_head_t   write_waitq;
--- x/lib/percpu-rwsem.c
+++ x/lib/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw
 
/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
__init_rwsem(&brw->rw_sem, name, rwsem_key);
-   atomic_set(&brw->write_ctr, 0);
+   xxx_init(&brw->xxx, ...);
atomic_set(&brw->slow_read_ctr, 0);
init_waitqueue_head(&brw->write_waitq);
return 0;
@@ -25,6 +25,14 @@ int __percpu_init_rwsem(struct percpu_rw
 
 void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 {
+   might_sleep();
+
+   // pseudo code which needs another simple xxx_ helper
+   if (xxx->gp_state == GP_REPLAY)
+   xxx->gp_state = GP_PENDING;
+   if (xxx->gp_state)
+   synchronize_sched();
+
free_percpu(brw->fast_read_ctr);
brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
@@ -57,7 +65,7 @@ static bool update_fast_ctr(struct percp
bool success = false;
 
preempt_disable();
-   if (likely(!atomic_read(&brw->write_ctr))) {
+   if (likely(xxx_is_idle(&brw->xxx))) {
__this_cpu_add(*brw->fast_read_ctr, val);
success = true;
}
@@ -126,20 +134,7 @@ static int clear_fast_ctr(struct percpu_
  */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
-   /* tell update_fast_ctr() there is a pending writer */
-   atomic_inc(&brw->write_ctr);
-   /*
-* 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
-*so that update_fast_ctr() can't succeed.
-*
-* 2. Ensures we see the result of every previous this_cpu_add() in
-*update_fast_ctr().
-*
-* 3. Ensures that if any reader has exited its critical section via
-*fast-path, it executes a full memory barrier before we return.
-*See R_W case in the comment above update_fast_ctr().
-*/
-   synchronize_sched_expedited();
+   xxx_enter(&brw->xxx);
 
/* exclude other writers, and block the new readers completely */
down_write(&brw->rw_sem);
@@ -159,7 +154,5 @@ void percpu_up_write(struct percpu_rw_se
 * Insert the barrier before the next fast-path in down_read,
 * see W_R case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
-   /* the last writer unblocks update_fast_ctr() */
-   atomic_dec(&brw->write_ctr);
+   xxx_exit(&brw->xxx);
 }



Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Peter Zijlstra
On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but let's ignore this for
> now). 

If you think the percpu_rwsem users can benefit, sure.. So far it's good I
didn't go the percpu_rwsem route, for it looks like we got something
better at the end of it ;-)

> Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 
>   writer:
>   state = SLOW_MODE;
>   synchronize_rcu/sched();
> 
>   reader:
>   preempt_disable();  // or rcu_read_lock();
>   if (state != SLOW_MODE)
>   ...
> 
> is quite common.

Well, if we make percpu_rwsem the de facto container of the pattern and
use that throughout, we'd have only a single implementation and don't
need the abstraction.

That said; we could still use the idea proposed; so let me take a look.

> // .h ---
> 
> struct xxx_struct {
>   int gp_state;
> 
>   int gp_count;
>   wait_queue_head_t   gp_waitq;
> 
>   int cb_state;
>   struct rcu_head cb_head;
> };
> 
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
>   return !xxx->gp_state; /* GP_IDLE */
> }
> 
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
> 
> // .c ---
> 
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> 
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> 
> #define xxx_lock  gp_waitq.lock
> 
> void xxx_enter(struct xxx_struct *xxx)
> {
>   bool need_wait, need_sync;
> 
>   spin_lock_irq(&xxx->xxx_lock);
>   need_wait = xxx->gp_count++;
>   need_sync = xxx->gp_state == GP_IDLE;
>   if (need_sync)
>   xxx->gp_state = GP_PENDING;
>   spin_unlock_irq(&xxx->xxx_lock);
> 
>   BUG_ON(need_wait && need_sync);
> 
>   if (need_sync) {
>   synchronize_sched();
>   xxx->gp_state = GP_PASSED;
>   wake_up_all(&xxx->gp_waitq);
>   } else if (need_wait) {
>   wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
>   } else {
>   BUG_ON(xxx->gp_state != GP_PASSED);
>   }
> }
> 
> static void cb_rcu_func(struct rcu_head *rcu)
> {
>   struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
>   unsigned long flags;
> 
>   BUG_ON(xxx->gp_state != GP_PASSED);
>   BUG_ON(xxx->cb_state == CB_IDLE);
> 
>   spin_lock_irqsave(&xxx->xxx_lock, flags);
>   if (xxx->gp_count) {
>   xxx->cb_state = CB_IDLE;

This seems to be when a new xxx_begin() has happened after our last
xxx_end() and the sync_sched() from xxx_begin() merges with the
xxx_end() one and we're done.

>   } else if (xxx->cb_state == CB_REPLAY) {
>   xxx->cb_state = CB_PENDING;
>   call_rcu_sched(&xxx->cb_head, cb_rcu_func);

A later xxx_exit() has happened, and we need to requeue to catch a later
GP.

>   } else {
>   xxx->cb_state = CB_IDLE;
>   xxx->gp_state = GP_IDLE;

Nothing fancy happened and we're done.

>   }
>   spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
> 
> void xxx_exit(struct xxx_struct *xxx)
> {
>   spin_lock_irq(&xxx->xxx_lock);
>   if (!--xxx->gp_count) {
>   if (xxx->cb_state == CB_IDLE) {
>   xxx->cb_state = CB_PENDING;
>   call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>   } else if (xxx->cb_state == CB_PENDING) {
>   xxx->cb_state = CB_REPLAY;
>   }
>   }
>   spin_unlock_irq(&xxx->xxx_lock);
> }

So I don't immediately see the point of the concurrent write side;
percpu_rwsem wouldn't allow this and afaict neither would
freeze_super().

Other than that; yes this makes sense if you care about write side
performance and I think it's solid.
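
(To make the intended usage concrete - a hedged sketch; the xxx_
primitives are from the code above, everything else is made up:)

	static struct xxx_struct foo_sync;	/* zero-init == GP_IDLE/CB_IDLE */

	void foo_update(void)
	{
		xxx_enter(&foo_sync);	/* blocks until readers see slow mode */
		/* ... write-side critical section ... */
		xxx_exit(&foo_sync);	/* returns immediately; the rcu
					 * callback re-enables the fast path */
	}

	bool foo_read(void)
	{
		bool fast;

		preempt_disable();
		fast = xxx_is_idle(&foo_sync);	/* GP_IDLE: no writer and no
						 * callback still in flight */
		if (fast)
			do_fast_path();		/* made-up helper */
		preempt_enable();
		return fast;
	}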


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-30 Thread Oleg Nesterov
On 09/29, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> >
> > struct xxx_struct {
> > atomic_t counter;
> > };
> >
> > static inline bool xxx_is_idle(struct xxx_struct *xxx)
> > {
> > return atomic_read(&xxx->counter) == 0;
> > }
> >
> > static inline void xxx_enter(struct xxx_struct *xxx)
> > {
> > atomic_inc(&xxx->counter);
> > synchronize_sched();
> > }
> >
> > static inline void xxx_exit(struct xxx_struct *xxx)
> > {
> > synchronize_sched();
> > atomic_dec(&xxx->counter);
> > }
>
> But there is nothing for synchronize_sched() to wait for in the above.
> Presumably the caller of xxx_is_idle() is required to disable preemption
> or be under rcu_read_lock_sched()?

Yes, yes, sure, xxx_is_idle() should be called under preempt_disable().
(or rcu_read_lock() if xxx_enter() uses synchronize_rcu()).

> So you are trying to make something that abstracts the RCU-protected
> state-change pattern?  Or perhaps more accurately, the RCU-protected
> state-change-and-back pattern?

Yes, exactly.

> > struct xxx_struct {
> > int gp_state;
> >
> > int gp_count;
> > wait_queue_head_t   gp_waitq;
> >
> > int cb_state;
> > struct rcu_head cb_head;
>
>   spinlock_t  xxx_lock;  /* ? */

See

#define xxx_lock gp_waitq.lock

in .c below, but we can add another spinlock.

> This spinlock might not make the big-system guys happy, but it appears to
> be needed below.

Only the writers use this spinlock, and they should synchronize with each
other anyway. I don't think this can really penalize, say, percpu_down_write
or cpu_hotplug_begin.

> > // .c   
> > ---
> >
> > enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> >
> > enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> >
> > #define xxx_lock gp_waitq.lock
> >
> > void xxx_enter(struct xxx_struct *xxx)
> > {
> > bool need_wait, need_sync;
> >
> > spin_lock_irq(&xxx->xxx_lock);
> > need_wait = xxx->gp_count++;
> > need_sync = xxx->gp_state == GP_IDLE;
>
> Suppose ->gp_state is GP_PASSED.  It could transition to GP_IDLE at any
> time, right?

As you already pointed out below - no.

Once we incremented ->gp_count, nobody can set GP_IDLE. And if the
caller is the "first" writer (need_sync == T), nobody else can change
->gp_state, so xxx_enter() sets GP_PASSED locklessly.

> > if (need_sync)
> > xxx->gp_state = GP_PENDING;
> > spin_unlock_irq(&xxx->xxx_lock);
> >
> > BUG_ON(need_wait && need_sync);
> >
> > if (need_sync) {
> > synchronize_sched();
> > xxx->gp_state = GP_PASSED;
> > wake_up_all(&xxx->gp_waitq);
> > } else if (need_wait) {
> > wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
>
> Suppose the wakeup is delayed until after the state has been updated
> back to GP_IDLE?  Ah, presumably the non-zero ->gp_count prevents this.

Yes, exactly.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > unsigned long flags;
> >
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else {
> > xxx->cb_state = CB_IDLE;
> > xxx->gp_state = GP_IDLE;
> > }
>
> It took me a bit to work out the above.  It looks like the intent is
> to have the last xxx_exit() put the state back to GP_IDLE, which appears
> to be the state in which readers can use a fastpath.

Yes, and we offload this work to the rcu callback so xxx_exit() doesn't
block.

The only complication is the next writer which does xxx_enter() after
xxx_exit(). If there are no other writers, the next xxx_exit() should do

rcu_cancel(&xxx->cb_head);
call_rcu_sched(&xxx->cb_head, cb_rcu_func);

to "extend" the gp, but since we do not have rcu_cancel() it simply sets
CB_REPLAY to instruct cb_rcu_func() to reschedule itself.
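
(To make the replay case concrete, an illustrative timeline with two
writers A and B, assuming the callback has not yet run when B arrives:)

	A: xxx_exit()      -> gp_count 0, cb_state = CB_PENDING, cb queued
	B: xxx_enter()     -> gp_count 1; gp_state is still GP_PASSED,
	                      so no synchronize_sched() and no waiting
	B: xxx_exit()      -> gp_count 0, cb still queued: cb_state = CB_REPLAY
	cb_rcu_func() #1   -> sees CB_REPLAY: re-arms itself (CB_PENDING)
	cb_rcu_func() #2   -> gp_count 0 and not CB_REPLAY: CB_IDLE + GP_IDLE,
	                      readers may use the fast path again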

> This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
> there must be an xxx_exit() in our future.

Yes, but ->cb_state doesn't really matter if ->gp_count != 0 in xxx_exit()
or cb_rcu_func() (except it can't be CB_IDLE in cb_rcu_func).

> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > spin_lock_irq(&xxx->xxx_lock);
> > if (!--xxx->gp_count) {
> > if (xxx->cb_state == CB_IDLE) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else if (xxx->cb_state == CB_PENDING) {
> > xxx->cb_state = CB_REPLAY;
> > }
> > }
> > spin_unlock_irq(&xxx->xxx_lock);
> > }
> 
> Then we also have something like this?
> 
> bool 


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-29 Thread Steven Rostedt
On Sun, 29 Sep 2013 20:36:34 +0200
Oleg Nesterov wrote:

 
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 

Just so I'm clear on what you are trying to implement... This is to
handle the case (as Paul said) to see changes to state by RCU and back
again? That is, it isn't enough to see that the state changed to
something (like SLOW MODE), but we also need a way to see it change
back?

With get_online_cpus(), we need to see the state where it changed to
"performing hotplug" where holders need to go into the slow path, and
then also see the state change to "no longer performing hotplug" and the
holders now go back to fast path. Is this the rationale for this email?

Thanks,

-- Steve


Re: [RFC] introduce synchronize_sched_{enter,exit}()

2013-09-29 Thread Paul E. McKenney
On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Hello.
> 
> Paul, Peter, et al, could you review the code below?
> 
> I am not sending the patch; I think it is simpler to read the code
> inline (just in case, I didn't try to compile it yet).
> 
> It is functionally equivalent to
> 
>   struct xxx_struct {
>   atomic_t counter;
>   };
> 
>   static inline bool xxx_is_idle(struct xxx_struct *xxx)
>   {
> >   return atomic_read(&xxx->counter) == 0;
>   }
> 
>   static inline void xxx_enter(struct xxx_struct *xxx)
>   {
> >   atomic_inc(&xxx->counter);
> >   synchronize_sched();
> >   }
> > 
> >   static inline void xxx_exit(struct xxx_struct *xxx)
> >   {
> >   synchronize_sched();
> >   atomic_dec(&xxx->counter);
>   }

But there is nothing for synchronize_sched() to wait for in the above.
Presumably the caller of xxx_is_idle() is required to disable preemption
or be under rcu_read_lock_sched()?

> except: it records the state and synchronize_sched() is only called by
> xxx_enter() and only if necessary.
> 
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but let's ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 
>   writer:
>   state = SLOW_MODE;
>   synchronize_rcu/sched();
> 
>   reader:
>   preempt_disable();  // or rcu_read_lock();
>   if (state != SLOW_MODE)
>   ...
> 
> is quite common.

And this does guarantee that by the time the writer's synchronize_whatever()
exits, all readers will know that state==SLOW_MODE.
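
(To spell that out with a hedged, compilable sketch - state, SLOW_MODE,
do_fast() and do_slow() are made-up names, not anything from the kernel:)

	#define SLOW_MODE 1

	static int state;	/* 0 == fast mode */

	void writer_begin(void)
	{
		state = SLOW_MODE;
		/*
		 * Returns only after every CPU has passed a quiescent
		 * state, i.e. after every preempt-disabled reader section
		 * that might have missed SLOW_MODE has finished.
		 */
		synchronize_sched();
	}

	void reader(void)
	{
		preempt_disable();
		if (state != SLOW_MODE)
			do_fast();	/* safe: any writer is still inside
					 * its synchronize_sched() */
		else
			do_slow();
		preempt_enable();
	}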

> Note:
>   - This implementation allows multiple writers, and sometimes
> this makes sense.

If each writer atomically incremented SLOW_MODE, did its update, then
atomically decremented it, sure.  You could be more clever and avoid
unneeded synchronize_whatever() calls, but I would have to see a good
reason for doing so before recommending this.

OK, but you appear to be doing this below anyway.  ;-)

>   - But it's trivial to add "bool xxx->exclusive" set by xxx_init().
> If it is true only one xxx_enter() is possible, other callers
> should block until xxx_exit(). This is what percpu_down_write()
> actually needs.

Agreed.

>   - Probably it makes sense to add xxx->rcu_domain = RCU/SCHED/ETC.

Or just have pointers to the RCU functions in the xxx structure...
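
(One possible shape for that, purely illustrative:)

	struct xxx_rcu_ops {
		void (*sync)(void);		/* synchronize_sched(), ... */
		void (*call)(struct rcu_head *head,
			     void (*func)(struct rcu_head *head));
	};

	static const struct xxx_rcu_ops xxx_sched_ops = {
		.sync = synchronize_sched,
		.call = call_rcu_sched,
	};

	struct xxx_struct {
		const struct xxx_rcu_ops *ops;	/* picked at xxx_init() time */
		/* ... gp/cb state as in Oleg's code ... */
	};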

So you are trying to make something that abstracts the RCU-protected
state-change pattern?  Or perhaps more accurately, the RCU-protected
state-change-and-back pattern?

> Do you think it is correct? Makes sense? (BUG_ON's are just comments).

... Maybe ...   Please see below for commentary and a question.

Thanx, Paul

> Oleg.
> 
> // .h ---
> 
> struct xxx_struct {
>   int gp_state;
> 
>   int gp_count;
>   wait_queue_head_t   gp_waitq;
> 
>   int cb_state;
>   struct rcu_head cb_head;

spinlock_t  xxx_lock;  /* ? */

This spinlock might not make the big-system guys happy, but it appears to
be needed below.

> };
> 
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
>   return !xxx->gp_state; /* GP_IDLE */
> }
> 
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
> 
> // .c ---
> 
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> 
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> 
> #define xxx_lock  gp_waitq.lock
> 
> void xxx_enter(struct xxx_struct *xxx)
> {
>   bool need_wait, need_sync;
> 
>   spin_lock_irq(&xxx->xxx_lock);
>   need_wait = xxx->gp_count++;
>   need_sync = xxx->gp_state == GP_IDLE;

Suppose ->gp_state is GP_PASSED.  It could transition to GP_IDLE at any
time, right?

>   if (need_sync)
>   xxx->gp_state = GP_PENDING;
>   spin_unlock_irq(&xxx->xxx_lock);
> 
>   BUG_ON(need_wait && need_sync);
> 
>   if (need_sync) {
>   synchronize_sched();
>   xxx->gp_state = GP_PASSED;
>   wake_up_all(&xxx->gp_waitq);
>   } else if (need_wait) {
>   wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);

Suppose the wakeup is delayed until after the state has been updated
back to GP_IDLE?  Ah, presumably the non-zero ->gp_count prevents this.
Never mind!

>   } else {
>   BUG_ON(xxx->gp_state != GP_PASSED);
>   }
> }
> 
> static void cb_rcu_func(struct rcu_head *rcu)
> {
>   struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
>   unsigned long flags;
> 
>   BUG_ON(xxx->gp_state != GP_PASSED);
>   BUG_ON(xxx->cb_state == CB_IDLE);
> 
>   spin_lock_irqsave(&xxx->xxx_lock, flags);
>   if 
