Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Thu, Oct 03, 2013 at 09:04:59AM +0200, Ingo Molnar wrote:
> * Peter Zijlstra wrote:
> >
>
> Fully agreed! :-)

haha.. never realized I sent that email completely empty. It was
supposed to contain the patch I later sent as 2/3.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] introduce synchronize_sched_{enter,exit}()
* Peter Zijlstra wrote:

Fully agreed! :-)

	Ingo
Re: [RFC] introduce synchronize_sched_{enter,exit}()
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On 09/30, Peter Zijlstra wrote:
>
> On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> > For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> > wouldn't we?
> >
> > Without that there's no guarantee the fast path readers will have a MB
> > to observe the write critical section, unless I'm completely missing
> > something obvious here.
>
> Duh.. we should be looking at gp_state like Paul said.

Yes, yes, that is why we have xxx_is_idle(). Its name is confusing even
ignoring "xxx".

OK, I'll try to invent the naming (but I'd like to hear suggestions ;)
and send the patch. I am going to add "exclusive" and "rcu_domain/ops"
later; currently percpu_rw_semaphore needs ->rw_sem anyway.

Oleg.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On 09/30, Peter Zijlstra wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now).
>
> If you think the percpu_rwsem users can benefit sure.. So far its good I
> didn't go the percpu_rwsem route for it looks like we got something
> better at the end of it ;-)

I think you could simply improve percpu_rwsem instead. Once we add
task_struct->cpuhp_ctr, percpu_rwsem and get_online_cpus/hotplug_begin
become absolutely congruent.

OTOH, it would be simpler to change hotplug first, then copy-and-paste
the improvements into percpu_rwsem, then see if we can simply convert
cpu_hotplug_begin/end into percpu_down/up_write.

> Well, if we make percpu_rwsem the defacto container of the pattern and
> use that throughout, we'd have only a single implementation

Not sure. I think it can have other users. But even if not, please look
at "struct sb_writers". Yes, I believe it makes sense to use
percpu_rwsem here, but note that it is actually an array of semaphores.
I do not think each element needs its own xxx_struct.

> and don't
> need the abstraction.

And even if struct percpu_rw_semaphore will be the only container of
xxx_struct, I think the code looks better and more understandable this
way, exactly because it adds the new abstraction layer. Performance-wise
this should be free.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	unsigned long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.

Yes,

> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.

Exactly.

> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().

Oh, I disagree. Even ignoring the fact I believe xxx_struct itself can
have more users (I can be wrong of course), I do think that
percpu_down_write_nonexclusive() makes sense (except "exclusive" should
be the argument of percpu_init_rwsem). And in fact the initial
implementation I sent didn't even have the "exclusive" mode.

Please look at uprobes (currently the only user). We do not really need
the global write-lock, we can do the per-uprobe locking. However, every
caller needs to block the percpu_down_read() callers (dup_mmap).

> Other than that; yes this makes sense if you care about write side
> performance and I think its solid.

Great ;)

Oleg.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> wouldn't we?
>
> Without that there's no guarantee the fast path readers will have a MB
> to observe the write critical section, unless I'm completely missing
> something obvious here.

Duh.. we should be looking at gp_state like Paul said.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Mon, Sep 30, 2013 at 02:59:42PM +0200, Peter Zijlstra wrote:
>
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	unsigned long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.
>
> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.
>
> > 	} else {
> > 		xxx->cb_state = CB_IDLE;
> > 		xxx->gp_state = GP_IDLE;
>
> Nothing fancy happened and we're done.
>
> > 	}
> > 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> > }
> >
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	if (!--xxx->gp_count) {
> > 		if (xxx->cb_state == CB_IDLE) {
> > 			xxx->cb_state = CB_PENDING;
> > 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 		} else if (xxx->cb_state == CB_PENDING) {
> > 			xxx->cb_state = CB_REPLAY;
> > 		}
> > 	}
> > 	spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().
>
> Other than that; yes this makes sense if you care about write side
> performance and I think its solid.

Hmm, wait. I don't see how this is equivalent to:

	xxx_end()
	{
		synchronize_sched();
		atomic_dec(&xxx->counter);
	}

For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
wouldn't we?

Without that there's no guarantee the fast path readers will have a MB
to observe the write critical section, unless I'm completely missing
something obvious here.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On 09/29, Steven Rostedt wrote:
>
> On Sun, 29 Sep 2013 20:36:34 +0200
> Oleg Nesterov wrote:
>
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now). Or freeze_super() (which currently looks buggy), perhaps something
> > else. This pattern
>
> Just so I'm clear to what you are trying to implement... This is to
> handle the case (as Paul said) to see changes to state by RCU and back
> again? That is, it isn't enough to see that the state changed to
> something (like SLOW MODE), but we also need a way to see it change
> back?

Suppose this code was applied as is. Now we can change percpu_rwsem,
see the "patch" below. (please ignore _expedited in the current code).

This immediately makes percpu_up_write() much faster, it no longer
blocks. And the contending writers (or even the same writer which takes
it again) can avoid synchronize_sched() in percpu_down_write().

And to remind, we can add xxx_struct->exclusive (or add the argument to
xxx_enter/exit), and then (with some other changes) we can kill
percpu_rw_semaphore->rw_sem.

> With get_online_cpus(), we need to see the state where it changed to
> "performing hotplug" where holders need to go into the slow path, and
> then also see the state change to "no longer performing hotplug" and
> the holders now go back to fast path. Is this the rationale for this
> email?

The same. cpu_hotplug_begin/end (I mean the code written by Peter) can
be changed to use xxx_enter/exit.

Oleg.
--- x/include/linux/percpu-rwsem.h
+++ x/include/linux/percpu-rwsem.h
@@ -8,8 +8,8 @@
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
+	xxx_struct		xxx;
 	unsigned int __percpu	*fast_read_ctr;
-	atomic_t		write_ctr;
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
--- x/lib/percpu-rwsem.c
+++ x/lib/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw
 
 	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
 	__init_rwsem(&brw->rw_sem, name, rwsem_key);
-	atomic_set(&brw->write_ctr, 0);
+	xxx_init(&brw->xxx, ...);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
 	return 0;
@@ -25,6 +25,14 @@ int __percpu_init_rwsem(struct percpu_rw
 
 void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 {
+	might_sleep();
+
+	// pseudo code which needs another simple xxx_ helper
+	if (xxx->gp_state == GP_REPLAY)
+		xxx->gp_state = GP_PENDING;
+	if (xxx->gp_state)
+		synchronize_sched();
+
 	free_percpu(brw->fast_read_ctr);
 	brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
@@ -57,7 +65,7 @@ static bool update_fast_ctr(struct percp
 	bool success = false;
 
 	preempt_disable();
-	if (likely(!atomic_read(&brw->write_ctr))) {
+	if (likely(xxx_is_idle(&brw->xxx))) {
 		__this_cpu_add(*brw->fast_read_ctr, val);
 		success = true;
 	}
@@ -126,20 +134,7 @@ static int clear_fast_ctr(struct percpu_
  */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
-	/* tell update_fast_ctr() there is a pending writer */
-	atomic_inc(&brw->write_ctr);
-	/*
-	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
-	 *    so that update_fast_ctr() can't succeed.
-	 *
-	 * 2. Ensures we see the result of every previous this_cpu_add() in
-	 *    update_fast_ctr().
-	 *
-	 * 3. Ensures that if any reader has exited its critical section via
-	 *    fast-path, it executes a full memory barrier before we return.
-	 *    See R_W case in the comment above update_fast_ctr().
-	 */
-	synchronize_sched_expedited();
+	xxx_enter(&brw->xxx);
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -159,7 +154,5 @@ void percpu_up_write(struct percpu_rw_se
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
-	/* the last writer unblocks update_fast_ctr() */
-	atomic_dec(&brw->write_ctr);
+	xxx_exit(&brw->xxx);
 }
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now).

If you think the percpu_rwsem users can benefit sure.. So far its good I
didn't go the percpu_rwsem route for it looks like we got something
better at the end of it ;-)

> Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
>
> 	writer:
> 		state = SLOW_MODE;
> 		synchronize_rcu/sched();
>
> 	reader:
> 		preempt_disable();	// or rcu_read_lock();
> 		if (state != SLOW_MODE)
> 			...
>
> is quite common.

Well, if we make percpu_rwsem the defacto container of the pattern and
use that throughout, we'd have only a single implementation and don't
need the abstraction.

That said; we could still use the idea proposed; so let me take a look.

> // .h ---
>
> struct xxx_struct {
> 	int			gp_state;
>
> 	int			gp_count;
> 	wait_queue_head_t	gp_waitq;
>
> 	int			cb_state;
> 	struct rcu_head		cb_head;
> };
>
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> 	return !xxx->gp_state; /* GP_IDLE */
> }
>
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
>
> // .c ---
>
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>
> #define xxx_lock	gp_waitq.lock
>
> void xxx_enter(struct xxx_struct *xxx)
> {
> 	bool need_wait, need_sync;
>
> 	spin_lock_irq(&xxx->xxx_lock);
> 	need_wait = xxx->gp_count++;
> 	need_sync = xxx->gp_state == GP_IDLE;
> 	if (need_sync)
> 		xxx->gp_state = GP_PENDING;
> 	spin_unlock_irq(&xxx->xxx_lock);
>
> 	BUG_ON(need_wait && need_sync);
>
> 	if (need_sync) {
> 		synchronize_sched();
> 		xxx->gp_state = GP_PASSED;
> 		wake_up_all(&xxx->gp_waitq);
> 	} else if (need_wait) {
> 		wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
> 	} else {
> 		BUG_ON(xxx->gp_state != GP_PASSED);
> 	}
> }
>
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> 	unsigned long flags;
>
> 	BUG_ON(xxx->gp_state != GP_PASSED);
> 	BUG_ON(xxx->cb_state == CB_IDLE);
>
> 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> 	if (xxx->gp_count) {
> 		xxx->cb_state = CB_IDLE;

This seems to be when a new xxx_begin() has happened after our last
xxx_end() and the sync_sched() from xxx_begin() merges with the
xxx_end() one and we're done.

> 	} else if (xxx->cb_state == CB_REPLAY) {
> 		xxx->cb_state = CB_PENDING;
> 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);

A later xxx_exit() has happened, and we need to requeue to catch a later
GP.

> 	} else {
> 		xxx->cb_state = CB_IDLE;
> 		xxx->gp_state = GP_IDLE;

Nothing fancy happened and we're done.

> 	}
> 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
>
> void xxx_exit(struct xxx_struct *xxx)
> {
> 	spin_lock_irq(&xxx->xxx_lock);
> 	if (!--xxx->gp_count) {
> 		if (xxx->cb_state == CB_IDLE) {
> 			xxx->cb_state = CB_PENDING;
> 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 		} else if (xxx->cb_state == CB_PENDING) {
> 			xxx->cb_state = CB_REPLAY;
> 		}
> 	}
> 	spin_unlock_irq(&xxx->xxx_lock);
> }

So I don't immediately see the point of the concurrent write side;
percpu_rwsem wouldn't allow this and afaict neither would
freeze_super().

Other than that; yes this makes sense if you care about write side
performance and I think its solid.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On 09/29, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> >
> > struct xxx_struct {
> > 	atomic_t	counter;
> > };
> >
> > static inline bool xxx_is_idle(struct xxx_struct *xxx)
> > {
> > 	return atomic_read(&xxx->counter) == 0;
> > }
> >
> > static inline void xxx_enter(struct xxx_struct *xxx)
> > {
> > 	atomic_inc(&xxx->counter);
> > 	synchronize_sched();
> > }
> >
> > static inline void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	synchronize_sched();
> > 	atomic_dec(&xxx->counter);
> > }
>
> But there is nothing for synchronize_sched() to wait for in the above.
> Presumably the caller of xxx_is_idle() is required to disable preemption
> or be under rcu_read_lock_sched()?

Yes, yes, sure, xxx_is_idle() should be called under preempt_disable().
(or rcu_read_lock() if xxx_enter() uses synchronize_rcu()).

> So you are trying to make something that abstracts the RCU-protected
> state-change pattern? Or perhaps more accurately, the RCU-protected
> state-change-and-back pattern?

Yes, exactly.

> > struct xxx_struct {
> > 	int gp_state;
> >
> > 	int gp_count;
> > 	wait_queue_head_t gp_waitq;
> >
> > 	int cb_state;
> > 	struct rcu_head cb_head;
>
> 	spinlock_t xxx_lock; /* ? */

See

	#define xxx_lock	gp_waitq.lock

in .c below, but we can add another spinlock.

> This spinlock might not make the big-system guys happy, but it appears to
> be needed below.

Only the writers use this spinlock, and they should synchronize with
each other anyway. I don't think this can really penalize, say,
percpu_down_write or cpu_hotplug_begin.

> > // .c ---
> >
> > enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> >
> > enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> >
> > #define xxx_lock	gp_waitq.lock
> >
> > void xxx_enter(struct xxx_struct *xxx)
> > {
> > 	bool need_wait, need_sync;
> >
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	need_wait = xxx->gp_count++;
> > 	need_sync = xxx->gp_state == GP_IDLE;
>
> Suppose ->gp_state is GP_PASSED. It could transition to GP_IDLE at any
> time, right?

As you already pointed out below - no. Once we incremented ->gp_count,
nobody can set GP_IDLE. And if the caller is the "first" writer
(need_sync == T) nobody else can change ->gp_state, so xxx_enter() sets
GP_PASSED lockless.

> > 	if (need_sync)
> > 		xxx->gp_state = GP_PENDING;
> > 	spin_unlock_irq(&xxx->xxx_lock);
> >
> > 	BUG_ON(need_wait && need_sync);
> >
> > 	if (need_sync) {
> > 		synchronize_sched();
> > 		xxx->gp_state = GP_PASSED;
> > 		wake_up_all(&xxx->gp_waitq);
> > 	} else if (need_wait) {
> > 		wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
>
> Suppose the wakeup is delayed until after the state has been updated
> back to GP_IDLE? Ah, presumably the non-zero ->gp_count prevents this.

Yes, exactly.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	unsigned long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 	} else {
> > 		xxx->cb_state = CB_IDLE;
> > 		xxx->gp_state = GP_IDLE;
> > 	}
>
> It took me a bit to work out the above. It looks like the intent is
> to have the last xxx_exit() put the state back to GP_IDLE, which appears
> to be the state in which readers can use a fastpath.

Yes, and we offload this work to an rcu callback so xxx_exit() doesn't
block.

The only complication is the next writer which does xxx_enter() after
xxx_exit(). If there are no other writers, the next xxx_exit() should do

	rcu_cancel(&xxx->cb_head);
	call_rcu_sched(&xxx->cb_head, cb_rcu_func);

to "extend" the gp, but since we do not have rcu_cancel() it simply sets
CB_REPLAY to instruct cb_rcu_func() to reschedule itself.

> This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
> there must be an xxx_exit() in our future.

Yes, but ->cb_state doesn't really matter if ->gp_count != 0 in
xxx_exit() or cb_rcu_func() (except it can't be CB_IDLE in
cb_rcu_func).

> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	if (!--xxx->gp_count) {
> > 		if (xxx->cb_state == CB_IDLE) {
> > 			xxx->cb_state = CB_PENDING;
> > 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 		} else if (xxx->cb_state == CB_PENDING) {
> > 			xxx->cb_state = CB_REPLAY;
> > 		}
> > 	}
> > 	spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> Then we also have something like this?
>
> bool
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Mon, Sep 30, 2013 at 02:59:42PM +0200, Peter Zijlstra wrote:
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.
>
> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a
> later GP.
>
> > 	} else {
> > 		xxx->cb_state = CB_IDLE;
> > 		xxx->gp_state = GP_IDLE;
>
> Nothing fancy happened and we're done.
>
> > 	}
> > 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> > }
> >
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	if (!--xxx->gp_count) {
> > 		if (xxx->cb_state == CB_IDLE) {
> > 			xxx->cb_state = CB_PENDING;
> > 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 		} else if (xxx->cb_state == CB_PENDING) {
> > 			xxx->cb_state = CB_REPLAY;
> > 		}
> > 	}
> > 	spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().
>
> Other than that; yes this makes sense if you care about write side
> performance and I think its solid.

Hmm, wait. I don't see how this is equivalent to:

  xxx_exit()
  {
	synchronize_sched();
	atomic_dec(&xxx->counter);
  }

For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
wouldn't we? Without that there's no guarantee the fast path readers
will have a MB to observe the write critical section, unless I'm
completely missing something obvious here.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> wouldn't we? Without that there's no guarantee the fast path readers
> will have a MB to observe the write critical section, unless I'm
> completely missing something obvious here.

Duh.. we should be looking at gp_state like Paul said.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On 09/30, Peter Zijlstra wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now).
>
> If you think the percpu_rwsem users can benefit sure.. So far its good
> I didn't go the percpu_rwsem route for it looks like we got something
> better at the end of it ;-)

I think you could simply improve percpu_rwsem instead. Once we add
task_struct->cpuhp_ctr, percpu_rwsem and get_online_cpus/hotplug_begin
become absolutely congruent.

OTOH, it would be simpler to change hotplug first, then copy-and-paste
the improvements into percpu_rwsem, then see if we can simply convert
cpu_hotplug_begin/end into percpu_down/up_write.

> Well, if we make percpu_rwsem the defacto container of the pattern and
> use that throughout, we'd have only a single implementation

Not sure. I think it can have other users. But even if not, please look
at struct sb_writers. Yes, I believe it makes sense to use percpu_rwsem
here, but note that it is actually an array of semaphores. I do not
think each element needs its own xxx_struct.

> and don't need the abstraction.

And even if struct percpu_rw_semaphore will be the only container of
xxx_struct, I think the code looks better and more understandable this
way, exactly because it adds the new abstraction layer. Performance-wise
this should be free.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.

Yes,

> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a
> later GP.

Exactly.

> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().

Oh I disagree. Even ignoring the fact I believe xxx_struct itself can
have more users (I can be wrong of course), I do think that
percpu_down_write_nonexclusive() makes sense (except "exclusive" should
be the argument of percpu_init_rwsem). And in fact the initial
implementation I sent didn't even have the exclusive mode.

Please look at uprobes (currently the only user). We do not really need
the global write-lock, we can do the per-uprobe locking. However, every
caller needs to block the percpu_down_read() callers (dup_mmap).

> Other than that; yes this makes sense if you care about write side
> performance and I think its solid.

Great ;)

Oleg.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On 09/30, Peter Zijlstra wrote:
>
> On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> > For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> > wouldn't we? Without that there's no guarantee the fast path readers
> > will have a MB to observe the write critical section, unless I'm
> > completely missing something obvious here.
>
> Duh.. we should be looking at gp_state like Paul said.

Yes, yes, that is why we have xxx_is_idle(). Its name is confusing even
ignoring "xxx".

OK, I'll try to invent the naming (but I'd like to hear suggestions ;)
and send the patch. I am going to add "exclusive" and rcu_domain/ops
later, currently percpu_rw_semaphore needs ->rw_sem anyway.

Oleg.
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Sun, 29 Sep 2013 20:36:34 +0200
Oleg Nesterov wrote:

> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern

Just so I'm clear to what you are trying to implement... This is to
handle the case (as Paul said) to see changes to state by RCU and back
again? That is, it isn't enough to see that the state changed to
something (like SLOW MODE), but we also need a way to see it change
back?

With get_online_cpus(), we need to see the state where it changed to
"performing hotplug" where holders need to go into the slow path, and
then also see the state change to "no longer performing hotplug" and the
holders now go back to fast path.

Is this the rationale for this email?

Thanks,

-- Steve
Re: [RFC] introduce synchronize_sched_{enter,exit}()
On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Hello.
>
> Paul, Peter, et al, could you review the code below?
>
> I am not sending the patch, I think it is simpler to read the code
> inline (just in case, I didn't try to compile it yet).
>
> It is functionally equivalent to
>
> 	struct xxx_struct {
> 		atomic_t counter;
> 	};
>
> 	static inline bool xxx_is_idle(struct xxx_struct *xxx)
> 	{
> 		return atomic_read(&xxx->counter) == 0;
> 	}
>
> 	static inline void xxx_enter(struct xxx_struct *xxx)
> 	{
> 		atomic_inc(&xxx->counter);
> 		synchronize_sched();
> 	}
>
> 	static inline void xxx_exit(struct xxx_struct *xxx)
> 	{
> 		synchronize_sched();
> 		atomic_dec(&xxx->counter);
> 	}

But there is nothing for synchronize_sched() to wait for in the above.
Presumably the caller of xxx_is_idle() is required to disable preemption
or be under rcu_read_lock_sched()?

> except: it records the state and synchronize_sched() is only called by
> xxx_enter() and only if necessary.
>
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
>
> 	writer:
> 		state = SLOW_MODE;
> 		synchronize_rcu/sched();
>
> 	reader:
> 		preempt_disable();	// or rcu_read_lock();
> 		if (state != SLOW_MODE)
> 			...
>
> is quite common.

And this does guarantee that by the time the writer's
synchronize_whatever() exits, all readers will know that
state==SLOW_MODE.

> Note:
> 	- This implementation allows multiple writers, and sometimes
> 	  this makes sense.

If each writer atomically incremented SLOW_MODE, did its update, then
atomically decremented it, sure. You could be more clever and avoid
unneeded synchronize_whatever() calls, but I would have to see a good
reason for doing so before recommending this.

OK, but you appear to be doing this below anyway. ;-)

> 	- But it's trivial to add "bool xxx->exclusive" set by xxx_init().
> 	  If it is true only one xxx_enter() is possible, other callers
> 	  should block until xxx_exit(). This is what percpu_down_write()
> 	  actually needs.

Agreed.

> 	- Probably it makes sense to add xxx->rcu_domain = RCU/SCHED/ETC.

Or just have pointers to the RCU functions in the xxx structure...

So you are trying to make something that abstracts the RCU-protected
state-change pattern? Or perhaps more accurately, the RCU-protected
state-change-and-back pattern?

> Do you think it is correct? Makes sense? (BUG_ON's are just comments).

... Maybe ... Please see below for commentary and a question.

							Thanx, Paul

> Oleg.
>
> // .h ---
>
> struct xxx_struct {
> 	int			gp_state;
>
> 	int			gp_count;
> 	wait_queue_head_t	gp_waitq;
>
> 	int			cb_state;
> 	struct rcu_head		cb_head;

	spinlock_t xxx_lock;  /* ? */

This spinlock might not make the big-system guys happy, but it appears
to be needed below.

> };
>
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> 	return !xxx->gp_state; /* GP_IDLE */
> }
>
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
>
> // .c ---
>
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>
> #define xxx_lock gp_waitq.lock
>
> void xxx_enter(struct xxx_struct *xxx)
> {
> 	bool need_wait, need_sync;
>
> 	spin_lock_irq(&xxx->xxx_lock);
> 	need_wait = xxx->gp_count++;
> 	need_sync = xxx->gp_state == GP_IDLE;

Suppose ->gp_state is GP_PASSED. It could transition to GP_IDLE at any
time, right?

> 	if (need_sync)
> 		xxx->gp_state = GP_PENDING;
> 	spin_unlock_irq(&xxx->xxx_lock);
>
> 	BUG_ON(need_wait && need_sync);
>
> 	if (need_sync) {
> 		synchronize_sched();
> 		xxx->gp_state = GP_PASSED;
> 		wake_up_all(&xxx->gp_waitq);
> 	} else if (need_wait) {
> 		wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);

Suppose the wakeup is delayed until after the state has been updated
back to GP_IDLE? Ah, presumably the non-zero ->gp_count prevents this.
Never mind!

> 	} else {
> 		BUG_ON(xxx->gp_state != GP_PASSED);
> 	}
> }
>
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> 	long flags;
>
> 	BUG_ON(xxx->gp_state != GP_PASSED);