Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-02 Thread Oleg Nesterov
On 09/30, Paul E. McKenney wrote:
>
> On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> > 
> > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > > 
> > > Let's take a "real life" example,
> > > 
> > > A = B = X = 0;
> > > P = Q = &A;
> > >
> > > CPU_0           CPU_1           CPU_2
> > >
> > > P = &B;         *P = 1;         if (X) {
> > >                 wmb();                  rmb();
> > >                 X = 1;                  BUG_ON(*P != 1 && *Q != 1);
> > >                                 }
> > >
> > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> > 
> > That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> > from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> > order for *P to assign B). So P=&B happened, from a pure time POV, before
> > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
> 
> Actually, CPU designers have to go quite a ways out of their way to
> prevent this BUG_ON from happening.  One way that it would happen
> naturally would be if the cache line containing P were owned by CPU 2,
> and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
> here is what could happen given careless or sadistic CPU designers:
> 
> o CPU 0 stores &B to P, but misses the cache, so puts the
>   result in the store buffer.  This means that only CPUs 0 and 1
>   can see it.
> 
> o CPU 1 fetches P, and sees &B, so stores a 1 to B.  Again,
>   this value for P is visible only to CPUs 0 and 1.
> 
> o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
>   in order.  But it does nothing about CPU 0's stores, nor about CPU
>   1's loads, for that matter (and the only reason that POWER ends
>   up working the way you would like is because wmb() turns into
>   "sync" rather than the "eieio" instruction that would have been
>   used for smp_wmb() -- which is maybe what Oleg was thinking of,
>   but happened to abbreviate.  If my analysis is buggy, Anton and
>   Paulus will no doubt correct me...)
> 
> o CPU 1 stores to X.
> 
> o CPU 2 loads X, and sees that the value is 1.
> 
> o CPU 2 does an rmb(), which orders its loads, but does nothing
>   about anyone else's loads or stores.
> 
> o CPU 2 fetches P from its cached copy, which still points to A,
>   which is still zero.  So the BUG_ON fires.
> 
> o Some time later, CPU 0 gets the cache line containing P from
>   CPU 2, and updates it from the value in the store buffer, but
>   too late...
> 
> Unfortunately, cache-coherence protocols don't care much about pure
> time...  It is possible to make a 16-CPU machine believe that a single
> variable has more than ten different values -at- -the- -same- -time-.

Davide, Paul, thank you very much! I've been wondering about this for a
long time; now I know the answer. Great.

Oleg.



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Paul E. McKenney
On Mon, Oct 01, 2007 at 03:09:16PM -0700, Davide Libenzi wrote:
> On Mon, 1 Oct 2007, Paul E. McKenney wrote:
> 
> > That would indeed be one approach that CPU designers could take to
> > avoid being careless or sadistic.  ;-)
> 
> That'd also be the easier (maybe the only) approach for them, from a silicon
> complexity POV. Distinguishing between different CPUs' stores once inside a
> shared store buffer would require tagging them in some way. That'd defeat
> most of the pros of having a shared store buffer ;)

Tagging requires but one bit per entry.  Depends on the workload -- if
lots of barriers, bursty stores and little sharing, tagging might win.
If lots of sharing, then your suggested approach might win.
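
To make the "one bit per entry" idea concrete, here is a minimal sketch in C
(a software model only -- no real hardware or kernel code is implied, and the
names sb_entry, wmb_drain, etc. are invented for illustration): each pending
store in the shared buffer carries a one-bit owner tag, and a write barrier
issued by one hardware thread drains only the entries carrying its own tag.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SB_ENTRIES 16

struct sb_entry {
	uintptr_t addr;		/* target address of the pending store */
	uint64_t  val;		/* value waiting to reach the cache */
	bool      valid;	/* entry in use? */
	uint8_t   owner;	/* one bit: which hardware thread issued it */
};

struct store_buffer {
	struct sb_entry e[SB_ENTRIES];
};

/* Stand-in for pushing one entry out to the coherent cache. */
static void commit_to_cache(struct sb_entry *ent)
{
	*(volatile uint64_t *)ent->addr = ent->val;
	ent->valid = false;
}

/*
 * Model of a wmb() issued by hardware thread "self": drain only the stores
 * tagged with self, leaving the sibling thread's stores buffered.
 */
static void wmb_drain(struct store_buffer *sb, uint8_t self)
{
	for (size_t i = 0; i < SB_ENTRIES; i++)
		if (sb->e[i].valid && sb->e[i].owner == self)
			commit_to_cache(&sb->e[i]);
}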

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Davide Libenzi
On Mon, 1 Oct 2007, Paul E. McKenney wrote:

> That would indeed be one approach that CPU designers could take to
> avoid being careless or sadistic.  ;-)

That'd also be the easier (maybe the only) approach for them, from a silicon
complexity POV. Distinguishing between different CPUs' stores once inside a
shared store buffer would require tagging them in some way. That'd defeat
most of the pros of having a shared store buffer ;)



- Davide




Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Paul E. McKenney
On Mon, Oct 01, 2007 at 11:44:25AM -0700, Davide Libenzi wrote:
> On Sun, 30 Sep 2007, Paul E. McKenney wrote:
> 
> > On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> > > 
> > > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > > definition, but what about CPU 0's stores (which could be seen by CPU 
> > > > 1)?
> > > > 
> > > > Let's take a "real life" example,
> > > > 
> > > > A = B = X = 0;
> > > > P = Q = &A;
> > > >
> > > > CPU_0           CPU_1           CPU_2
> > > >
> > > > P = &B;         *P = 1;         if (X) {
> > > >                 wmb();                  rmb();
> > > >                 X = 1;                  BUG_ON(*P != 1 && *Q != 1);
> > > >                                 }
> > > >
> > > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> > > 
> > > That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> > > from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> > > order for *P to assign B). So P=&B happened, from a pure time POV, before
> > > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
> > 
> > Actually, CPU designers have to go quite a ways out of their way to
> > prevent this BUG_ON from happening.  One way that it would happen
> > naturally would be if the cache line containing P were owned by CPU 2,
> > and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
> > here is what could happen given careless or sadistic CPU designers:
> 
> Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted
> that it was a cross-CPU sync point (ala read_barrier_depends). If that's a
> local CPU load ordering only, things are different, clearly. But ...
> 
> > o   CPU 0 stores &B to P, but misses the cache, so puts the
> > result in the store buffer.  This means that only CPUs 0 and 1
> > can see it.
> > 
> > o   CPU 1 fetches P, and sees &B, so stores a 1 to B.  Again,
> > this value for P is visible only to CPUs 0 and 1.
> > 
> > o   CPU 1 executes a wmb(), which forces CPU 1's stores to happen
> > in order.  But it does nothing about CPU 0's stores, nor about CPU
> > 1's loads, for that matter (and the only reason that POWER ends
> > up working the way you would like is because wmb() turns into
> > "sync" rather than the "eieio" instruction that would have been
> > used for smp_wmb() -- which is maybe what Oleg was thinking of,
> > but happened to abbreviate.  If my analysis is buggy, Anton and
> > Paulus will no doubt correct me...)
> 
> If a store buffer is shared between CPU_0 and CPU_1, it is very likely 
> that a sync done on CPU_1 is going to sync even CPU_0 stores that are 
> held in the buffer at the time of CPU_1's sync.

That would indeed be one approach that CPU designers could take to
avoid being careless or sadistic.  ;-)

Another approach would be to make CPU 1 refrain from snooping CPU 0's
entries in the shared store queue.
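
A minimal sketch of that alternative (invented names, no particular hardware
implied): when a load from hardware thread "self" probes the shared store
buffer, it forwards a value only from entries it owns, so it never observes
the sibling thread's not-yet-globally-visible stores.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sb_entry {
	uintptr_t addr;		/* target address of the pending store */
	uint64_t  val;		/* value waiting to reach the cache */
	bool      valid;	/* entry in use? */
	uint8_t   owner;	/* which hardware thread issued the store */
};

/* Returns true and fills *val only when one of self's own entries matches. */
static bool sb_forward(const struct sb_entry *sb, size_t n, uint8_t self,
		       uintptr_t addr, uint64_t *val)
{
	for (size_t i = 0; i < n; i++) {
		if (sb[i].valid && sb[i].owner == self && sb[i].addr == addr) {
			*val = sb[i].val;	/* hit in our own pending stores */
			return true;
		}
	}
	return false;		/* miss: fall through to the coherent cache */
}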

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Davide Libenzi
On Sun, 30 Sep 2007, Paul E. McKenney wrote:

> On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> > 
> > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > > 
> > > Let's take a "real life" example,
> > > 
> > > A = B = X = 0;
> > > P = Q = &A;
> > >
> > > CPU_0           CPU_1           CPU_2
> > >
> > > P = &B;         *P = 1;         if (X) {
> > >                 wmb();                  rmb();
> > >                 X = 1;                  BUG_ON(*P != 1 && *Q != 1);
> > >                                 }
> > >
> > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> > 
> > That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> > from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> > order for *P to assign B). So P=&B happened, from a pure time POV, before
> > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
> 
> Actually, CPU designers have to go quite a ways out of their way to
> prevent this BUG_ON from happening.  One way that it would happen
> naturally would be if the cache line containing P were owned by CPU 2,
> and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
> here is what could happen given careless or sadistic CPU designers:

Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted
that it was a cross-CPU sync point (ala read_barrier_depends). If that's a
local CPU load ordering only, things are different, clearly. But ...



> 
> o CPU 0 stores &B to P, but misses the cache, so puts the
>   result in the store buffer.  This means that only CPUs 0 and 1
>   can see it.
> 
> o CPU 1 fetches P, and sees &B, so stores a 1 to B.  Again,
>   this value for P is visible only to CPUs 0 and 1.
> 
> o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
>   in order.  But it does nothing about CPU 0's stores, nor about CPU
>   1's loads, for that matter (and the only reason that POWER ends
>   up working the way you would like is because wmb() turns into
>   "sync" rather than the "eieio" instruction that would have been
>   used for smp_wmb() -- which is maybe what Oleg was thinking of,
>   but happened to abbreviate.  If my analysis is buggy, Anton and
>   Paulus will no doubt correct me...)

If a store buffer is shared between CPU_0 and CPU_1, it is very likely 
that a sync done on CPU_1 is going to sync even CPU_0 stores that are 
held in the buffer at the time of CPU_1's sync.



- Davide




Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Paul E. McKenney
On Sun, Sep 30, 2007 at 08:31:02PM +0400, Oleg Nesterov wrote:
> On 09/28, Paul E. McKenney wrote:
> >
> > On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> > > Ah, I was confused by the comment,
> > > 
> > >   smp_mb();  /* Don't call for memory barriers before we see zero. */
> > >^^
> > > So, in fact, we need this barrier to make sure that _other_ CPUs see these
> > > changes in order, thanks. Of course, _we_ already saw zero.
> > 
> > Fair point!
> > 
> > Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
> > rcu_flipctrs sum to zero" or some such?
> > 
> > > But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> > > is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> > > it doesn't really need the barrier? (besides, it is always called under
> > > fliplock).
> > 
> > The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
> > so we cannot necessarily rely on locking to trivialize all of this.
> 
> Yes, but still I think this mb() is not necessary. Because we don't need
> the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
> CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor
> issue, please forget.

Will do!  ;-)

> > > OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> > > store buffer, the situation is simple, we can replace "CPU 0 stores" with
> > > "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
> > > 
> > > Suppose that CPU 1 does
> > > 
> > >   wmb();
> > >   B = 0
> > > 
> > > Can we assume that CPU 2 doing
> > > 
> > >   if (B == 0) {
> > >   rmb();
> > > 
> > > must see all invalidations from CPU 0 which were seen by CPU 1 before 
> > > wmb() ?
> > 
> > Yes.  CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
> > reads following its rmb() must therefore see all of CPU 1's stores
> > preceding the wmb().
> 
> Ah, but I asked the different question. We must see CPU 1's stores by
> definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> 
> Let's take a "real life" example,
> 
> A = B = X = 0;
> P = Q = &A;
>
> CPU_0           CPU_1           CPU_2
>
> P = &B;         *P = 1;         if (X) {
>                 wmb();                  rmb();
>                 X = 1;                  BUG_ON(*P != 1 && *Q != 1);
>                                 }
>
> So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

It depends.  ;-)

o   Itanium: because both wmb() and rmb() map to the "mf"
instruction, and because "mf" instructions map to a
single global order, the BUG_ON cannot happen.  (But
I could easily be mistaken -- I cannot call myself an
Itanium memory-ordering expert.)  See:

ftp://download.intel.com/design/Itanium/Downloads/25142901.pdf

for the official story.

o   POWER: because wmb() maps to the "sync" instruction,
cumulativity applies, so that any instruction provably
following "X = 1" will see "P = &B" if the "*P = 1"
statement saw it.  So the BUG_ON cannot happen.

o   i386: memory ordering respects transitive visibility,
which seems to be similar to POWER's cumulativity
(http://developer.intel.com/products/processor/manuals/318147.pdf),
so the BUG_ON cannot happen.

o   x86_64: same as i386.

o   s390: the basic memory-ordering model is tight enough that the
BUG_ON cannot happen.  (If I am confused about this, the s390
guys will not be shy about correcting me!)

o   ARM: beats the heck out of me.
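
A user-space approximation of Oleg's example, offered purely as a sketch: C11
atomics and pthreads, with atomic_thread_fence() standing in for wmb()/rmb().
The C11 memory model is not the kernel's model and this proves nothing about
any particular CPU; it just makes the three roles concrete enough to run in a
loop while experimenting.  All names here are invented.

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int A, B, X;
static _Atomic(atomic_int *) P = &A;	/* Q stays equal to &A throughout */

static void *cpu0(void *arg)
{
	atomic_store_explicit(&P, &B, memory_order_relaxed);	/* P = &B;  */
	return NULL;
}

static void *cpu1(void *arg)
{
	atomic_int *p = atomic_load_explicit(&P, memory_order_relaxed);
	atomic_store_explicit(p, 1, memory_order_relaxed);	/* *P = 1;  */
	atomic_thread_fence(memory_order_release);		/* ~ wmb()  */
	atomic_store_explicit(&X, 1, memory_order_relaxed);	/* X = 1;   */
	return NULL;
}

static void *cpu2(void *arg)
{
	if (atomic_load_explicit(&X, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_acquire);	/* ~ rmb()  */
		atomic_int *p = atomic_load_explicit(&P, memory_order_relaxed);
		int pv = atomic_load_explicit(p, memory_order_relaxed);
		int qv = atomic_load_explicit(&A, memory_order_relaxed); /* *Q */
		assert(!(pv != 1 && qv != 1));		/* the BUG_ON above */
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1, t2;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_create(&t2, NULL, cpu2, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}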

> > The other approach would be to simply have a separate thread for this
> > purpose.  Batching would amortize the overhead (a single trip around the
> > CPUs could satisfy an arbitrarily large number of synchronize_sched()
> > requests).
> 
> Yes, this way we don't need to uglify migration_thread(). OTOH, we need
> another kthread ;)

True enough!!!

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Paul E. McKenney
On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> 
> > Ah, but I asked the different question. We must see CPU 1's stores by
> > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > 
> > Let's take a "real life" example,
> > 
> > A = B = X = 0;
> > P = Q = &A;
> >
> > CPU_0           CPU_1           CPU_2
> >
> > P = &B;         *P = 1;         if (X) {
> >                 wmb();                  rmb();
> >                 X = 1;                  BUG_ON(*P != 1 && *Q != 1);
> >                                 }
> >
> > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> 
> That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> order for *P to assign B). So P=&B happened, from a pure time POV, before
> the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.

Actually, CPU designers have to go quite a ways out of their way to
prevent this BUG_ON from happening.  One way that it would happen
naturally would be if the cache line containing P were owned by CPU 2,
and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
here is what could happen given careless or sadistic CPU designers:

o   CPU 0 stores &B to P, but misses the cache, so puts the
result in the store buffer.  This means that only CPUs 0 and 1
can see it.

o   CPU 1 fetches P, and sees &B, so stores a 1 to B.  Again,
this value for P is visible only to CPUs 0 and 1.

o   CPU 1 executes a wmb(), which forces CPU 1's stores to happen
in order.  But it does nothing about CPU 0's stores, nor about CPU
1's loads, for that matter (and the only reason that POWER ends
up working the way you would like is because wmb() turns into
"sync" rather than the "eieio" instruction that would have been
used for smp_wmb() -- which is maybe what Oleg was thinking of,
but happened to abbreviate.  If my analysis is buggy, Anton and
Paulus will no doubt correct me...)

o   CPU 1 stores to X.

o   CPU 2 loads X, and sees that the value is 1.

o   CPU 2 does an rmb(), which orders its loads, but does nothing
about anyone else's loads or stores.

o   CPU 2 fetches P from its cached copy, which still points to A,
which is still zero.  So the BUG_ON fires.

o   Some time later, CPU 0 gets the cache line containing P from
CPU 2, and updates it from the value in the store buffer, but
too late...

Unfortunately, cache-coherence protocols don't care much about pure
time...  It is possible to make a 16-CPU machine believe that a single
variable has more than ten different values -at- -the- -same- -time-.
This is easy to do -- have all the CPUs store different values to the
same variable at the same time, then reload, collecting timestamps
between each pair of operations.  On a large SMP, the values will sit
in the store buffers for many hundreds of nanoseconds, perhaps even
several microseconds, while the cache line containing the variable
being stored to shuttles around among the CPUs.  ;-)
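
A user-space sketch of that experiment (it illustrates the idea rather than
bare-metal behaviour, where the stores and reloads would be timed with
interrupts off; the names and parameters below are invented): every thread
stores its own value to one shared variable and then repeatedly reloads it
with a timestamp.  Merging the per-thread logs afterwards typically shows
intervals during which different threads disagreed about the variable's value.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 16
#define NSAMPLES 64

static atomic_int shared_var;

struct sample {
	long long when_ns;	/* CLOCK_MONOTONIC timestamp, in ns */
	int seen;		/* value observed at that instant   */
};

static struct sample logbuf[NTHREADS][NSAMPLES];

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *hammer(void *arg)
{
	int id = (int)(long)arg;

	/* Everyone stores "their" value at roughly the same time... */
	atomic_store_explicit(&shared_var, id, memory_order_relaxed);
	/* ...then watches what the variable appears to contain. */
	for (int i = 0; i < NSAMPLES; i++) {
		logbuf[id][i].when_ns = now_ns();
		logbuf[id][i].seen =
			atomic_load_explicit(&shared_var, memory_order_relaxed);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, hammer, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	/*
	 * Post-processing (omitted): sort all samples by when_ns and look for
	 * overlapping windows where different threads report different values.
	 */
	printf("final value: %d\n",
	       atomic_load_explicit(&shared_var, memory_order_relaxed));
	return 0;
}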

Thanx, Paul



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-30 Thread Davide Libenzi
On Sun, 30 Sep 2007, Oleg Nesterov wrote:

> Ah, but I asked the different question. We must see CPU 1's stores by
> definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> 
> Let's take a "real life" example,
> 
> A = B = X = 0;
> P = Q = &A;
>
> CPU_0           CPU_1           CPU_2
>
> P = &B;         *P = 1;         if (X) {
>                 wmb();                  rmb();
>                 X = 1;                  BUG_ON(*P != 1 && *Q != 1);
>                                 }
>
> So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

That can't be. CPU_2 sees X=1, that happened after (or same time at most -
from a cache inv. POV) to *P=1, that must have happened after P=&B (in
order for *P to assign B). So P=&B happened, from a pure time POV, before
the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.



- Davide




Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-30 Thread Oleg Nesterov
On 09/28, Paul E. McKenney wrote:
>
> On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> > Ah, I was confused by the comment,
> > 
> > smp_mb();  /* Don't call for memory barriers before we see zero. */
> >  ^^
> > So, in fact, we need this barrier to make sure that _other_ CPUs see these
> > changes in order, thanks. Of course, _we_ already saw zero.
> 
> Fair point!
> 
> Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
> rcu_flipctrs sum to zero" or some such?
> 
> > But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> > is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> > it doesn't really need the barrier? (besides, it is always called under
> > fliplock).
> 
> The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
> so we cannot necessarily rely on locking to trivialize all of this.

Yes, but still I think this mb() is not necessary. Because we don't need
the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor issue,
please forget.

> > OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> > store buffer, the situation is simple, we can replace "CPU 0 stores" with
> > "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
> > 
> > Suppose that CPU 1 does
> > 
> > wmb();
> > B = 0
> > 
> > Can we assume that CPU 2 doing
> > 
> > if (B == 0) {
> > rmb();
> > 
> > must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() 
> > ?
> 
> Yes.  CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
> reads following its rmb() must therefore see all of CPU 1's stores
> preceding the wmb().

Ah, but I asked the different question. We must see CPU 1's stores by
definition, but what about CPU 0's stores (which could be seen by CPU 1)?

Let's take a "real life" example,

A = B = X = 0;
P = Q = &A;

CPU_0           CPU_1           CPU_2

P = &B;         *P = 1;         if (X) {
                wmb();                  rmb();
                X = 1;                  BUG_ON(*P != 1 && *Q != 1);
                                }

So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

> The other approach would be to simply have a separate thread for this
> purpose.  Batching would amortize the overhead (a single trip around the
> CPUs could satisfy an arbitrarily large number of synchronize_sched()
> requests).

Yes, this way we don't need to uglify migration_thread(). OTOH, we need
another kthread ;)

Oleg.




Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-28 Thread Paul E. McKenney
On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> On 09/27, Paul E. McKenney wrote:
> >
> > On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> > > 
> > > Yes, yes, I see now. We really need this barriers, except I think
> > > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
> > > 
> > >   // rcu_try_flip_waitzero()
> > >   if (A == 0) {
> > >   mb();
> > >   B = 0;
> > >   }
> > > 
> > > Do we really need the mb() in this case? How it is possible that STORE
> > > goes before LOAD? "Obviously", the LOAD should be completed first, no?
> > 
> > Suppose that A was most recently stored by a CPU that shares a store
> > buffer with this CPU.  Then it is possible that some other CPU sees
> > the store to B as happening before the store that "A==0" above is
> > loading from.  This other CPU would naturally conclude that the store
> > to B must have happened before the load from A.
> 
> Ah, I was confused by the comment,
> 
>   smp_mb();  /* Don't call for memory barriers before we see zero. */
>^^
> So, in fact, we need this barrier to make sure that _other_ CPUs see these
> changes in order, thanks. Of course, _we_ already saw zero.

Fair point!

Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
rcu_flipctrs sum to zero" or some such?

> But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> it doesn't really need the barrier? (besides, it is always called under
> fliplock).

The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
so we cannot necessarily rely on locking to trivialize all of this.

> > In more detail, suppose that CPU 0 and 1 share a store buffer, and
> > that CPU 2 and 3 share a second store buffer.  This happens naturally
> > if CPUs 0 and 1 are really just different hardware threads within a
> > single core.
> > 
> > So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
> > and that the cacheline for B is initially owned by CPUs 0 and 1.
> > Then consider the following sequence of events:
> > 
> > o   CPU 0 stores zero to A.  This is a cache miss, so the new value
> > for A is placed in CPU 0's and 1's store buffer.
> > 
> > o   CPU 1 executes the above code, first loading A.  It sees
> > the value of A==0 in the store buffer, and therefore
> > stores zero to B, which hits in the cache.  (I am assuming
> > that we left out the mb() above).
> > 
> > o   CPU 2 loads from B, which misses the cache, and gets the
> > value that CPU 1 stored.  Suppose it checks the value,
> > and based on this check, loads A.  The old value of A might
> > still be in cache, which would lead CPU 2 to conclude that
> > the store to B by CPU 1 must have happened before the store
> > to A by CPU 0.
> > 
> > Memory barriers would prevent this confusion.
> 
> Thanks a lot!!! This fills another gap in my understanding.

Glad it helped!  ;-)

> OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> store buffer, the situation is simple, we can replace "CPU 0 stores" with
> "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
> 
> Suppose that CPU 1 does
> 
>   wmb();
>   B = 0
> 
> Can we assume that CPU 2 doing
> 
>   if (B == 0) {
>   rmb();
> 
> must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?

Yes.  CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
reads following its rmb() must therefore see all of CPU 1's stores
preceding the wmb().
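
That guarantee is the classic message-passing pattern.  A minimal user-space
sketch (C11 fences approximating the kernel's wmb()/rmb(); all names here are
invented): whatever the writer stored before its write barrier must be
visible to a reader that first sees a later store by the writer and then
executes its read barrier.

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int data;		/* "all of CPU 1's stores preceding the wmb()" */
static atomic_int flag;		/* the later store that CPU 2 happens to see   */

static void *writer(void *arg)		/* plays the role of CPU 1 */
{
	atomic_store_explicit(&data, 42, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* ~ wmb() */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)		/* plays the role of CPU 2 */
{
	if (atomic_load_explicit(&flag, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_acquire);	/* ~ rmb() */
		/* Having seen flag == 1, we must also see data == 42. */
		assert(atomic_load_explicit(&data, memory_order_relaxed) == 42);
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}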

> > > > > If this is possible, can't we move the code doing 
> > > > > "s/rcu_flipped/rcu_flip_seen/"
> > > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> > > > 
> > > > I believe that we cannot safely do this.  The 
> > > > rcu_flipped-to-rcu_flip_seen
> > > > transition has to be synchronized to the moving of the callbacks --
> > > > either that or we need more GP_STAGES.
> > > 
> > > Hmm. Still can't understand.
> > 
> > Callbacks would be able to be injected into a grace period after it
> > started.
> 
> Yes, but this is _exactly_ what the current code does in the scenario below,
> 
> > Or are you arguing that as long as interrupts remain disabled between
> > the two events, no harm done?
> 
> no,
> 
> > > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> > > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> > > (say, another CPU does rcu_try_flip_idle() in between).
> > > 
> > > We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> > > __rcu_advance_callbacks() again and moves the callbacks.
> > > 
> > > So, it is still possible that "move callbacks" and "ack the flip" happen
> > > out of order. But why this is bad?
> 
> Look, what happens is
> 
>   // call_rcu()
>   

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-28 Thread Oleg Nesterov
On 09/27, Paul E. McKenney wrote:
>
> On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> > 
> > Yes, yes, I see now. We really need this barriers, except I think
> > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
> > 
> > // rcu_try_flip_waitzero()
> > if (A == 0) {
> > mb();
> > B = 0;
> > }
> > 
> > Do we really need the mb() in this case? How it is possible that STORE
> > goes before LOAD? "Obviously", the LOAD should be completed first, no?
> 
> Suppose that A was most recently stored by a CPU that shares a store
> buffer with this CPU.  Then it is possible that some other CPU sees
> the store to B as happening before the store that "A==0" above is
> loading from.  This other CPU would naturally conclude that the store
> to B must have happened before the load from A.

Ah, I was confused by the comment,

smp_mb();  /* Don't call for memory barriers before we see zero. */
 ^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.

But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called under
fliplock).

> In more detail, suppose that CPU 0 and 1 share a store buffer, and
> that CPU 2 and 3 share a second store buffer.  This happens naturally
> if CPUs 0 and 1 are really just different hardware threads within a
> single core.
> 
> So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
> and that the cacheline for B is initially owned by CPUs 0 and 1.
> Then consider the following sequence of events:
> 
> o CPU 0 stores zero to A.  This is a cache miss, so the new value
>   for A is placed in CPU 0's and 1's store buffer.
> 
> o CPU 1 executes the above code, first loading A.  It sees
>   the value of A==0 in the store buffer, and therefore
>   stores zero to B, which hits in the cache.  (I am assuming
>   that we left out the mb() above).
> 
> o CPU 2 loads from B, which misses the cache, and gets the
>   value that CPU 1 stored.  Suppose it checks the value,
>   and based on this check, loads A.  The old value of A might
>   still be in cache, which would lead CPU 2 to conclude that
>   the store to B by CPU 1 must have happened before the store
>   to A by CPU 0.
> 
> Memory barriers would prevent this confusion.

Thanks a lot!!! This fills another gap in my understanding.

OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace "CPU 0 stores" with
"CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?

Suppose that CPU 1 does

wmb();
B = 0

Can we assume that CPU 2 doing

if (B == 0) {
rmb();

must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?

> > > > If this is possible, can't we move the code doing 
> > > > "s/rcu_flipped/rcu_flip_seen/"
> > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> > > 
> > > I believe that we cannot safely do this.  The rcu_flipped-to-rcu_flip_seen
> > > transition has to be synchronized to the moving of the callbacks --
> > > either that or we need more GP_STAGES.
> > 
> > Hmm. Still can't understand.
> 
> Callbacks would be able to be injected into a grace period after it
> started.

Yes, but this is _exactly_ what the current code does in the scenario below,

> Or are you arguing that as long as interrupts remain disabled between
> the two events, no harm done?

no,

> > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> > (say, another CPU does rcu_try_flip_idle() in between).
> > 
> > We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> > __rcu_advance_callbacks() again and moves the callbacks.
> > 
> > So, it is still possible that "move callbacks" and "ack the flip" happen
> > out of order. But why this is bad?

Look, what happens is

// call_rcu()
rcu_flip_flag = rcu_flipped
insert the new callback
// timer irq
move the callbacks (the new one goes to wait[0])

But I still can't understand why this is bad,

> > This can't "speedup" the moving of our callbacks from next to done lists.
> > Yes, RCU state machine can switch to the next state/stage, but this looks
> > safe to me.

Before this callback can be flushed, we need two further
rdp->completed != rcu_ctrlblk.completed events, so we can't miss an
rcu_read_lock() section, no?

> > Help!

Please :)

> > if (rcu_ctrlblk.completed == rdp->completed)
> > rcu_try_flip();
> > 
> > Could you clarify the check above? Afaics this is just an optimization:
> > technically it is correct to rcu_try_flip() at any time, even if ->completed
> > are not synchronized. Most probably in that case 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-28 Thread Oleg Nesterov
On 09/27, Paul E. McKenney wrote:

 On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
  
  Yes, yes, I see now. We really need this barriers, except I think
  rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
  
  // rcu_try_flip_waitzero()
  if (A == 0) {
  mb();
  B == 0;
  }
  
  Do we really need the mb() in this case? How it is possible that STORE
  goes before LOAD? Obviously, the LOAD should be completed first, no?
 
 Suppose that A was most recently stored by a CPU that shares a store
 buffer with this CPU.  Then it is possible that some other CPU sees
 the store to B as happening before the store that A==0 above is
 loading from.  This other CPU would naturally conclude that the store
 to B must have happened before the load from A.

Ah, I was confused by the comment,

smp_mb();  /* Don't call for memory barriers before we see zero. */
 ^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.

But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the non-local per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called under
fliplock).

 In more detail, suppose that CPU 0 and 1 share a store buffer, and
 that CPU 2 and 3 share a second store buffer.  This happens naturally
 if CPUs 0 and 1 are really just different hardware threads within a
 single core.
 
 So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
 and that the cacheline for B is initially owned by CPUs 0 and 1.
 Then consider the following sequence of events:
 
 o CPU 0 stores zero to A.  This is a cache miss, so the new value
   for A is placed in CPU 0's and 1's store buffer.
 
 o CPU 1 executes the above code, first loading A.  It sees
   the value of A==0 in the store buffer, and therefore
   stores zero to B, which hits in the cache.  (I am assuming
   that we left out the mb() above).
 
 o CPU 2 loads from B, which misses the cache, and gets the
   value that CPU 1 stored.  Suppose it checks the value,
   and based on this check, loads A.  The old value of A might
   still be in cache, which would lead CPU 2 to conclude that
   the store to B by CPU 1 must have happened before the store
   to A by CPU 0.
 
 Memory barriers would prevent this confusion.

Thanks a lot!!! This fills another gap in my understanding.

OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace CPU 0 stores with
CPU 1 stores. But what if CPU 0 is equally far from CPUs 1 and 2?

Suppose that CPU 1 does

wmb();
B = 0

Can we assume that CPU 2 doing

if (B == 0) {
rmb();

must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?

If this is possible, can't we move the code doing 
s/rcu_flipped/rcu_flip_seen/
from __rcu_advance_callbacks() to rcu_check_mb() to unify the acks ?
   
   I believe that we cannot safely do this.  The rcu_flipped-to-rcu_flip_seen
   transition has to be synchronized to the moving of the callbacks --
   either that or we need more GP_STAGES.
  
  Hmm. Still can't understand.
 
 Callbacks would be able to be injected into a grace period after it
 started.

Yes, but this is _exactly_ what the current code does in the scenario below,

 Or are you arguing that as long as interrupts remain disabled between
 the two events, no harm done?

no,

  Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
  rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
  (say, another CPU does rcu_try_flip_idle() in between).
  
  We ack the flip, call_rcu() enables irqs, the timer interrupt calls
  __rcu_advance_callbacks() again and moves the callbacks.
  
  So, it is still possible that move callbacks and ack the flip happen
  out of order. But why this is bad?

Look, what happens is

// call_rcu()
rcu_flip_flag = rcu_flipped
insert the new callback
// timer irq
move the callbacks (the new one goes to wait[0])

But I still can't understand why this is bad,

  This can't speedup the moving of our callbacks from next to done lists.
  Yes, RCU state machine can switch to the next state/stage, but this looks
  safe to me.

Before this callback is flushed, we need 2 more
rdp->completed != rcu_ctrlblk.completed events; we can't miss an
rcu_read_lock() section, no?

  Help!

Please :)

  if (rcu_ctrlblk.completed == rdp->completed)
  rcu_try_flip();
  
  Could you clarify the check above? Afaics this is just an optimization,
  technically it is correct to rcu_try_flip() at any time, even if ->completed
  are not synchronized. Most probably in that case 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-28 Thread Paul E. McKenney
On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
 On 09/27, Paul E. McKenney wrote:
 
  On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
   
   Yes, yes, I see now. We really need these barriers, except I think
   rcu_try_flip_idle() can use wmb. However, I have a bit of an off-topic question,
   
 // rcu_try_flip_waitzero()
 if (A == 0) {
 mb();
 B = 0;
 }
   
   Do we really need the mb() in this case? How it is possible that STORE
   goes before LOAD? Obviously, the LOAD should be completed first, no?
  
  Suppose that A was most recently stored by a CPU that shares a store
  buffer with this CPU.  Then it is possible that some other CPU sees
  the store to B as happening before the store that A==0 above is
  loading from.  This other CPU would naturally conclude that the store
  to B must have happened before the load from A.
 
 Ah, I was confused by the comment,
 
   smp_mb();  /* Don't call for memory barriers before we see zero. */
^^
 So, in fact, we need this barrier to make sure that _other_ CPUs see these
 changes in order, thanks. Of course, _we_ already saw zero.

Fair point!

Perhaps: Ensure that all CPUs see their rcu_mb_flag -after- the
rcu_flipctrs sum to zero or some such?

 But in that particular case this doesn't matter, rcu_try_flip_waitzero()
 is the only function which reads the non-local per_cpu(rcu_flipctr), so
 it doesn't really need the barrier? (besides, it is always called under
 fliplock).

The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
so we cannot necessarily rely on locking to trivialize all of this.

  In more detail, suppose that CPU 0 and 1 share a store buffer, and
  that CPU 2 and 3 share a second store buffer.  This happens naturally
  if CPUs 0 and 1 are really just different hardware threads within a
  single core.
  
  So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
  and that the cacheline for B is initially owned by CPUs 0 and 1.
  Then consider the following sequence of events:
  
  o   CPU 0 stores zero to A.  This is a cache miss, so the new value
  for A is placed in CPU 0's and 1's store buffer.
  
  o   CPU 1 executes the above code, first loading A.  It sees
  the value of A==0 in the store buffer, and therefore
  stores zero to B, which hits in the cache.  (I am assuming
  that we left out the mb() above).
  
  o   CPU 2 loads from B, which misses the cache, and gets the
  value that CPU 1 stored.  Suppose it checks the value,
  and based on this check, loads A.  The old value of A might
  still be in cache, which would lead CPU 2 to conclude that
  the store to B by CPU 1 must have happened before the store
  to A by CPU 0.
  
  Memory barriers would prevent this confusion.
 
 Thanks a lot!!! This fills another gap in my understanding.

Glad it helped!  ;-)

 OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
 store buffer, the situation is simple, we can replace CPU 0 stores with
 CPU 1 stores. But what if CPU 0 is equally far from CPUs 1 and 2?
 
 Suppose that CPU 1 does
 
   wmb();
   B = 0
 
 Can we assume that CPU 2 doing
 
   if (B == 0) {
   rmb();
 
 must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?

Yes.  CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
reads following its rmb() must therefore see all of CPU 1's stores
preceding the wmb().
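
(A rough userspace analogue of that wmb()/rmb() pairing, for reference --
hypothetical code, with C11 fences standing in for the kernel barriers; it
only models the part of the guarantee stated above, namely that CPU 1's own
pre-wmb() stores become visible once CPU 2 has seen the post-wmb() store:)

	#include <stdatomic.h>
	#include <pthread.h>
	#include <assert.h>

	static int data;			/* a store CPU 1 makes before its wmb() */
	static atomic_int B = 1;		/* the flag CPU 2 tests; starts nonzero */

	static void *cpu1(void *arg)
	{
		data = 42;
		atomic_thread_fence(memory_order_release);	/* ~ wmb() */
		atomic_store_explicit(&B, 0, memory_order_relaxed);
		return NULL;
	}

	static void *cpu2(void *arg)
	{
		if (atomic_load_explicit(&B, memory_order_relaxed) == 0) {
			atomic_thread_fence(memory_order_acquire);	/* ~ rmb() */
			assert(data == 42);	/* must see the pre-wmb() store */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;
		pthread_create(&t1, NULL, cpu1, NULL);
		pthread_create(&t2, NULL, cpu2, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		return 0;
	}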

 If this is possible, can't we move the code doing 
 s/rcu_flipped/rcu_flip_seen/
 from __rcu_advance_callbacks() to rcu_check_mb() to unify the acks ?

I believe that we cannot safely do this.  The 
rcu_flipped-to-rcu_flip_seen
transition has to be synchronized to the moving of the callbacks --
either that or we need more GP_STAGES.
   
   Hmm. Still can't understand.
  
  Callbacks would be able to be injected into a grace period after it
  started.
 
 Yes, but this is _exactly_ what the current code does in the scenario below,
 
  Or are you arguing that as long as interrupts remain disabled between
  the two events, no harm done?
 
 no,
 
   Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
    rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
   (say, another CPU does rcu_try_flip_idle() in between).
   
   We ack the flip, call_rcu() enables irqs, the timer interrupt calls
   __rcu_advance_callbacks() again and moves the callbacks.
   
   So, it is still possible that move callbacks and ack the flip happen
   out of order. But why this is bad?
 
 Look, what happens is
 
   // call_rcu()
   rcu_flip_flag = rcu_flipped
   insert the new callback
   // timer irq
   move the callbacks (the new one goes to wait[0])
 
 But I still can't understand why this is bad,
 
   This can't speedup the moving of our callbacks from 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-27 Thread Paul E. McKenney
On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> On 09/23, Paul E. McKenney wrote:
> >
> > On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
> > > Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and 
> > > rcu_mb_flag?
> > 
> > Looks like it to me, thank you for the tip!
> > 
> > Hmmm...  Why not the same for rcu_data?  I guess because there is
> > very little sharing?  The only example thus far of sharing would be
> > if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
> > critical section were preempted and then started on some other CPU),
> 
> even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared",
> we remember the index, but not the CPU, we always modify the "local" data.

Ah, good point, and good reason to keep rcu_flipctr separate.

> > > OTOH, I can't understand how it works. With or without local_irq_save(),
> > > another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
> > > at any time, we can see the old value... rcu_try_flip_waitzero() can miss 
> > > us?
> > > 
> > > OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
> > > 0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
> > > my understanding is correct. Apart from this why GP_STAGES > 1 ???
> > 
> > This is indeed one reason.
> 
> Thanks a lot! Actually, this explains most of my questions. I was greatly
> confused even before I started to read the code. Looking at rcu_flipctr[2]
> I wrongly assumed that these 2 counters "belong" to different GPs. When
> I started to understand the reality, I was confused by GP_STAGES == 4 ;)

;-)

> > > > +* Disable local interrupts to prevent the grace-period
> > > > +* detection state machine from seeing us half-done.
> > > > +* NMIs can still occur, of course, and might themselves
> > > > +* contain rcu_read_lock().
> > > > +*/
> > > > +
> > > > +   local_irq_save(oldirq);
> > > 
> > > Could you please tell more, why do we need this cli?
> > > 
> > > It can't "protect" rcu_ctrlblk.completed, and the only change which 
> > > affects
> > > the state machine is rcu_flipctr[idx]++, so I can't understand the 
> > > "half-done"
> > > above. (of course, we must disable preemption while changing rcu_flipctr).
> > > 
> > 
> > The problem is not with .completed being incremented, but only
> > by it (1) being incremented (presumably by some other CPU and
> > then (2) having this CPU get a scheduling-clock interrupt, which
> > then causes this CPU to prematurely update the rcu_flip_flag.
> > This is a problem, because updating rcu_flip_flag is making a
> > promise that you won't ever again increment the "old" counter set
> > (at least not until the next counter flip).  If the scheduling
> > clock interrupt were to happen between the time that we pick
> > up the .completed field and the time that we increment our
> > counter, we will have broken that promise, and that could cause
> > someone to prematurely declare the grace period to be finished
> 
> Thanks!
> 
> > The second reason is that cli prohibits preemption.
> 
> yes, this is clear
> 
> > > > +   rcu_ctrlblk.completed++;  /* stands in for rcu_try_flip_g2 */
> > > > +
> > > > +   /*
> > > > +* Need a memory barrier so that other CPUs see the new
> > > > +* counter value before they see the subsequent change of all
> > > > +* the rcu_flip_flag instances to rcu_flipped.
> > > > +*/
> > > 
> > > Why? Any code sequence which relies on that?
> > 
> > Yep.  If a CPU saw the flip flag set to rcu_flipped before it saw
> > the new value of the counter, it might acknowledge the flip, but
> > then later use the old value of the counter.
> 
> Yes, yes, I see now. We really need these barriers, except I think
> rcu_try_flip_idle() can use wmb. However, I have a bit of an off-topic question,
> 
>   // rcu_try_flip_waitzero()
>   if (A == 0) {
>   mb();
>   B = 0;
>   }
> 
> Do we really need the mb() in this case? How it is possible that STORE
> goes before LOAD? "Obviously", the LOAD should be completed first, no?

Suppose that A was most recently stored by a CPU that shares a store
buffer with this CPU.  Then it is possible that some other CPU sees
the store to B as happening before the store that "A==0" above is
loading from.  This other CPU would naturally conclude that the store
to B must have happened before the load from A.

In more detail, suppose that CPU 0 and 1 share a store buffer, and
that CPU 2 and 3 share a second store buffer.  This happens naturally
if CPUs 0 and 1 are really just different hardware threads within a
single core.

So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
and that the cacheline for B is initially owned by CPUs 0 and 1.
Then consider the following sequence of events:

o   

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-26 Thread Oleg Nesterov
On 09/23, Paul E. McKenney wrote:
>
> On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
> > Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and 
> > rcu_mb_flag?
> 
> Looks like it to me, thank you for the tip!
> 
> Hmmm...  Why not the same for rcu_data?  I guess because there is
> very little sharing?  The only example thus far of sharing would be
> if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
> critical section were preempted and then started on some other CPU),

even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared",
we remember the index, but not the CPU, we always modify the "local" data.

> > OTOH, I can't understand how it works. With or without local_irq_save(),
> > another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
> > at any time, we can see the old value... rcu_try_flip_waitzero() can miss 
> > us?
> > 
> > OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
> > 0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
> > my understanding is correct. Apart from this why GP_STAGES > 1 ???
> 
> This is indeed one reason.

Thanks a lot! Actually, this explains most of my questions. I was greatly
confused even before I started to read the code. Looking at rcu_flipctr[2]
I wrongly assumed that these 2 counters "belong" to different GPs. When
I started to understand the reality, I was confused by GP_STAGES == 4 ;)

> > > +  * Disable local interrupts to prevent the grace-period
> > > +  * detection state machine from seeing us half-done.
> > > +  * NMIs can still occur, of course, and might themselves
> > > +  * contain rcu_read_lock().
> > > +  */
> > > +
> > > + local_irq_save(oldirq);
> > 
> > Could you please tell more, why do we need this cli?
> > 
> > It can't "protect" rcu_ctrlblk.completed, and the only change which affects
> > the state machine is rcu_flipctr[idx]++, so I can't understand the 
> > "half-done"
> > above. (of course, we must disable preemption while changing rcu_flipctr).
> > 
> 
>   The problem is not with .completed being incremented, but only
>   by it (1) being incremented (presumably by some other CPU and
>   then (2) having this CPU get a scheduling-clock interrupt, which
>   then causes this CPU to prematurely update the rcu_flip_flag.
>   This is a problem, because updating rcu_flip_flag is making a
>   promise that you won't ever again increment the "old" counter set
>   (at least not until the next counter flip).  If the scheduling
>   clock interrupt were to happen between the time that we pick
>   up the .completed field and the time that we increment our
>   counter, we will have broken that promise, and that could cause
>   someone to prematurely declare the grace period to be finished
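
(For concreteness, the window being closed looks roughly like this -- an
illustrative sketch using names from the patch, not its exact lines:)

	/* __rcu_read_lock(), outermost nesting, with the cli omitted: */
	idx = rcu_ctrlblk.completed & 0x1;	/* reads the old .completed value */

		/* scheduling-clock interrupt fires here, sees the NEW
		 * .completed, and sets this CPU's rcu_flip_flag to
		 * rcu_flip_seen -- the promise not to touch the old
		 * counter set again */

	__get_cpu_var(rcu_flipctr)[idx]++;	/* bumps the OLD counter anyway,
						   breaking that promise */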

Thanks!

>   The second reason is that cli prohibits preemption.

yes, this is clear

> > > + rcu_ctrlblk.completed++;  /* stands in for rcu_try_flip_g2 */
> > > +
> > > + /*
> > > +  * Need a memory barrier so that other CPUs see the new
> > > +  * counter value before they see the subsequent change of all
> > > +  * the rcu_flip_flag instances to rcu_flipped.
> > > +  */
> > 
> > Why? Any code sequence which relies on that?
> 
> Yep.  If a CPU saw the flip flag set to rcu_flipped before it saw
> the new value of the counter, it might acknowledge the flip, but
> then later use the old value of the counter.
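
(Sketched, that ordering requirement in rcu_try_flip_idle() -- condensed;
the iteration macro below is approximate, not the patch's exact code:)

	rcu_ctrlblk.completed++;	/* the new counter value ...       */
	smp_mb();			/* ... must be visible before ...  */
	for_each_online_cpu(cpu)	/* ... the flip request goes out   */
		per_cpu(rcu_flip_flag, cpu) = rcu_flipped;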

Yes, yes, I see now. We really need these barriers, except I think
rcu_try_flip_idle() can use wmb. However, I have a bit of an off-topic question,

// rcu_try_flip_waitzero()
if (A == 0) {
mb();
B = 0;
}

Do we really need the mb() in this case? How it is possible that STORE
goes before LOAD? "Obviously", the LOAD should be completed first, no?

> > This looks a bit strange. Is this optimization? Why not
> > 
> > spin_lock_irqsave(&rdp->lock, flags);
> > rdp = RCU_DATA_ME();
> 
> Can't do the above, because I need the pointer before I can lock it.  ;-)

Oooh... well... I need to think more about your explanation :)

> > If this is possible, can't we move the code doing 
> > "s/rcu_flipped/rcu_flip_seen/"
> > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> 
> I believe that we cannot safely do this.  The rcu_flipped-to-rcu_flip_seen
> transition has to be synchronized to the moving of the callbacks --
> either that or we need more GP_STAGES.

Hmm. Still can't understand.

Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
(say, another CPU does rcu_try_flip_idle() in between).

We ack the flip, call_rcu() enables irqs, the timer interrupt calls
__rcu_advance_callbacks() again and moves the callbacks.

So, it is still possible that "move callbacks" and "ack the flip" happen
out of order. 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-23 Thread Paul E. McKenney
On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
> On 09/10, Paul E. McKenney wrote:
> >
> > Work in progress, not for inclusion.
> 
> Impressive work! a couple of random newbie's questions...

Thank you for the kind words, and most especially for the careful review!!!

And Oleg, I don't think you qualify as a newbie anymore.  ;-)

> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 
> > 14:02:36.0 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h  2007-08-22 
> > 15:21:06.0 -0700
> >
> >  extern void rcu_check_callbacks(int cpu, int user);
> 
> Also superfluously declared in rcuclassic.h and in rcupreempt.h

Definitely are two of them in rcupdate.h, good eyes!  The ones in
rcuclassic.h and rcupreempt.h, by the time all the patches are
applied, are for rcu_check_callbacks_rt().  However, this, along
with the other _rt() cross-calls from rcuclassic.c to rcupreempt.c,
will go away when the merge with upcoming hotplug patches is complete.

> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h   1969-12-31 
> > 16:00:00.0 -0800
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h2007-08-22 
> > 15:21:06.0 -0700
> > +
> > +#define __rcu_read_lock_nesting()  (current->rcu_read_lock_nesting)
> 
> unused?

It would seem so!  Will remove it.

> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 
> > linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
> > --- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c  1969-12-31 
> > 16:00:00.0 -0800
> > +++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c   2007-08-22 
> > 15:35:19.0 -0700
> >
> > +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> > +static struct rcu_ctrlblk rcu_ctrlblk = {
> > +   .fliplock = SPIN_LOCK_UNLOCKED,
> > +   .completed = 0,
> > +};
> > static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> 
> Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly,
> rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by
> rcu_ctrlblk.fliplock

In the case of rcu_flipctr, this is historical -- the implementation in
-rt has only one rcu_data, rather than one per CPU.  I cannot think of
a problem with placing it in rcu_data right off-hand, will look it over
carefully and see.

Good point on rcu_try_flip_state!!!

> Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and rcu_mb_flag?

Looks like it to me, thank you for the tip!

Hmmm...  Why not the same for rcu_data?  I guess because there is
very little sharing?  The only example thus far of sharing would be
if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
critical section were preempted and then started on some other CPU),
though I will need to check more carefully.

Of course, if we start having CPUs access each others' rcu_data structures
to make RCU more power-friendly in dynticks situations, then maybe there
would be more reason to use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data.

Thoughts?
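
(For reference, the suggested change for the two flags would amount to
something like the sketch below; the element types in the patch are enums,
an int is used here only to show the macro:)

	static DEFINE_PER_CPU_SHARED_ALIGNED(int, rcu_flip_flag);
	static DEFINE_PER_CPU_SHARED_ALIGNED(int, rcu_mb_flag);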

> > +void __rcu_read_lock(void)
> > +{
> > +   int idx;
> > +   struct task_struct *me = current;
> > +   int nesting;
> > +
> > +   nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > +   if (nesting != 0) {
> > +
> > +   /* An earlier rcu_read_lock() covers us, just count it. */
> > +
> > +   me->rcu_read_lock_nesting = nesting + 1;
> > +
> > +   } else {
> > +   unsigned long oldirq;
> > +
> > +   /*
> > +* Disable local interrupts to prevent the grace-period
> > +* detection state machine from seeing us half-done.
> > +* NMIs can still occur, of course, and might themselves
> > +* contain rcu_read_lock().
> > +*/
> > +
> > +   local_irq_save(oldirq);
> 
> Could you please tell more, why do we need this cli?
> 
> It can't "protect" rcu_ctrlblk.completed, and the only change which affects
> the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
> above. (of course, we must disable preemption while changing rcu_flipctr).
> 
> Hmm... this was already discussed... from another message:
> 
>   > Critical portions of the GP protection happen in the scheduler-clock
>   > interrupt, which is a hardirq.  For example, the .completed counter
>   > is always incremented in hardirq context, and we cannot tolerate a
>   > .completed increment in this code.  Allowing such an increment would
>   > defeat the counter-acknowledge state in the state machine.
> 
> Still can't understand, please help! .completed could be incremented by
> another CPU, why rcu_check_callbacks() running on _this_ CPU is so special?

Yeah, I guess my explanation -was- a bit lacking...  When I re-read it, it
didn't even help -me- much!  ;-)

I am putting together better documentation, but in the meantime, let me
try again...

The problem is not with .completed being incremented, but only
by it 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-23 Thread Oleg Nesterov
On 09/10, Paul E. McKenney wrote:
>
> Work in progress, not for inclusion.

Impressive work! a couple of random newbie's questions...

> --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h   2007-07-19 
> 14:02:36.0 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h2007-08-22 
> 15:21:06.0 -0700
>
>  extern void rcu_check_callbacks(int cpu, int user);

Also superfluously declared in rcuclassic.h and in rcupreempt.h

> --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 
> 16:00:00.0 -0800
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h  2007-08-22 
> 15:21:06.0 -0700
> +
> +#define __rcu_read_lock_nesting()(current->rcu_read_lock_nesting)

unused?

> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 
> linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
> --- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c1969-12-31 
> 16:00:00.0 -0800
> +++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 
> 15:35:19.0 -0700
>
> +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> +static struct rcu_ctrlblk rcu_ctrlblk = {
> + .fliplock = SPIN_LOCK_UNLOCKED,
> + .completed = 0,
> +};
> static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };

Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly,
rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by
rcu_ctrlblk.fliplock

Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and rcu_mb_flag?

> +void __rcu_read_lock(void)
> +{
> + int idx;
> + struct task_struct *me = current;
> + int nesting;
> +
> + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> + if (nesting != 0) {
> +
> + /* An earlier rcu_read_lock() covers us, just count it. */
> +
> + me->rcu_read_lock_nesting = nesting + 1;
> +
> + } else {
> + unsigned long oldirq;
> +
> + /*
> +  * Disable local interrupts to prevent the grace-period
> +  * detection state machine from seeing us half-done.
> +  * NMIs can still occur, of course, and might themselves
> +  * contain rcu_read_lock().
> +  */
> +
> + local_irq_save(oldirq);

Could you please tell more, why do we need this cli?

It can't "protect" rcu_ctrlblk.completed, and the only change which affects
the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
above. (of course, we must disable preemption while changing rcu_flipctr).

Hmm... this was already discussed... from another message:

> Critical portions of the GP protection happen in the scheduler-clock
> interrupt, which is a hardirq.  For example, the .completed counter
> is always incremented in hardirq context, and we cannot tolerate a
> .completed increment in this code.  Allowing such an increment would
> defeat the counter-acknowledge state in the state machine.

Still can't understand, please help! .completed could be incremented by
another CPU, why rcu_check_callbacks() running on _this_ CPU is so special?

> +
> + /*
> +  * Outermost nesting of rcu_read_lock(), so increment
> +  * the current counter for the current CPU.  Use volatile
> +  * casts to prevent the compiler from reordering.
> +  */
> +
> + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> + smp_read_barrier_depends();  /*  might be unneeded */
> + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;

Isn't it "obvious" that this barrier is not needed? rcu_flipctr could be
> changed only by this CPU, regardless of the actual value of idx, we can't
read the "wrong" value of rcu_flipctr[idx], no?

OTOH, I can't understand how it works. With or without local_irq_save(),
another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
at any time, we can see the old value... rcu_try_flip_waitzero() can miss us?

OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
my understanding is correct. Apart from this why GP_STAGES > 1 ???
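
(Roughly what rcu_try_flip_waitzero() does with that last index -- a
condensed sketch, with tracing dropped and the iteration macro approximate:)

	static int rcu_try_flip_waitzero(void)
	{
		int lastidx = !(rcu_ctrlblk.completed & 0x1);	/* the old counter set */
		int cpu, sum = 0;

		for_each_online_cpu(cpu)
			sum += per_cpu(rcu_flipctr, cpu)[lastidx];
		if (sum != 0)
			return 0;	/* old readers may still be in flight */
		smp_mb();		/* don't call for memory barriers before we see zero */
		/* ... now set each CPU's rcu_mb_flag to request the barriers ... */
		return 1;
	}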

Well, I think this code is just too tricky for me! Will try to read again
after sleep ;)

> +static int
> +rcu_try_flip_idle(void)
> +{
> + int cpu;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> + if (!rcu_pending(smp_processor_id())) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> + return 0;
> + }
> +
> + /*
> +  * Do the flip.
> +  */
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1);
> + rcu_ctrlblk.completed++;  /* stands in for rcu_try_flip_g2 */
> +
> + /*
> +  * Need a memory barrier so that other CPUs see the new
> +  * counter value before they see the subsequent change of all
> +  * the rcu_flip_flag 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 10:56:56PM -0400, Steven Rostedt wrote:
> 
> [ sneaks away from the family for a bit to answer emails ]

[ same here, now that you mention it... ]

> --
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> 
> > On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
> > >
> > > --
> > > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > > >
> > > > > In any case, I will be looking at the scenarios more carefully.  If
> > > > > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > > > > that is an easy change!  I just fired off a POWER run with GP_STAGES
> > > > > set to 3, will let you know how it goes.
> > > >
> > > > The first attempt blew up during boot badly enough that ABAT was unable
> > > > to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
> > > > it again on a machine that ABAT has had a better record of reviving...
> > >
> > > This still frightens the hell out of me. Going through 15 states and
> > > failing. Seems the CPU is holding off writes for a long long time. That
> > > means we flipped the counter 4 times, and that still wasn't good enough?
> >
> > Might be that the other machine has its 2.6.22 version of .config messed
> > up.  I will try booting it on a stock 2.6.22 kernel when it comes back
> > to life -- not sure I ever did that before.  Besides, the other similar
> > machine seems to have gone down for the count, but without me torturing
> > it...
> >
> > Also, keep in mind that various stages can "record" a memory misordering,
> > for example, by incrementing the wrong counter.
> >
> > > Maybe I'll boot up my powerbook to see if it has the same issues.
> > >
> > > Well, I'm still finishing up on moving into my new house, so I wont be
> > > available this weekend.
> >
> > The other machine not only booted, but has survived several minutes of
> > rcutorture thus far.  I am also trying POWER5 machine as well, as the
> > one currently running is a POWER4, which is a bit less aggressive about
> > memory reordering than is the POWER5.
> >
> > Even if they pass, I refuse to reduce GP_STAGES until proven safe.
> > Trust me, you -don't- want to be unwittingly making use of a subtly
> > busted RCU implementation!!!
> 
> I totally agree. This is the same reason I want to understand -why- it
> fails with 3 stages. To make sure that adding a 4th stage really does fix
> it, and doesn't just make the chances for the bug smaller.

Or if it really does break, as opposed to my having happened upon a sick
or misconfigured machine.

> I just have that funny feeling that we are band-aiding this for POWER with
> extra stages and not really solving the bug.
> 
> I could be totally out in left field on this. But the more people have a
> good understanding of what is happening (this includes why things fail)
> the more people in general can trust this code.  Right now I'm thinking
> you may be the only one that understands this code enough to trust it. I'm
> just wanting you to help people like me to trust the code by understanding
> and not just having faith in others.

Agreed.  Trusting me is grossly insufficient.  For one thing, the Linux
kernel has a reasonable chance of outliving me.

> If you ever decide to give up jogging, we need to make sure that there are
> people here that can still fill those running shoes (-:

Well, I certainly don't jog as fast or as far as I used to!  ;-)

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 11:15:42PM -0400, Steven Rostedt wrote:
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
> > > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > > On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > > > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > Are we sure that adding all these grace-period stages is better than just
> > > biting the bullet and putting in a memory barrier?
> >
> > Good question.  I believe so, because the extra stages don't require
> > much additional processing, and because the ratio of rcu_read_lock()
> > calls to the number of grace periods is extremely high.  But, if I
> > can prove it is safe, I will certainly decrease GP_STAGES or otherwise
> > optimize the state machine.
> 
> But until others besides yourself understand that state machine (doesn't
> really need to be me) I would be worried about applying it without
> barriers.  The barriers may add a bit of overhead, but it adds some
> confidence in the code.  I'm arguing that we have barriers in there until
> there's a fine understanding of why we fail with 3 stages and not 4.
> Perhaps you don't have a box with enough cpus to fail at 4.
> 
> I don't know how the higher ups in the kernel command line feel, but I
> think that memory barriers on critical sections are justified. But if you
> can show a proof that adding extra stages is sufficient to deal with
> CPUs moving memory writes around, then so be it. But I'm still not
> convinced that these extra stages are really solving the bug instead of
> just making it much less likely to happen.
> 
> Ingo praised this code since it had several years of testing in the RT
> tree. But that version has barriers, so this new version without the
> barriers has not had that "run it through the grinder" feeling to it.

Fair point...  Though the -rt variant has its shortcomings as well,
such as being unusable from NMI/SMI handlers.

How about this:  I continue running the GP_STAGES=3 run on the pair of
POWER machines (which are both going strong), I also get a document
together describing the new version (and of course apply the changes we
have discussed, and merge with recent CPU-hotplug changes -- Gautham
Shenoy is currently working this), work out a good answer to "how
big exactly does GP_STAGES need to be", test whatever that number is,
assuming it is neither 3 nor 4, and figure out why the gekko-lp1 machine
choked on GP_STAGES=3.

Then we can work out the best path forward from wherever that ends up
being.

[ . . . ]

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

[took off Ingo, because he has my ISP blacklisted, and I'm tired of
getting those return mail messages. He can read LKML or you can re-CC
him. Sad since this is a topic he should be reading. ]

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

> On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
> > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > > > + /*
> > > > > +  * Take the next transition(s) through the RCU grace-period
> > > > > +  * flip-counter state machine.
> > > > > +  */
> > > > > +
> > > > > + switch (rcu_try_flip_state) {
> > > > > + case rcu_try_flip_idle_state:
> > > > > + if (rcu_try_flip_idle())
> > > > > + rcu_try_flip_state = rcu_try_flip_waitack_state;
> > > >
> > > > Just trying to understand all this. Here at flip_idle, only a CPU with
> > > > no pending RCU calls will flip it. Then all the cpus flags will be set
> > > > to rcu_flipped, and the ctrl.completed counter is incremented.
> > >
> > > s/no pending RCU calls/at least one pending RCU call/, but otherwise
> > > spot on.
> > >
> > > So if the RCU grace-period machinery is idle, the first CPU to take
> > > a scheduling-clock interrupt after having posted an RCU callback will
> > > get things going.
> >
> > I said 'no' becaues of this:
> >
> > +rcu_try_flip_idle(void)
> > +{
> > +   int cpu;
> > +
> > +   RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> > +   if (!rcu_pending(smp_processor_id())) {
> > +   RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> > +   return 0;
> > +   }
> >
> > But now I'm a bit more confused. :-/
> >
> > Looking at the caller in kernel/timer.c I see
> >
> > if (rcu_pending(cpu))
> > rcu_check_callbacks(cpu, user_tick);
> >
> > And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
> > that we call this when we have a pending rcu, but if we have a pending
> > rcu, we won't flip the counter ??
>
> We don't enter unless there is something for RCU to do (might be a
> pending callback, for example, but might also be needing to acknowledge
> a counter flip).  If, by the time we get to rcu_try_flip_idle(), there
> is no longer anything to do (!rcu_pending()), we bail.
>
> So a given CPU kicks the state machine out of idle only if it -still-
> has something to do once it gets to rcu_try_flip_idle(), right?
>
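
(To make that control flow concrete, here is a condensed sketch of the path
being discussed -- simplified from the 2.6.22 timer code and the quoted
patch, with details and tracing omitted:)

	/* scheduling-clock interrupt path */
	void update_process_times(int user_tick)
	{
		int cpu = smp_processor_id();

		if (rcu_pending(cpu))			/* anything at all for RCU to do? */
			rcu_check_callbacks(cpu, user_tick);	/* may call rcu_try_flip() */
		/* ... */
	}

	static int rcu_try_flip_idle(void)
	{
		if (!rcu_pending(smp_processor_id()))	/* nothing left to do by now ... */
			return 0;			/* ... so stay idle               */

		rcu_ctrlblk.completed++;		/* otherwise kick off a new GP    */
		/* ... set every CPU's rcu_flip_flag to rcu_flipped ... */
		return 1;
	}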

Now I can slap my forehead!  Duh, I wasn't seeing that ! in front of the
rcu_pending condition in the rcu_try_flip_idle.  We only flip if we do
indeed have something pending. I need some sleep. I also need to
re-evaluate some of my analysis of that code. But it doesn't change my
opinion of the stages.

> >
> > Are we sure that adding all these grace-period stages is better than just
> > biting the bullet and putting in a memory barrier?
>
> Good question.  I believe so, because the extra stages don't require
> much additional processing, and because the ratio of rcu_read_lock()
> calls to the number of grace periods is extremely high.  But, if I
> can prove it is safe, I will certainly decrease GP_STAGES or otherwise
> optimize the state machine.

But until others besides yourself understand that state machine (doesn't
really need to be me) I would be worried about applying it without
barriers.  The barriers may add a bit of overhead, but it adds some
confidence in the code.  I'm arguing that we have barriers in there until
there's a fine understanding of why we fail with 3 stages and not 4.
Perhaps you don't have a box with enough cpus to fail at 4.

I don't know how the higher ups in the kernel command line feel, but I
think that memory barriers on critical sections are justified. But if you
can show a proof that adding extra stages is sufficient to deal with
CPUs moving memory writes around, then so be it. But I'm still not
convinced that these extra stages are really solving the bug instead of
just making it much less likely to happen.

Ingo praised this code since it had several years of testing in the RT
tree. But that version has barriers, so this new version without the
barriers has not had that "run it through the grinder" feeling to it.

>
> [ . . . ]
>
> > > > OK, that's all I have on this patch (will take a bit of a break before
> > > > reviewing your other patches).  But I will say that RCU has grown quite
> > > > a bit, and is looking very good.
> > >
> > > Glad you like it, and thank you again for the careful and thorough review.
> >
> > I'm scared to do the preempt portion %^O
>
> Ummm...  This -was- the preempt portion.  ;-)

hehe, I do need sleep. I meant the boosting portion.

>
> > > > Basically, what I'm saying is "Great work, Paul!".  This is looking
> > > > good. Seems that we just need a little bit better explanation for those
> > > > that are not up at the IQ level of you.  I can write 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

[ sneaks away from the family for a bit to answer emails ]

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

> On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
> >
> > --
> > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > >
> > > > In any case, I will be looking at the scenarios more carefully.  If
> > > > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > > > that is an easy change!  I just fired off a POWER run with GP_STAGES
> > > > set to 3, will let you know how it goes.
> > >
> > > The first attempt blew up during boot badly enough that ABAT was unable
> > > to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
> > > it again on a machine that ABAT has had a better record of reviving...
> >
> > This still frightens the hell out of me. Going through 15 states and
> > failing. Seems the CPU is holding off writes for a long long time. That
> > means we flipped the counter 4 times, and that still wasn't good enough?
>
> Might be that the other machine has its 2.6.22 version of .config messed
> up.  I will try booting it on a stock 2.6.22 kernel when it comes back
> to life -- not sure I ever did that before.  Besides, the other similar
> machine seems to have gone down for the count, but without me torturing
> it...
>
> Also, keep in mind that various stages can "record" a memory misordering,
> for example, by incrementing the wrong counter.
>
> > Maybe I'll boot up my powerbook to see if it has the same issues.
> >
> > Well, I'm still finishing up on moving into my new house, so I won't be
> > available this weekend.
>
> The other machine not only booted, but has survived several minutes of
> rcutorture thus far.  I am also trying a POWER5 machine as well, as the
> one currently running is a POWER4, which is a bit less aggressive about
> memory reordering than is the POWER5.
>
> Even if they pass, I refuse to reduce GP_STAGES until proven safe.
> Trust me, you -don't- want to be unwittingly making use of a subtly
> busted RCU implementation!!!

I totally agree. This is the same reason I want to understand -why- it
fails with 3 stages. To make sure that adding a 4th stage really does fix
it, and doesn't just make the chances for the bug smaller.

I just have that funny feeling that we are band-aiding this for POWER with
extra stages and not really solving the bug.

I could be totally out in left field on this. But the more people who have a
good understanding of what is happening (and that includes why things fail),
the more people in general can trust this code.  Right now I'm thinking
you may be the only one who understands this code well enough to trust it. I'm
just asking you to help people like me trust the code through understanding,
and not just through having faith in others.

If you ever decide to give up jogging, we need to make sure that there are
people here that can still fill those running shoes (-:


 -- Steve



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > > +   /*
> > > > +* Take the next transition(s) through the RCU grace-period
> > > > +* flip-counter state machine.
> > > > +*/
> > > > +
> > > > +   switch (rcu_try_flip_state) {
> > > > +   case rcu_try_flip_idle_state:
> > > > +   if (rcu_try_flip_idle())
> > > > +   rcu_try_flip_state = rcu_try_flip_waitack_state;
> > >
> > > Just trying to understand all this. Here at flip_idle, only a CPU with
> > > no pending RCU calls will flip it. Then all the CPUs' flags will be set
> > > to rcu_flipped, and the ctrl.completed counter is incremented.
> >
> > s/no pending RCU calls/at least one pending RCU call/, but otherwise
> > spot on.
> >
> > So if the RCU grace-period machinery is idle, the first CPU to take
> > a scheduling-clock interrupt after having posted an RCU callback will
> > get things going.
> 
> I said 'no' because of this:
> 
> +rcu_try_flip_idle(void)
> +{
> +   int cpu;
> +
> +   RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> +   if (!rcu_pending(smp_processor_id())) {
> +   RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> +   return 0;
> +   }
> 
> But now I'm a bit more confused. :-/
> 
> Looking at the caller in kernel/timer.c I see
> 
>   if (rcu_pending(cpu))
>   rcu_check_callbacks(cpu, user_tick);
> 
> And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
> that we call this when we have a pending rcu, but if we have a pending
> rcu, we won't flip the counter ??

We don't enter unless there is something for RCU to do (might be a
pending callback, for example, but might also be needing to acknowledge
a counter flip).  If, by the time we get to rcu_try_flip_idle(), there
is no longer anything to do (!rcu_pending()), we bail.

So a given CPU kicks the state machine out of idle only if it -still-
has something to do once it gets to rcu_try_flip_idle(), right?

[ . . . ]

> > > Is there a chance that overflow of a counter (although probably very
> > > very unlikely) would cause any problems?
> >
> > The only way it could cause a problem would be if there was ever
> > more than 4,294,967,296 outstanding rcu_read_lock() calls.  I believe
> > that lockdep screams if it sees more than 30 nested locks within a
> > single task, so for systems that support no more than 100M tasks, we
> > should be OK.  It might sometime be necessary to make this be a long
> > rather than an int.  Should we just do that now and be done with it?
> 
> Sure, why not. More and more and more overkill!!!
> 
> (rostedt hears in his head the Monty Python "Spam" song).

;-)  OK!
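
[ For reference: the counter being discussed here is presumably the per-CPU
rcu_flipctr pair quoted elsewhere in this thread; the proposed widening
would be roughly the following sketch, which is not part of the posted
patch: ]

/* As posted (quoted later in this thread): */
static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };

/* The "overkill" variant being agreed to above -- a sketch only: */
static DEFINE_PER_CPU(long [2], rcu_flipctr) = { 0, 0 };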

> > > Also, all the CPUs have their "check_mb" set.
> > >
> > > > +   rcu_try_flip_state = rcu_try_flip_waitmb_state;
> > > > +   break;
> > > > +   case rcu_try_flip_waitmb_state:
> > > > +   if (rcu_try_flip_waitmb())
> > >
> > > I have to admit that this seems a bit of an overkill, but I guess you
> > > know what you are doing.  After going through three states, we still
> > > need to do a memory barrier on each CPU?
> >
> > Yep.  Because there are no memory barriers in rcu_read_unlock(), the
> > CPU is free to reorder the contents of the RCU read-side critical section
> > to follow the counter decrement.  This means that this CPU would still
> > be referencing RCU-protected data after it had told the world that it
> > was no longer doing so.  Forcing a memory barrier on each CPU guarantees
> > that if we see the memory-barrier acknowledge, we also see any prior
> > RCU read-side critical section.
> 
> And it seems reasonable to me that this would be enough to satisfy a
> grace period. But then there's the CPU moving the rcu_read_(un)lock()s around.
> 
> Are we sure that adding all these grace-period stages is better than just
> biting the bullet and putting in a memory barrier?

Good question.  I believe so, because the extra stages don't require
much additional processing, and because the ratio of rcu_read_lock()
calls to the number of grace periods is extremely high.  But, if I
can prove it is safe, I will certainly decrease GP_STAGES or otherwise
optimize the state machine.

[ . . . ]

> > > OK, that's all I have on this patch (will take a bit of a break before
> > > reviewing your other patches).  But I will say that RCU has grown quite
> > > a bit, and is looking very good.
> >
> > Glad you like it, and thank you again for the careful and thorough review.
> 
> I'm scared to do the preempt portion %^O

Ummm...  This -was- the preempt portion.  ;-)

> > > Basically, what I'm saying is "Great work, Paul!".  This is looking
> > > good. Seems that we just need a little bit better explanation for those
> > > 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
> 
> --
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > >
> > > In any case, I will be looking at the scenarios more carefully.  If
> > > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > > that is an easy change!  I just fired off a POWER run with GP_STAGES
> > > set to 3, will let you know how it goes.
> >
> > The first attempt blew up during boot badly enough that ABAT was unable
> > to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
> > it again on a machine that ABAT has had a better record of reviving...
> 
> This still frightens the hell out of me. Going through 15 states and
> failing. Seems the CPU is holding off writes for a long long time. That
> means we flipped the counter 4 times, and that still wasn't good enough?

Might be that the other machine has its 2.6.22 version of .config messed
up.  I will try booting it on a stock 2.6.22 kernel when it comes back
to life -- not sure I ever did that before.  Besides, the other similar
machine seems to have gone down for the count, but without me torturing
it...

Also, keep in mind that various stages can "record" a memory misordering,
for example, by incrementing the wrong counter.

> Maybe I'll boot up my powerbook to see if it has the same issues.
> 
> Well, I'm still finishing up on moving into my new house, so I won't be
> available this weekend.

The other machine not only booted, but has survived several minutes of
rcutorture thus far.  I am also trying a POWER5 machine as well, as the
one currently running is a POWER4, which is a bit less aggressive about
memory reordering than is the POWER5.

Even if they pass, I refuse to reduce GP_STAGES until proven safe.
Trust me, you -don't- want to be unwittingly making use of a subtly
busted RCU implementation!!!

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> >
> > In any case, I will be looking at the scenarios more carefully.  If
> > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > that is an easy change!  I just fired off a POWER run with GP_STAGES
> > set to 3, will let you know how it goes.
>
> The first attempt blew up during boot badly enough that ABAT was unable
> to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
> it again on a machine that ABAT has had a better record of reviving...

This still frightens the hell out of me. Going through 15 states and
failing. Seems the CPU is holding off writes for a long long time. That
means we flipped the counter 4 times, and that still wasn't good enough?

Maybe I'll boot up my powerbook to see if it has the same issues.

Well, I'm still finishing up on moving into my new house, so I won't be
available this weekend.

Thanks,

-- Steve



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

On Fri, 21 Sep 2007, Paul E. McKenney wrote:

> On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
> Covering the pieces that weren't in Peter's reply.  ;-)
>
> And thank you -very- much for the careful and thorough review!!!
>
> > >  #endif /* __KERNEL__ */
> > >  #endif /* __LINUX_RCUCLASSIC_H */
> > > diff -urpNa -X dontdiff 
> > > linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
> > > linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> > > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h   2007-07-19 
> > > 14:02:36.0 -0700
> > > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h2007-08-22 
> > > 15:21:06.0 -0700
> > > @@ -52,7 +52,11 @@ struct rcu_head {
> > >   void (*func)(struct rcu_head *head);
> > >  };
> > >
> > > +#ifdef CONFIG_CLASSIC_RCU
> > >  #include <linux/rcuclassic.h>
> > > +#else /* #ifdef CONFIG_CLASSIC_RCU */
> > > +#include <linux/rcupreempt.h>
> > > +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
> >
> > A bit extreme on the comments here.
>
> My fingers do this without any help from the rest of me, but I suppose
> it is a bit of overkill in this case.

Heck, why stop the overkill here, the whole patch is overkill ;-)

> > > +
> > > +#define GP_STAGES 4
> >
> > I take it that GP stand for "grace period". Might want to state that
> > here. /* Grace period stages */  When I was looking at this code at 1am,
> > I kept asking myself "what's this GP?" (General Protection??). But
> > that's what happens when looking at code like this after midnight ;-)
>
> Good point, will add a comment.  You did get it right, "grace period".

Thanks, so many places in the kernel have acronyms that are just supposed
to be "obvious". I hate them, because they make me feel so stupid when I
don't know what they are. After I find out, I usually slap my forehead and
say "duh!". My mind is set on reading code, not deciphering TLAs.
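
[ The comment being agreed to above could be as simple as the following
sketch; the wording is illustrative, not from the patch: ]

#define GP_STAGES 4	/* Grace-period ("GP") stages: the number of wait
			   lists a callback passes through between
			   call_rcu() and invocation. */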


> >
> > Can you have a pointer somewhere that explains these states. And not a
> > "it's in this paper or directory". Either have a short description here,
> > or specify where exactly to find the information (perhaps a
> > Documentation/RCU/preemptible_states.txt?).
> >
> > Trying to understand these states has caused me the most agony in
> > reviewing these patches.
>
> Good point, perhaps a comment block above the enum giving a short
> description of the purpose of each state.  Maybe more detail in
> Documentation/RCU as well, as you suggest above.

That would be great.
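
[ As a sketch of what such a comment block might look like -- the
descriptions below are paraphrased from Paul's state-by-state explanation
later in this thread, not taken from the patch itself: ]

/*
 * Grace-period state machine used by rcu_try_flip() and friends:
 */
enum rcu_try_flip_states {
	rcu_try_flip_idle_state,	/* "I": nothing happening; flip the
					 *	counter if work shows up.    */
	rcu_try_flip_waitack_state,	/* "A": wait for every CPU to notice
					 *	the flip, so that the old
					 *	counters are no longer
					 *	incremented.		      */
	rcu_try_flip_waitzero_state,	/* "Z": wait for the sum of the old
					 *	per-CPU counters to reach
					 *	zero.			      */
	rcu_try_flip_waitmb_state	/* "M": wait for each CPU to execute
					 *	a memory barrier, so that its
					 *	prior read-side critical
					 *	sections are truly finished. */
};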

> > > +
> > > +/*
> > > + * Return the number of RCU batches processed thus far.  Useful for debug
> > > + * and statistics.  The _bh variant is identical to straight RCU.
> > > + */
> >
> > If they are identical, then why the separation?
>
> I apologize for the repetition in this email.
>
> I apologize for the repetition in this email.
>
> I apologize for the repetition in this email.
>
> Yep, will fix with either #define or static inline, as you suggested
> in a later email.

you're starting to sound like me ;-)

> > > + struct task_struct *me = current;
> >
> > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > variable to assign current to.  It's just that "me" throws me off a
> > little while reviewing this.  But this is just a nitpick, so do as you
> > will.
>
> Fair enough, as discussed earlier.

Who's on first, What's on second, and I-don't-know is on third.

> > > + unsigned long oldirq;
> >
> > Nitpick, "flags" is usually used for saving irq state.
>
> A later patch in the series fixes these -- I believe I got all of them.
> (The priority-boost patch, IIRC.)

OK

>
> > > +
> > > + /*
> > > +  * Disable local interrupts to prevent the grace-period
> > > +  * detection state machine from seeing us half-done.
> > > +  * NMIs can still occur, of course, and might themselves
> > > +  * contain rcu_read_lock().
> > > +  */
> > > +
> > > + local_irq_save(oldirq);
> >
> > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > local_bh_disable be sufficient here? You already cover NMIs, which would
> > also handle normal interrupts.
>
> We beat this into the ground in other email.

Nothing like kicking a dead horse on LKML ;-)

> > > +
> > > + /*
> > > +  * It is now safe to decrement this task's nesting count.
> > > +  * NMIs that occur after this statement will route their
> > > +  * rcu_read_lock() calls through this "else" clause, and
> > > +  * will thus start incrementing the per-CPU coutner on
> >
> > s/coutner/counter/
>
> wlli fxi!!!

snousd oogd

> > > +
> > > +/*
> > > + * Attempt a single flip of the counters.  Remember, a single flip does
> > > + * -not- constitute a grace period.  Instead, the interval between
> > > + * at least three consecutive flips is a grace period.
> > > + *
> > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> >
> > Oh, come now! 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 04:03:43PM -0700, Paul E. McKenney wrote:
> On Fri, Sep 21, 2007 at 11:20:48AM -0400, Steven Rostedt wrote:
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Paul,
> > 
> > Looking further into this, I still think this is a bit of overkill. We
> > go through 20 states from call_rcu to list->func().
> > 
> > On call_rcu we put our stuff on the next list. Before we move stuff from
> > next to wait, we need to go through 4 states. So we have
> > 
> > next -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states ->
> > wait[2] -> 4 states -> wait[3] -> 4 states -> done.
> > 
> > That's 20 states that we go through from the time we add our function to
> > the list to the time it actually gets called. Do we really need the 4
> > wait lists?
> > 
> > Seems a bit overkill to me.
> > 
> > What am I missing?
> 
> "Nothing kills like overkill!!!"  ;-)
> 
> Seriously, I do expect to be able to squeeze this down over time, but
> feel the need to be a bit on the cowardly side at the moment.
> 
> In any case, I will be looking at the scenarios more carefully.  If
> it turns out that GP_STAGES can indeed be cranked down a bit, well,
> that is an easy change!  I just fired off a POWER run with GP_STAGES
> set to 3, will let you know how it goes.

The first attempt blew up during boot badly enough that ABAT was unable
to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
it again on a machine that ABAT has had a better record of reviving...

Thanx, Paul
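
[ The next -> wait[0..3] -> done progression that Steve counts out above
can be modeled with a toy user-space program.  This is a sketch of the
bookkeeping only, not the kernel code: it advances one step per counter
flip and ignores the four flip-states that each flip itself goes through. ]

#include <stdio.h>

#define GP_STAGES 4

int main(void)
{
	int next = 1;			/* one freshly posted callback */
	int wait[GP_STAGES] = { 0 };
	int done = 0;
	int flip, i;

	for (flip = 1; done == 0; flip++) {
		/* Loosely mimics __rcu_advance_callbacks(): the oldest
		 * stage retires to "done", everything else shifts down
		 * one stage, and the "next" list enters wait[0]. */
		done += wait[GP_STAGES - 1];
		for (i = GP_STAGES - 1; i > 0; i--)
			wait[i] = wait[i - 1];
		wait[0] = next;
		next = 0;
		printf("after flip %d: wait = [%d %d %d %d], done = %d\n",
		       flip, wait[0], wait[1], wait[2], wait[3], done);
	}
	/* The callback retires only after GP_STAGES + 1 flips, which is
	 * the Stage 0 .. Stage 4 sequence Paul lists elsewhere in this
	 * thread. */
	return 0;
}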


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

Covering the pieces that weren't in Peter's reply.  ;-)

And thank you -very- much for the careful and thorough review!!!

> >  #endif /* __KERNEL__ */
> >  #endif /* __LINUX_RCUCLASSIC_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
> > linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 
> > 14:02:36.0 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h  2007-08-22 
> > 15:21:06.0 -0700
> > @@ -52,7 +52,11 @@ struct rcu_head {
> > void (*func)(struct rcu_head *head);
> >  };
> >  
> > +#ifdef CONFIG_CLASSIC_RCU
> >  #include <linux/rcuclassic.h>
> > +#else /* #ifdef CONFIG_CLASSIC_RCU */
> > +#include <linux/rcupreempt.h>
> > +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
> 
> A bit extreme on the comments here.

My fingers do this without any help from the rest of me, but I suppose
it is a bit of overkill in this case.

> >  #define RCU_HEAD_INIT  { .next = NULL, .func = NULL }
> >  #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> > @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
> >  /* Exported common interfaces */
> >  extern void synchronize_rcu(void);
> >  extern void rcu_barrier(void);
> > +extern long rcu_batches_completed(void);
> > +extern long rcu_batches_completed_bh(void);
> >  
> >  /* Internal to kernel */
> >  extern void rcu_init(void);
> >  extern void rcu_check_callbacks(int cpu, int user);
> > +extern int rcu_needs_cpu(int cpu);
> >  
> >  #endif /* __KERNEL__ */
> >  #endif /* __LINUX_RCUPDATE_H */
> > diff -urpNa -X dontdiff 
> > linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 
> > linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h   1969-12-31 
> > 16:00:00.0 -0800
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h2007-08-22 
> > 15:21:06.0 -0700
> > @@ -0,0 +1,78 @@
> > +/*
> > + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, 
> > USA.
> > + *
> > + * Copyright (C) IBM Corporation, 2006
> > + *
> > + * Author:  Paul McKenney <[EMAIL PROTECTED]>
> > + *
> > + * Based on the original work by Paul McKenney <[EMAIL PROTECTED]>
> > + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> > + * Papers:
> > + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> > + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf 
> > (OLS2001)
> > + *
> > + * For detailed explanation of Read-Copy Update mechanism see -
> > + * Documentation/RCU
> > + *
> > + */
> > +
> > +#ifndef __LINUX_RCUPREEMPT_H
> > +#define __LINUX_RCUPREEMPT_H
> > +
> > +#ifdef __KERNEL__
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define rcu_qsctr_inc(cpu)
> > +#define rcu_bh_qsctr_inc(cpu)
> > +#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
> > +
> > +extern void __rcu_read_lock(void);
> > +extern void __rcu_read_unlock(void);
> > +extern int rcu_pending(int cpu);
> > +extern int rcu_needs_cpu(int cpu);
> > +
> > +#define __rcu_read_lock_bh()   { rcu_read_lock(); local_bh_disable(); }
> > +#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); 
> > }
> > +
> > +#define __rcu_read_lock_nesting()  (current->rcu_read_lock_nesting)
> > +
> > +extern void __synchronize_sched(void);
> > +
> > +extern void __rcu_init(void);
> > +extern void rcu_check_callbacks(int cpu, int user);
> > +extern void rcu_restart_cpu(int cpu);
> > +
> > +#ifdef CONFIG_RCU_TRACE
> > +struct rcupreempt_trace;
> > +extern int *rcupreempt_flipctr(int cpu);
> > +extern long rcupreempt_data_completed(void);
> > +extern int rcupreempt_flip_flag(int cpu);
> > +extern int rcupreempt_mb_flag(int cpu);
> > +extern char *rcupreempt_try_flip_state_name(void);
> > +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
> > +#endif
> > +
> > +struct softirq_action;
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* __LINUX_RCUPREEMPT_H */
> > diff -urpNa -X 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 07:23:09PM -0400, Steven Rostedt wrote:
> --
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> 
> > If you do a synchronize_rcu() it might well have to wait through the
> > following sequence of states:
> >
> > Stage 0: (might have to wait through part of this to get out of "next" 
> > queue)
> > rcu_try_flip_idle_state,/* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state,/* "Z" */
> > rcu_try_flip_waitmb_state   /* "M" */
> > Stage 1:
> > rcu_try_flip_idle_state,/* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state,/* "Z" */
> > rcu_try_flip_waitmb_state   /* "M" */
> > Stage 2:
> > rcu_try_flip_idle_state,/* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state,/* "Z" */
> > rcu_try_flip_waitmb_state   /* "M" */
> > Stage 3:
> > rcu_try_flip_idle_state,/* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state,/* "Z" */
> > rcu_try_flip_waitmb_state   /* "M" */
> > Stage 4:
> > rcu_try_flip_idle_state,/* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state,/* "Z" */
> > rcu_try_flip_waitmb_state   /* "M" */
> >
> > So yes, grace periods do indeed have some latency.
> 
> Yes they do. I'm now at the point that I'm just "trusting" you that you
> understand that each of these stages is needed. My IQ level only lets me
> understand next -> wait -> done, but not the extra 3 shifts in wait.
> 
> ;-)

In the spirit of full disclosure, I am not -absolutely- certain that
they are needed, only that they are sufficient.  Just color me paranoid.

> > > True, but the "me" confused me. Since that task struct is not me ;-)
> >
> > Well, who is it, then?  ;-)
> 
> It's the app I watch sitting there waiting its turn for its callback to
> run.

:-)

> > > > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > > > local_bh_disable be sufficient here? You already cover NMIs, which 
> > > > > would
> > > > > also handle normal interrupts.
> > > >
> > > > This is also my understanding, but I think this disable is an
> > > > 'optimization' in that it avoids the regular IRQs from jumping through
> > > > these hoops outlined below.
> > >
> > > But isn't disabling irqs slower than doing a local_bh_disable? So the
> > > majority of times (where irqs will not happen) we have this overhead.
> >
> > The current code absolutely must exclude the scheduling-clock hardirq
> > handler.
> 
> ACKed,
> The reasoning you gave in Peter's reply most certainly makes sense.
> 
> > > > > > + *
> > > > > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU 
> > > > > > implementation
> > > > >
> > > > > Oh, come now! It's not "nuts" to use this ;-)
> > > > >
> > > > > > + * on a large SMP, they might want to use a hierarchical 
> > > > > > organization of
> > > > > > + * the per-CPU-counter pairs.
> > > > > > + */
> > > >
> > > It's the large SMP case that's nuts, and on that I have to agree with
> > > Paul, it's not really large-SMP friendly.
> > >
> > Hmm, that could be true. But on large SMP systems, you usually have
> > large amounts of memory, so hopefully a really long synchronize_rcu
> > > would not be a problem.
> >
> > Somewhere in the range from 64 to a few hundred CPUs, the global lock
> > protecting the try_flip state machine would start sucking air pretty
> > badly.  But the real problem is synchronize_sched(), which loops through
> > all the CPUs --  this would likely cause problems at a few tens of
> > CPUs, perhaps as early as 10-20.
> 
> hehe, from someone whose largest box is 4 CPUs, to me 16 CPUs is large.
> But I can see how hundreds, let alone thousands, of CPUs would bring
> things like synchronize_sched to a grinding halt. God, imagine if all CPUs
> did that at approximately the same time. The system would show huge
> jitter.

Well, the first time the SGI guys tried to boot a 1024-CPU Altix, I got
an email complaining about RCU overheads.  ;-)  Manfred Spraul fixed
things up for them, though.

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt


--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

>
> If you do a synchronize_rcu() it might well have to wait through the
> following sequence of states:
>
> Stage 0: (might have to wait through part of this to get out of "next" queue)
>   rcu_try_flip_idle_state,/* "I" */
>   rcu_try_flip_waitack_state, /* "A" */
>   rcu_try_flip_waitzero_state,/* "Z" */
>   rcu_try_flip_waitmb_state   /* "M" */
> Stage 1:
>   rcu_try_flip_idle_state,/* "I" */
>   rcu_try_flip_waitack_state, /* "A" */
>   rcu_try_flip_waitzero_state,/* "Z" */
>   rcu_try_flip_waitmb_state   /* "M" */
> Stage 2:
>   rcu_try_flip_idle_state,/* "I" */
>   rcu_try_flip_waitack_state, /* "A" */
>   rcu_try_flip_waitzero_state,/* "Z" */
>   rcu_try_flip_waitmb_state   /* "M" */
> Stage 3:
>   rcu_try_flip_idle_state,/* "I" */
>   rcu_try_flip_waitack_state, /* "A" */
>   rcu_try_flip_waitzero_state,/* "Z" */
>   rcu_try_flip_waitmb_state   /* "M" */
> Stage 4:
>   rcu_try_flip_idle_state,/* "I" */
>   rcu_try_flip_waitack_state, /* "A" */
>   rcu_try_flip_waitzero_state,/* "Z" */
>   rcu_try_flip_waitmb_state   /* "M" */
>
> So yes, grace periods do indeed have some latency.

Yes they do. I'm now at the point that I'm just "trusting" you that you
understand that each of these stages is needed. My IQ level only lets me
understand next -> wait -> done, but not the extra 3 shifts in wait.

;-)

> >
> > True, but the "me" confused me. Since that task struct is not me ;-)
>
> Well, who is it, then?  ;-)

It's the app I watch sitting there waiting its turn for its callback to
run.

> > > >
> > > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > > local_bh_disable be sufficient here? You already cover NMIs, which would
> > > > also handle normal interrupts.
> > >
> > > This is also my understanding, but I think this disable is an
> > > 'optimization' in that it avoids the regular IRQs from jumping through
> > > these hoops outlined below.
> >
> > But isn't disabling irqs slower than doing a local_bh_disable? So the
> > majority of times (where irqs will not happen) we have this overhead.
>
> The current code absolutely must exclude the scheduling-clock hardirq
> handler.

ACKed,
The reasoning you gave in Peter's reply most certainly makes sense.


> > > > > + *
> > > > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU 
> > > > > implementation
> > > >
> > > > Oh, come now! It's not "nuts" to use this ;-)
> > > >
> > > > > + * on a large SMP, they might want to use a hierarchical 
> > > > > organization of
> > > > > + * the per-CPU-counter pairs.
> > > > > + */
> > >
> > > It's the large SMP case that's nuts, and on that I have to agree with
> > > Paul, it's not really large-SMP friendly.
> >
> > Hmm, that could be true. But on large SMP systems, you usually have
> > large amounts of memory, so hopefully a really long synchronize_rcu
> > would not be a problem.
>
> Somewhere in the range from 64 to a few hundred CPUs, the global lock
> protecting the try_flip state machine would start sucking air pretty
> badly.  But the real problem is synchronize_sched(), which loops through
> all the CPUs --  this would likely cause problems at a few tens of
> CPUs, perhaps as early as 10-20.

hehe, from someone whose largest box is 4 CPUs, to me 16 CPUs is large.
But I can see how hundreds, let alone thousands, of CPUs would bring
things like synchronize_sched to a grinding halt. God, imagine if all CPUs
did that at approximately the same time. The system would show huge
jitter.

-- Steve



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 11:20:48AM -0400, Steven Rostedt wrote:
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > +
> > +/*
> > + * PREEMPT_RCU data structures.
> > + */
> > +
> > +#define GP_STAGES 4
> > +struct rcu_data {
> > +   spinlock_t  lock;   /* Protect rcu_data fields. */
> > +   longcompleted;  /* Number of last completed batch. */
> > +   int waitlistcount;
> > +   struct tasklet_struct rcu_tasklet;
> > +   struct rcu_head *nextlist;
> > +   struct rcu_head **nexttail;
> > +   struct rcu_head *waitlist[GP_STAGES];
> > +   struct rcu_head **waittail[GP_STAGES];
> > +   struct rcu_head *donelist;
> > +   struct rcu_head **donetail;
> > +#ifdef CONFIG_RCU_TRACE
> > +   struct rcupreempt_trace trace;
> > +#endif /* #ifdef CONFIG_RCU_TRACE */
> > +};
> > +struct rcu_ctrlblk {
> > +   spinlock_t  fliplock;   /* Protect state-machine transitions. */
> > +   longcompleted;  /* Number of last completed batch. */
> > +};
> > +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> > +static struct rcu_ctrlblk rcu_ctrlblk = {
> > +   .fliplock = SPIN_LOCK_UNLOCKED,
> > +   .completed = 0,
> > +};
> > +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> > +
> > +/*
> > + * States for rcu_try_flip() and friends.
> > + */
> > +
> > +enum rcu_try_flip_states {
> > +   rcu_try_flip_idle_state,/* "I" */
> > +   rcu_try_flip_waitack_state, /* "A" */
> > +   rcu_try_flip_waitzero_state,/* "Z" */
> > +   rcu_try_flip_waitmb_state   /* "M" */
> > +};
> > +static enum rcu_try_flip_states rcu_try_flip_state = 
> > rcu_try_flip_idle_state;
> > +#ifdef CONFIG_RCU_TRACE
> > +static char *rcu_try_flip_state_names[] =
> > +   { "idle", "waitack", "waitzero", "waitmb" };
> > +#endif /* #ifdef CONFIG_RCU_TRACE */
> 
> [snip]
> 
> > +/*
> > + * If a global counter flip has occurred since the last time that we
> > + * advanced callbacks, advance them.  Hardware interrupts must be
> > + * disabled when calling this function.
> > + */
> > +static void __rcu_advance_callbacks(struct rcu_data *rdp)
> > +{
> > +   int cpu;
> > +   int i;
> > +   int wlc = 0;
> > +
> > +   if (rdp->completed != rcu_ctrlblk.completed) {
> > +   if (rdp->waitlist[GP_STAGES - 1] != NULL) {
> > +   *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
> > +   rdp->donetail = rdp->waittail[GP_STAGES - 1];
> > +   RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
> > +   }
> > +   for (i = GP_STAGES - 2; i >= 0; i--) {
> > +   if (rdp->waitlist[i] != NULL) {
> > +   rdp->waitlist[i + 1] = rdp->waitlist[i];
> > +   rdp->waittail[i + 1] = rdp->waittail[i];
> > +   wlc++;
> > +   } else {
> > +   rdp->waitlist[i + 1] = NULL;
> > +   rdp->waittail[i + 1] =
> > +   >waitlist[i + 1];
> > +   }
> > +   }
> > +   if (rdp->nextlist != NULL) {
> > +   rdp->waitlist[0] = rdp->nextlist;
> > +   rdp->waittail[0] = rdp->nexttail;
> > +   wlc++;
> > +   rdp->nextlist = NULL;
> > +   rdp->nexttail = >nextlist;
> > +   RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
> > +   } else {
> > +   rdp->waitlist[0] = NULL;
> > +   rdp->waittail[0] = >waitlist[0];
> > +   }
> > +   rdp->waitlistcount = wlc;
> > +   rdp->completed = rcu_ctrlblk.completed;
> > +   }
> > +
> > +   /*
> > +* Check to see if this CPU needs to report that it has seen
> > +* the most recent counter flip, thereby declaring that all
> > +* subsequent rcu_read_lock() invocations will respect this flip.
> > +*/
> > +
> > +   cpu = raw_smp_processor_id();
> > +   if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> > +   smp_mb();  /* Subsequent counter accesses must see new value */
> > +   per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> > +   smp_mb();  /* Subsequent RCU read-side critical sections */
> > +  /*  seen -after- acknowledgement. */
> > +   }
> > +}
> 
> [snip]
> 
> > +/*
> > + * Attempt a single flip of the counters.  Remember, a single flip does
> > + * -not- constitute a grace period.  Instead, the interval between
> > + * at least three consecutive flips is a grace period.
> > + *
> > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> > + * on a large SMP, they might want to use a hierarchical organization of
> > + * the per-CPU-counter pairs.
> > + */
> > +static void rcu_try_flip(void)
> > +{
> > +   unsigned long oldirq;
> > +
> > +   RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> > +   if (unlikely(!spin_trylock_irqsave(_ctrlblk.fliplock, oldirq))) {
> > +   

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 06:31:12PM -0400, Steven Rostedt wrote:
> On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> > On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > 
> > 
> > > Can you have a pointer somewhere that explains these states. And not a
> > > "it's in this paper or directory". Either have a short description here,
> > > or specify where exactly to find the information (perhaps a
> > > Documentation/RCU/preemptible_states.txt?).
> > > 
> > > Trying to understand these states has caused me the most agony in
> > > reviewing these patches.
> > > 
> > > > + */
> > > > +
> > > > +enum rcu_try_flip_states {
> > > > +   rcu_try_flip_idle_state,/* "I" */
> > > > +   rcu_try_flip_waitack_state, /* "A" */
> > > > +   rcu_try_flip_waitzero_state,/* "Z" */
> > > > +   rcu_try_flip_waitmb_state   /* "M" */
> > > > +};
> > 
> > I thought the 4 flip states corresponded to the 4 GP stages, but now
> > you confused me. It seems to indeed progress one stage for every 4 flip
> > states.
> 
> I'm still confused ;-)

If you do a synchronize_rcu() it might well have to wait through the
following sequence of states:

Stage 0: (might have to wait through part of this to get out of "next" queue)
rcu_try_flip_idle_state,/* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state,/* "Z" */
rcu_try_flip_waitmb_state   /* "M" */
Stage 1:
rcu_try_flip_idle_state,/* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state,/* "Z" */
rcu_try_flip_waitmb_state   /* "M" */
Stage 2:
rcu_try_flip_idle_state,/* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state,/* "Z" */
rcu_try_flip_waitmb_state   /* "M" */
Stage 3:
rcu_try_flip_idle_state,/* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state,/* "Z" */
rcu_try_flip_waitmb_state   /* "M" */
Stage 4:
rcu_try_flip_idle_state,/* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state,/* "Z" */
rcu_try_flip_waitmb_state   /* "M" */

So yes, grace periods do indeed have some latency.

> > Hmm, now I have to puzzle how these 4 stages are required by the lock
> > and unlock magic.
> > 
> > > > +/*
> > > > + * Return the number of RCU batches processed thus far.  Useful for 
> > > > debug
> > > > + * and statistics.  The _bh variant is identical to straight RCU.
> > > > + */
> > > 
> > > If they are identical, then why the separation?
> > 
> > I guess a smaller RCU domain makes for quicker grace periods.
> 
> No, I mean that both the rcu_batches_completed and
> rcu_batches_completed_bh are identical. Perhaps we can just put in a
> 
> #define rcu_batches_completed_bh rcu_batches_completed
> 
> in rcupreempt.h.  In rcuclassic, they are different. But no need to have
> two identical functions in the preempt version. A macro should do.

Ah!!!  Good point, #define does make sense here.
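
[ Concretely, the suggestion would be something like the following sketch
in rcupreempt.h; the placement is assumed, not taken from a posted patch: ]

extern long rcu_batches_completed(void);
#define rcu_batches_completed_bh  rcu_batches_completed

Given the include order quoted from rcupdate.h above, its existing
"extern long rcu_batches_completed_bh(void);" would then simply expand to
a second, harmless declaration of rcu_batches_completed().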

> > > > +void __rcu_read_lock(void)
> > > > +{
> > > > +   int idx;
> > > > +   struct task_struct *me = current;
> > > 
> > > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > > variable to assign current to.  It's just that "me" throws me off a
> > > little while reviewing this.  But this is just a nitpick, so do as you
> > > will.
> > 
> > struct task_struct *curr = current;
> > 
> > is also not uncommon.
> 
> True, but the "me" confused me. Since that task struct is not me ;-)

Well, who is it, then?  ;-)

> > > > +   int nesting;
> > > > +
> > > > +   nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > > > +   if (nesting != 0) {
> > > > +
> > > > +   /* An earlier rcu_read_lock() covers us, just count it. 
> > > > */
> > > > +
> > > > +   me->rcu_read_lock_nesting = nesting + 1;
> > > > +
> > > > +   } else {
> > > > +   unsigned long oldirq;
> > > 
> > > > +
> > > > +   /*
> > > > +* Disable local interrupts to prevent the grace-period
> > > > +* detection state machine from seeing us half-done.
> > > > +* NMIs can still occur, of course, and might themselves
> > > > +* contain rcu_read_lock().
> > > > +*/
> > > > +
> > > > +   local_irq_save(oldirq);
> > > 
> > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > local_bh_disable be sufficient here? You already cover NMIs, which would
> > > also handle normal interrupts.
> > 
> > This is also my understanding, but I think this disable is an
> > 'optimization' in that it avoids the regular IRQs from jumping through
> > these hoops outlined 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt
On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> wrote:
> 
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> 
> 
> > Can you have a pointer somewhere that explains these states. And not a
> > "it's in this paper or directory". Either have a short description here,
> > or specify where exactly to find the information (perhaps a
> > Documentation/RCU/preemptible_states.txt?).
> > 
> > Trying to understand these states has caused me the most agony in
> > reviewing these patches.
> > 
> > > + */
> > > +
> > > +enum rcu_try_flip_states {
> > > + rcu_try_flip_idle_state,/* "I" */
> > > + rcu_try_flip_waitack_state, /* "A" */
> > > + rcu_try_flip_waitzero_state,/* "Z" */
> > > + rcu_try_flip_waitmb_state   /* "M" */
> > > +};
> 
> I thought the 4 flip states corresponded to the 4 GP stages, but now
> you confused me. It seems to indeed progress one stage for every 4 flip
> states.

I'm still confused ;-)

> 
> Hmm, now I have to puzzle how these 4 stages are required by the lock
> and unlock magic.
> 
> > > +/*
> > > + * Return the number of RCU batches processed thus far.  Useful for debug
> > > + * and statistics.  The _bh variant is identical to straight RCU.
> > > + */
> > 
> > If they are identical, then why the separation?
> 
> I guess a smaller RCU domain makes for quicker grace periods.

No, I mean that both the rcu_batches_completed and
rcu_batches_completed_bh are identical. Perhaps we can just put in a

#define rcu_batches_completed_bh rcu_batches_completed

in rcupreempt.h.  In rcuclassic, they are different. But no need to have
two identical functions in the preempt version. A macro should do.

>  
> > > +void __rcu_read_lock(void)
> > > +{
> > > + int idx;
> > > + struct task_struct *me = current;
> > 
> > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > variable to assign current to.  It's just that "me" throws me off a
> > little while reviewing this.  But this is just a nitpick, so do as you
> > will.
> 
> struct task_struct *curr = current;
> 
> is also not uncommon.

True, but the "me" confused me. Since that task struct is not me ;-)

>  
> > > + int nesting;
> > > +
> > > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > > + if (nesting != 0) {
> > > +
> > > + /* An earlier rcu_read_lock() covers us, just count it. */
> > > +
> > > + me->rcu_read_lock_nesting = nesting + 1;
> > > +
> > > + } else {
> > > + unsigned long oldirq;
> > 
> > > +
> > > + /*
> > > +  * Disable local interrupts to prevent the grace-period
> > > +  * detection state machine from seeing us half-done.
> > > +  * NMIs can still occur, of course, and might themselves
> > > +  * contain rcu_read_lock().
> > > +  */
> > > +
> > > + local_irq_save(oldirq);
> > 
> > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > local_bh_disable be sufficient here? You already cover NMIs, which would
> > also handle normal interrupts.
> 
> This is also my understanding, but I think this disable is an
> 'optimization' in that it avoids the regular IRQs from jumping through
> these hoops outlined below.

But isn't disabling irqs slower than doing a local_bh_disable? So the
majority of times (where irqs will not happen) we have this overhead.

> 
> > > +
> > > + /*
> > > +  * Outermost nesting of rcu_read_lock(), so increment
> > > +  * the current counter for the current CPU.  Use volatile
> > > +  * casts to prevent the compiler from reordering.
> > > +  */
> > > +
> > > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > > + smp_read_barrier_depends();  /*  might be unneeded */
> > > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > > +
> > > + /*
> > > +  * Now that the per-CPU counter has been incremented, we
> > > +  * are protected from races with rcu_read_lock() invoked
> > > +  * from NMI handlers on this CPU.  We can therefore safely
> > > +  * increment the nesting counter, relieving further NMIs
> > > +  * of the need to increment the per-CPU counter.
> > > +  */
> > > +
> > > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > > +
> > > + /*
> > > +  * Now that we have preventing any NMIs from storing
> > > +  * to the ->rcu_flipctr_idx, we can safely use it to
> > > +  * remember which counter to decrement in the matching
> > > +  * rcu_read_unlock().
> > > +  */
> > > +
> > > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > > + local_irq_restore(oldirq);
> > > + }
> > > +}
> 
> > > +/*
> > > + * Attempt a single flip of the counters.  Remember, a single flip does
> > > + * -not- constitute a grace period.  Instead, the interval between
> > > + * at 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> wrote:
>
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> 
> > Can you have a pointer somewhere that explains these states. And not a
> > "it's in this paper or directory". Either have a short description here,
> > or specify where exactly to find the information (perhaps a
> > Documentation/RCU/preemptible_states.txt?).
> > 
> > Trying to understand these states has caused me the most agony in
> > reviewing these patches.
> > 
> > > + */
> > > +
> > > +enum rcu_try_flip_states {
> > > + rcu_try_flip_idle_state,/* "I" */
> > > + rcu_try_flip_waitack_state, /* "A" */
> > > + rcu_try_flip_waitzero_state,/* "Z" */
> > > + rcu_try_flip_waitmb_state   /* "M" */
> > > +};
> 
> I thought the 4 flip states corresponded to the 4 GP stages, but now
> you confused me. It seems to indeed progress one stage for every 4 flip
> states.

Yes, four flip states per stage, and four stages per grace period:

rcu_try_flip_idle_state:    Stay here if nothing is happening.
Flip the counter if something starts
happening.

rcu_try_flip_waitack_state: Wait here for all CPUs to notice that
the counter has flipped.  This prevents
the old set of counters from ever being
incremented once we leave this state,
which in turn is necessary because we
cannot test any individual counter for
zero -- we can only check the sum.

rcu_try_flip_waitzero_state:    Wait here for the sum of the old
per-CPU counters to reach zero.

rcu_try_flip_waitmb_state:  Wait here for each of the other
CPUs to execute a memory barrier.
This is necessary to ensure that
these other CPUs really have finished
executing their RCU read-side
critical sections, despite their
CPUs wildly reordering memory.

This state 

> Hmm, now I have to puzzle how these 4 stages are required by the lock
> and unlock magic.

They are needed to allow memory barriers to be removed from
rcu_read_lock(), to allow for the fact that the stages mean that
grace-period boundaries are now fuzzy, as well as to account for the
fact that each CPU now has its own callback queue.  (Aside: classic RCU
doesn't have memory barriers -- or anything else -- in rcu_read_lock()
and rcu_read_unlock(), but from an implementation viewpoint, the read-side
critical section extends forwards and backwards to the next/previous
context switches, which do have lots of memory barriers.)

Taking the reasons for stages one at a time:

1.  The first stage is needed for generic RCU.  Everyone must
complete any pre-existing RCU read-side critical sections
before the grace period can be permitted to complete.  This
stage corresponds to the "waitlist" and "waittail" in earlier
RCU implementations.

2.  An additional stage is needed due to the fact that memory reordering
can cause an RCU read-side critical section's rcu_dereference()
to be executed before the rcu_read_lock() -- no memory barriers!

This can result in the following sequence of events:

o   rcu_dereference() from CPU 0.

o   list_del_rcu() from CPU 1.

o   call_rcu() from CPU 1.

o   counter flip from CPU 2.

o   rcu_read_lock() from CPU 0 (delayed due to memory reordering).

CPU 0 increments the new counter, not the old one that
its rcu_dereference() is associated with.

o   All CPUs acknowledge the counter flip.

o   CPU 2 discovers that the old counters sum to zero (because
CPU 0 has incremented its new counter).

o   All CPUs do their memory barriers.

o   CPU 1's callback is invoked, destroying the memory still
being used by CPU 0.

An additional pass through the state machine fixes this problem.

Note that the read-after-write misordering can happen on x86,
as well as on s390.

3.  An additional stage is also needed due to the fact that each CPU
has its own set of callback queues.  This means that callbacks
that were registered quite some time after the most recent counter
flip can make it into the current round of the state machine,
as follows:

o   CPU 0 flips the counter.

o   CPU 1 does an rcu_read_lock() followed by an rcu_dereference().
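
[ The waitzero state described earlier in this message is the subtle one:
because a preempted reader can migrate, it may increment the counter pair
on one CPU and decrement it on another, so no individual counter need ever
return to zero -- only the sum does.  A toy user-space model of just that
bookkeeping (a sketch, not the patch's code; NR_CPUS and the variable
names are illustrative): ]

#include <stdio.h>

#define NR_CPUS 4

static int rcu_flipctr[NR_CPUS][2];	/* stand-in for the per-CPU pairs  */
static long completed;			/* low bit selects the current idx */

int main(void)
{
	int idx = completed & 0x1;
	int sum = 0;
	int cpu;

	rcu_flipctr[0][idx]++;		/* reader enters its critical
					 * section while running on CPU 0  */

	completed++;			/* the grace-period machinery flips
					 * the counter: new readers now use
					 * the other index		    */

	rcu_flipctr[2][idx]--;		/* the reader, preempted and later
					 * migrated to CPU 2, exits using
					 * the index it remembered	    */

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += rcu_flipctr[cpu][idx];

	/* CPU 0's old counter is +1 and CPU 2's is -1: neither is zero
	 * on its own, but the sum is, so waitzero can complete. */
	printf("old-index counters: cpu0 = %d, cpu2 = %d, sum = %d\n",
	       rcu_flipctr[0][idx], rcu_flipctr[2][idx], sum);
	return 0;
}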
 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Peter Zijlstra
On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[EMAIL PROTECTED]>
wrote:

> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:


> Can you have a pointer somewhere that explains these states. And not a
> "it's in this paper or directory". Either have a short description here,
> or specify where exactly to find the information (perhaps a
> Documentation/RCU/preemptible_states.txt?).
> 
> Trying to understand these states has caused me the most agony in
> reviewing these patches.
> 
> > + */
> > +
> > +enum rcu_try_flip_states {
> > +   rcu_try_flip_idle_state,/* "I" */
> > +   rcu_try_flip_waitack_state, /* "A" */
> > +   rcu_try_flip_waitzero_state,/* "Z" */
> > +   rcu_try_flip_waitmb_state   /* "M" */
> > +};

I thought the 4 flip states corresponded to the 4 GP stages, but now
you confused me. It seems to indeed progress one stage for every 4 flip
states.

Hmm, now I have to puzzle how these 4 stages are required by the lock
and unlock magic.

> > +/*
> > + * Return the number of RCU batches processed thus far.  Useful for debug
> > + * and statistics.  The _bh variant is identical to straight RCU.
> > + */
> 
> If they are identical, then why the separation?

I guess a smaller RCU domain makes for quicker grace periods.
 
> > +void __rcu_read_lock(void)
> > +{
> > +   int idx;
> > +   struct task_struct *me = current;
> 
> Nitpick, but other places in the kernel usually use "t" or "p" as a
> variable to assign current to.  It's just that "me" throws me off a
> little while reviewing this.  But this is just a nitpick, so do as you
> will.

struct task_struct *curr = current;

is also not uncommon.
 
> > +   int nesting;
> > +
> > +   nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > +   if (nesting != 0) {
> > +
> > +   /* An earlier rcu_read_lock() covers us, just count it. */
> > +
> > +   me->rcu_read_lock_nesting = nesting + 1;
> > +
> > +   } else {
> > +   unsigned long oldirq;
> 
> > +
> > +   /*
> > +* Disable local interrupts to prevent the grace-period
> > +* detection state machine from seeing us half-done.
> > +* NMIs can still occur, of course, and might themselves
> > +* contain rcu_read_lock().
> > +*/
> > +
> > +   local_irq_save(oldirq);
> 
> Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> local_bh_disable be sufficient here? You already cover NMIs, which would
> also handle normal interrupts.

This is also my understanding, but I think this disable is an
'optimization' in that it avoids the regular IRQs from jumping through
these hoops outlined below.

> > +
> > +   /*
> > +* Outermost nesting of rcu_read_lock(), so increment
> > +* the current counter for the current CPU.  Use volatile
> > +* casts to prevent the compiler from reordering.
> > +*/
> > +
> > +   idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > +   smp_read_barrier_depends();  /*  might be unneeded */
> > +   ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > +
> > +   /*
> > +* Now that the per-CPU counter has been incremented, we
> > +* are protected from races with rcu_read_lock() invoked
> > +* from NMI handlers on this CPU.  We can therefore safely
> > +* increment the nesting counter, relieving further NMIs
> > +* of the need to increment the per-CPU counter.
> > +*/
> > +
> > +   ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > +
> > +   /*
> > +* Now that we have preventing any NMIs from storing
> > +* to the ->rcu_flipctr_idx, we can safely use it to
> > +* remember which counter to decrement in the matching
> > +* rcu_read_unlock().
> > +*/
> > +
> > +   ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > +   local_irq_restore(oldirq);
> > +   }
> > +}

> > +/*
> > + * Attempt a single flip of the counters.  Remember, a single flip does
> > + * -not- constitute a grace period.  Instead, the interval between
> > + * at least three consecutive flips is a grace period.
> > + *
> > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> 
> Oh, come now! It's not "nuts" to use this ;-)
> 
> > + * on a large SMP, they might want to use a hierarchical organization of
> > + * the per-CPU-counter pairs.
> > + */

It's the large SMP case that's nuts, and on that I have to agree with
Paul, it's not really large-SMP friendly.
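
[ For what it's worth, the excerpts above never show the definition of
ORDERED_WRT_IRQ(); given the "volatile casts to prevent the compiler from
reordering" comment, a plausible definition -- an assumption, not quoted
from the patch -- would be: ]

/*
 * Force the compiler to emit exactly one load or store of x at this
 * point, so that the access is neither re-fetched, cached in a
 * register, nor reordered by the compiler with respect to interrupt
 * and NMI handlers running on the same CPU.
 */
#define ORDERED_WRT_IRQ(x)  (*(volatile typeof(x) *)&(x))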




Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt
On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> +
> +/*
> + * PREEMPT_RCU data structures.
> + */
> +
> +#define GP_STAGES 4
> +struct rcu_data {
> + spinlock_t  lock;   /* Protect rcu_data fields. */
> + longcompleted;  /* Number of last completed batch. */
> + int waitlistcount;
> + struct tasklet_struct rcu_tasklet;
> + struct rcu_head *nextlist;
> + struct rcu_head **nexttail;
> + struct rcu_head *waitlist[GP_STAGES];
> + struct rcu_head **waittail[GP_STAGES];
> + struct rcu_head *donelist;
> + struct rcu_head **donetail;
> +#ifdef CONFIG_RCU_TRACE
> + struct rcupreempt_trace trace;
> +#endif /* #ifdef CONFIG_RCU_TRACE */
> +};
> +struct rcu_ctrlblk {
> + spinlock_t  fliplock;   /* Protect state-machine transitions. */
> + longcompleted;  /* Number of last completed batch. */
> +};
> +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> +static struct rcu_ctrlblk rcu_ctrlblk = {
> + .fliplock = SPIN_LOCK_UNLOCKED,
> + .completed = 0,
> +};
> +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> +
> +/*
> + * States for rcu_try_flip() and friends.
> + */
> +
> +enum rcu_try_flip_states {
> + rcu_try_flip_idle_state,/* "I" */
> + rcu_try_flip_waitack_state, /* "A" */
> + rcu_try_flip_waitzero_state,/* "Z" */
> + rcu_try_flip_waitmb_state   /* "M" */
> +};
> +static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state;
> +#ifdef CONFIG_RCU_TRACE
> +static char *rcu_try_flip_state_names[] =
> + { "idle", "waitack", "waitzero", "waitmb" };
> +#endif /* #ifdef CONFIG_RCU_TRACE */

[snip]

> +/*
> + * If a global counter flip has occurred since the last time that we
> + * advanced callbacks, advance them.  Hardware interrupts must be
> + * disabled when calling this function.
> + */
> +static void __rcu_advance_callbacks(struct rcu_data *rdp)
> +{
> + int cpu;
> + int i;
> + int wlc = 0;
> +
> + if (rdp->completed != rcu_ctrlblk.completed) {
> + if (rdp->waitlist[GP_STAGES - 1] != NULL) {
> + *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
> + rdp->donetail = rdp->waittail[GP_STAGES - 1];
> + RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
> + }
> + for (i = GP_STAGES - 2; i >= 0; i--) {
> + if (rdp->waitlist[i] != NULL) {
> + rdp->waitlist[i + 1] = rdp->waitlist[i];
> + rdp->waittail[i + 1] = rdp->waittail[i];
> + wlc++;
> + } else {
> + rdp->waitlist[i + 1] = NULL;
> + rdp->waittail[i + 1] =
> + &rdp->waitlist[i + 1];
> + }
> + }
> + if (rdp->nextlist != NULL) {
> + rdp->waitlist[0] = rdp->nextlist;
> + rdp->waittail[0] = rdp->nexttail;
> + wlc++;
> + rdp->nextlist = NULL;
> + rdp->nexttail = &rdp->nextlist;
> + RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
> + } else {
> + rdp->waitlist[0] = NULL;
> + rdp->waittail[0] = &rdp->waitlist[0];
> + }
> + rdp->waitlistcount = wlc;
> + rdp->completed = rcu_ctrlblk.completed;
> + }
> +
> + /*
> +  * Check to see if this CPU needs to report that it has seen
> +  * the most recent counter flip, thereby declaring that all
> +  * subsequent rcu_read_lock() invocations will respect this flip.
> +  */
> +
> + cpu = raw_smp_processor_id();
> + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> + smp_mb();  /* Subsequent counter accesses must see new value */
> + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> + smp_mb();  /* Subsequent RCU read-side critical sections */
> +/*  seen -after- acknowledgement. */
> + }
> +}

[snip]

> +/*
> + * Attempt a single flip of the counters.  Remember, a single flip does
> + * -not- constitute a grace period.  Instead, the interval between
> + * at least three consecutive flips is a grace period.
> + *
> + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> + * on a large SMP, they might want to use a hierarchical organization of
> + * the per-CPU-counter pairs.
> + */
> +static void rcu_try_flip(void)
> +{
> + unsigned long oldirq;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> + return;
> + }
> +
> + /*
> +  * Take the next transition(s) through the RCU grace-period
> +  * flip-counter 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt
On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>  
>  #endif /* __KERNEL__ */
>  #endif /* __LINUX_RCUCLASSIC_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
> linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h   2007-07-19 
> 14:02:36.0 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h2007-08-22 
> 15:21:06.0 -0700
> @@ -52,7 +52,11 @@ struct rcu_head {
>   void (*func)(struct rcu_head *head);
>  };
>  
> +#ifdef CONFIG_CLASSIC_RCU
>  #include <linux/rcuclassic.h>
> +#else /* #ifdef CONFIG_CLASSIC_RCU */
> +#include <linux/rcupreempt.h>
> +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */

A bit extreme on the comments here.


>  
>  #define RCU_HEAD_INIT{ .next = NULL, .func = NULL }
>  #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
>  /* Exported common interfaces */
>  extern void synchronize_rcu(void);
>  extern void rcu_barrier(void);
> +extern long rcu_batches_completed(void);
> +extern long rcu_batches_completed_bh(void);
>  
>  /* Internal to kernel */
>  extern void rcu_init(void);
>  extern void rcu_check_callbacks(int cpu, int user);
> +extern int rcu_needs_cpu(int cpu);
>  
>  #endif /* __KERNEL__ */
>  #endif /* __LINUX_RCUPDATE_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 
> linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 
> 16:00:00.0 -0800
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h  2007-08-22 
> 15:21:06.0 -0700
> @@ -0,0 +1,78 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2006
> + *
> + * Author:  Paul McKenney <[EMAIL PROTECTED]>
> + *
> + * Based on the original work by Paul McKenney <[EMAIL PROTECTED]>
> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> + * Papers:
> + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> + *
> + * For detailed explanation of Read-Copy Update mechanism see -
> + *   Documentation/RCU
> + *
> + */
> +
> +#ifndef __LINUX_RCUPREEMPT_H
> +#define __LINUX_RCUPREEMPT_H
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/cache.h>
> +#include <linux/spinlock.h>
> +#include <linux/threads.h>
> +#include <linux/percpu.h>
> +#include <linux/cpumask.h>
> +#include <linux/seqlock.h>
> +
> +#define rcu_qsctr_inc(cpu)
> +#define rcu_bh_qsctr_inc(cpu)
> +#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
> +
> +extern void __rcu_read_lock(void);
> +extern void __rcu_read_unlock(void);
> +extern int rcu_pending(int cpu);
> +extern int rcu_needs_cpu(int cpu);
> +
> +#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
> +#define __rcu_read_unlock_bh()   { local_bh_enable(); rcu_read_unlock(); 
> }
> +
> +#define __rcu_read_lock_nesting()(current->rcu_read_lock_nesting)
> +
> +extern void __synchronize_sched(void);
> +
> +extern void __rcu_init(void);
> +extern void rcu_check_callbacks(int cpu, int user);
> +extern void rcu_restart_cpu(int cpu);
> +
> +#ifdef CONFIG_RCU_TRACE
> +struct rcupreempt_trace;
> +extern int *rcupreempt_flipctr(int cpu);
> +extern long rcupreempt_data_completed(void);
> +extern int rcupreempt_flip_flag(int cpu);
> +extern int rcupreempt_mb_flag(int cpu);
> +extern char *rcupreempt_try_flip_state_name(void);
> +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
> +#endif
> +
> +struct softirq_action;
> +
> +#endif /* __KERNEL__ */
> +#endif /* __LINUX_RCUPREEMPT_H */
> diff -urpNa -X dontdiff 
> linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h 
> linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h   
> 1969-12-31 16:00:00.0 -0800
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h
> 2007-08-22 15:21:06.0 -0700
> @@ -0,0 +1,100 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it 


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Peter Zijlstra
On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt [EMAIL PROTECTED]
wrote:

 On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:


 Can you have a pointer somewhere that explains these states. And not a
 it's in this paper or directory. Either have a short discription here,
 or specify where exactly to find the information (perhaps a
 Documentation/RCU/preemptible_states.txt?).
 
 Trying to understand these states has caused me the most agony in
 reviewing these patches.
 
  + */
  +
  +enum rcu_try_flip_states {
  +   rcu_try_flip_idle_state,/* I */
  +   rcu_try_flip_waitack_state, /* A */
  +   rcu_try_flip_waitzero_state,/* Z */
  +   rcu_try_flip_waitmb_state   /* M */
  +};

I thought the 4 flip states corresponded to the 4 GP stages, but now
you confused me. It seems to indeed progress one stage for every 4 flip
states.

Hmm, now I have to puzzle how these 4 stages are required by the lock
and unlock magic.

  +/*
  + * Return the number of RCU batches processed thus far.  Useful for debug
  + * and statistics.  The _bh variant is identical to straight RCU.
  + */
 
 If they are identical, then why the separation?

I guess a smaller RCU domain makes for quicker grace periods.
 
  +void __rcu_read_lock(void)
  +{
  +   int idx;
  +   struct task_struct *me = current;
 
 Nitpick, but other places in the kernel usually use t or p as a
 variable to assign current to.  It's just that me thows me off a
 little while reviewing this.  But this is just a nitpick, so do as you
 will.

struct task_struct *curr = current;

is also not uncommon.
 
  +   int nesting;
  +
  +   nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
  +   if (nesting != 0) {
  +
  +   /* An earlier rcu_read_lock() covers us, just count it. */
  +
  +   me->rcu_read_lock_nesting = nesting + 1;
  +
  +   } else {
  +   unsigned long oldirq;
 
  +
  +   /*
  +* Disable local interrupts to prevent the grace-period
  +* detection state machine from seeing us half-done.
  +* NMIs can still occur, of course, and might themselves
  +* contain rcu_read_lock().
  +*/
  +
  +   local_irq_save(oldirq);
 
 Isn't the GP detection done via a tasklet/softirq. So wouldn't a
 local_bh_disable be sufficient here? You already cover NMIs, which would
 also handle normal interrupts.

This is also my understanding, but I think this disable is an
'optimization' in that it avoids the regular IRQs from jumping through
these hoops outlined below.

  +
  +   /*
  +* Outermost nesting of rcu_read_lock(), so increment
  +* the current counter for the current CPU.  Use volatile
  +* casts to prevent the compiler from reordering.
  +*/
  +
  +   idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
  +   smp_read_barrier_depends();  /*  might be unneeded */
  +   ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
  +
  +   /*
  +* Now that the per-CPU counter has been incremented, we
  +* are protected from races with rcu_read_lock() invoked
  +* from NMI handlers on this CPU.  We can therefore safely
  +* increment the nesting counter, relieving further NMIs
  +* of the need to increment the per-CPU counter.
  +*/
  +
  +   ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
  +
  +   /*
  +* Now that we have preventing any NMIs from storing
  +* to the ->rcu_flipctr_idx, we can safely use it to
  +* remember which counter to decrement in the matching
  +* rcu_read_unlock().
  +*/
  +
  +   ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
  +   local_irq_restore(oldirq);
  +   }
  +}
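
To see what those two counters buy, here is a toy, single-threaded model of
the flipctr scheme the code above implements: a reader bumps whichever
counter the low bit of completed selects and remembers which one, and the
updater flips completed and then only has to wait for the old counter to
drain to zero.  Everything below is invented for illustration (one counter
pair instead of per-CPU pairs, no NMIs, no memory ordering):

#include <assert.h>
#include <stdio.h>

static int flipctr[2];          /* stand-in for the per-CPU rcu_flipctr pairs */
static long completed;          /* low bit selects the "current" counter */

struct toy_reader {
    int idx;                    /* which counter this reader incremented */
};

static void toy_read_lock(struct toy_reader *r)
{
    r->idx = completed & 0x1;   /* like rcu_ctrlblk.completed & 0x1 */
    flipctr[r->idx]++;
}

static void toy_read_unlock(struct toy_reader *r)
{
    flipctr[r->idx]--;          /* always the counter we remembered */
}

static int old_counter_drained(void)
{
    return flipctr[!(completed & 0x1)] == 0;
}

int main(void)
{
    struct toy_reader early, late;

    toy_read_lock(&early);              /* starts before the flip */
    completed++;                        /* the grace-period counter flip */
    toy_read_lock(&late);               /* starts after, uses the new counter */

    assert(!old_counter_drained());     /* "early" still pins the old counter */
    toy_read_unlock(&early);
    assert(old_counter_drained());      /* old side drains even though "late"
                                           is still in its critical section */
    toy_read_unlock(&late);
    printf("old readers drain independently of new ones\n");
    return 0;
}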

  +/*
  + * Attempt a single flip of the counters.  Remember, a single flip does
  + * -not- constitute a grace period.  Instead, the interval between
  + * at least three consecutive flips is a grace period.
  + *
  + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
 
 Oh, come now! It's not nuts to use this ;-)
 
  + * on a large SMP, they might want to use a hierarchical organization of
  + * the per-CPU-counter pairs.
  + */

Its the large SMP case that's nuts, and on that I have to agree with
Paul, its not really large SMP friendly.




Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
 On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt [EMAIL PROTECTED]
 wrote:

  On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
 
  Can you have a pointer somewhere that explains these states. And not a
  it's in this paper or directory. Either have a short discription here,
  or specify where exactly to find the information (perhaps a
  Documentation/RCU/preemptible_states.txt?).
  
  Trying to understand these states has caused me the most agony in
  reviewing these patches.
  
   + */
   +
   +enum rcu_try_flip_states {
   + rcu_try_flip_idle_state,/* I */
   + rcu_try_flip_waitack_state, /* A */
   + rcu_try_flip_waitzero_state,/* Z */
   + rcu_try_flip_waitmb_state   /* M */
   +};
 
 I thought the 4 flip states corresponded to the 4 GP stages, but now
 you confused me. It seems to indeed progress one stage for every 4 flip
 states.

Yes, four flip states per stage, and four stages per grace period:

rcu_try_flip_idle_state:Stay here if nothing is happening.
Flip the counter if something starts
happening.

rcu_try_flip_waitack_state: Wait here for all CPUs to notice that
the counter has flipped.  This prevents
the old set of counters from ever being
incremented once we leave this state,
which in turn is necessary because we
cannot test any individual counter for
zero -- we can only check the sum.

rcu_try_flip_waitzero_state:Wait here for the sum of the old
per-CPU counters to reach zero.

rcu_try_flip_waitmb_state:  Wait here for each of the other
CPUs to execute a memory barrier.
This is necessary to ensure that
these other CPUs really have finished
executing their RCU read-side
critical sections, despite their
CPUs wildly reordering memory.

This state 

 Hmm, now I have to puzzle how these 4 stages are required by the lock
 and unlock magic.

They are needed to allow memory barriers to be removed from
rcu_read_lock(), to allow for the fact that the stages mean that
grace-period boundaries are now fuzzy, as well as to account for the
fact that each CPU now has its own callback queue.  (Aside: classic RCU
doesn't have memory barriers -- or anything else -- in rcu_read_lock()
and rcu_read_unlock(), but from an implementation viewpoint, the read-side
critical section extends forwards and backwards to the next/previous
context switches, which do have lots of memory barriers.)
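
Strung together, each pass through the state machine advances at most one of
those states.  A compressed user-space sketch of the progression -- the
always-true stubs below stand in for the real per-CPU acknowledgement flags,
the old-counter sum test and the per-CPU memory-barrier flags, and are not
real interfaces:

#include <stdio.h>

enum toy_flip_state { TOY_IDLE, TOY_WAITACK, TOY_WAITZERO, TOY_WAITMB };

static enum toy_flip_state state = TOY_IDLE;
static long completed;

/* Stubs: the patch polls per-CPU rcu_flip_flag, the summed old rcu_flipctr
   values, and per-CPU rcu_mb_flag, respectively. */
static int all_cpus_acked_flip(void)   { return 1; }
static int old_counters_sum_zero(void) { return 1; }
static int all_cpus_did_mb(void)       { return 1; }
static int work_pending(void)          { return 1; }

/* One pass, loosely modeled on rcu_try_flip(). */
static void toy_try_flip(void)
{
    switch (state) {
    case TOY_IDLE:
        if (work_pending()) {
            completed++;                /* the counter flip itself */
            state = TOY_WAITACK;
        }
        break;
    case TOY_WAITACK:
        if (all_cpus_acked_flip())      /* old counters can no longer grow */
            state = TOY_WAITZERO;
        break;
    case TOY_WAITZERO:
        if (old_counters_sum_zero())    /* pre-existing readers are done */
            state = TOY_WAITMB;
        break;
    case TOY_WAITMB:
        if (all_cpus_did_mb())          /* their reads can't leak past here */
            state = TOY_IDLE;
        break;
    }
}

int main(void)
{
    int passes = 0;

    do {
        toy_try_flip();
        passes++;
    } while (state != TOY_IDLE);
    printf("one I->A->Z->M cycle took %d passes, completed=%ld\n",
           passes, completed);          /* 4 passes, one flip per cycle */
    return 0;
}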

Taking the reasons for stages one at a time:

1.  The first stage is needed for generic RCU.  Everyone must
complete any pre-existing RCU read-side critical sections
before the grace period can be permitted to complete.  This
stage corresponds to the waitlist and waittail in earlier
RCU implementations.

2.  An additional stage is needed due to the fact that memory reordering
can cause an RCU read-side critical section's rcu_dereference()
to be executed before the rcu_read_lock() -- no memory barriers!

This can result in the following sequence of events:

o   rcu_dereference() from CPU 0.

o   list_del_rcu() from CPU 1.

o   call_rcu() from CPU 1.

o   counter flip from CPU 2.

o   rcu_read_lock() from CPU 0 (delayed due to memory reordering).

CPU 0 increments the new counter, not the old one that
its rcu_dereference() is associated with.

o   All CPUs acknowledge the counter flip.

o   CPU 2 discovers that the old counters sum to zero (because
CPU 0 has incremented its new counter).

o   All CPUs do their memory barriers.

o   CPU 1's callback is invoked, destroying the memory still
being used by CPU 0.

An additional pass through the state machine fixes this problem.

Note that the read-after-write misordering can happen on x86,
as well as on s390.

3.  An additional stage is also needed due to the fact that each CPU
has its own set of callback queues.  This means that callbacks
that were registered quite some time after the most recent counter
flip can make it into the current round of the state machine,
as follows:

o   CPU 0 flips the counter.

o   CPU 1 does an rcu_read_lock() followed by an rcu_dereference().
CPU 1 of course uses the new value of the counter.


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt
On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
 On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt [EMAIL PROTECTED]
 wrote:
 
  On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
 
 
  Can you have a pointer somewhere that explains these states. And not a
  it's in this paper or directory. Either have a short discription here,
  or specify where exactly to find the information (perhaps a
  Documentation/RCU/preemptible_states.txt?).
  
  Trying to understand these states has caused me the most agony in
  reviewing these patches.
  
   + */
   +
   +enum rcu_try_flip_states {
   + rcu_try_flip_idle_state,/* I */
   + rcu_try_flip_waitack_state, /* A */
   + rcu_try_flip_waitzero_state,/* Z */
   + rcu_try_flip_waitmb_state   /* M */
   +};
 
 I thought the 4 flip states corresponded to the 4 GP stages, but now
 you confused me. It seems to indeed progress one stage for every 4 flip
 states.

I'm still confused ;-)

 
 Hmm, now I have to puzzle how these 4 stages are required by the lock
 and unlock magic.
 
   +/*
   + * Return the number of RCU batches processed thus far.  Useful for debug
   + * and statistics.  The _bh variant is identical to straight RCU.
   + */
  
  If they are identical, then why the separation?
 
 I guess a smaller RCU domain makes for quicker grace periods.

No, I mean that both the rcu_batches_completed and
rcu_batches_completed_bh are identical. Perhaps we can just put in a

#define rcu_batches_completed_bh rcu_batches_completed

in rcupreempt.h.  In rcuclassic, they are different. But no need to have
two identical functions in the preempt version. A macro should do.
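
Something along these lines in rcupreempt.h would do it (a sketch of the
suggestion, not the eventual patch):

extern long rcu_batches_completed(void);
#define rcu_batches_completed_bh rcu_batches_completed

Callers of rcu_batches_completed_bh() then get the one real function, while
rcuclassic.h keeps its two genuinely different implementations.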

  
   +void __rcu_read_lock(void)
   +{
   + int idx;
   + struct task_struct *me = current;
  
  Nitpick, but other places in the kernel usually use t or p as a
  variable to assign current to.  It's just that me thows me off a
  little while reviewing this.  But this is just a nitpick, so do as you
  will.
 
 struct task_struct *curr = current;
 
 is also not uncommon.

True, but the me confused me. Since that task struct is not me ;-)

  
   + int nesting;
   +
   + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
   + if (nesting != 0) {
   +
   + /* An earlier rcu_read_lock() covers us, just count it. */
   +
   + me->rcu_read_lock_nesting = nesting + 1;
   +
   + } else {
   + unsigned long oldirq;
  
   +
   + /*
   +  * Disable local interrupts to prevent the grace-period
   +  * detection state machine from seeing us half-done.
   +  * NMIs can still occur, of course, and might themselves
   +  * contain rcu_read_lock().
   +  */
   +
   + local_irq_save(oldirq);
  
  Isn't the GP detection done via a tasklet/softirq. So wouldn't a
  local_bh_disable be sufficient here? You already cover NMIs, which would
  also handle normal interrupts.
 
 This is also my understanding, but I think this disable is an
 'optimization' in that it avoids the regular IRQs from jumping through
 these hoops outlined below.

But isn't disabling irqs slower than doing a local_bh_disable? So the
majority of times (where irqs will not happen) we have this overhead.

 
   +
   + /*
   +  * Outermost nesting of rcu_read_lock(), so increment
   +  * the current counter for the current CPU.  Use volatile
   +  * casts to prevent the compiler from reordering.
   +  */
   +
   + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
   + smp_read_barrier_depends();  /*  might be unneeded */
   + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
   +
   + /*
   +  * Now that the per-CPU counter has been incremented, we
   +  * are protected from races with rcu_read_lock() invoked
   +  * from NMI handlers on this CPU.  We can therefore safely
   +  * increment the nesting counter, relieving further NMIs
   +  * of the need to increment the per-CPU counter.
   +  */
   +
   + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
   +
   + /*
   +  * Now that we have preventing any NMIs from storing
   +  * to the ->rcu_flipctr_idx, we can safely use it to
   +  * remember which counter to decrement in the matching
   +  * rcu_read_unlock().
   +  */
   +
   + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
   + local_irq_restore(oldirq);
   + }
   +}
 
   +/*
   + * Attempt a single flip of the counters.  Remember, a single flip does
   + * -not- constitute a grace period.  Instead, the interval between
   + * at least three consecutive flips is a grace period.
   + *
   + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
  
  Oh, come now! It's not nuts to use this ;-)
  
   + * on a large SMP, they might want to use a hierarchical organization of
   + * the per-CPU-counter pairs.
   + */
 
 Its 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 06:31:12PM -0400, Steven Rostedt wrote:
 On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
  On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt [EMAIL PROTECTED]
  wrote:
  
   On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
  
  
   Can you have a pointer somewhere that explains these states. And not a
   it's in this paper or directory. Either have a short discription here,
   or specify where exactly to find the information (perhaps a
   Documentation/RCU/preemptible_states.txt?).
   
   Trying to understand these states has caused me the most agony in
   reviewing these patches.
   
+ */
+
+enum rcu_try_flip_states {
+   rcu_try_flip_idle_state,/* I */
+   rcu_try_flip_waitack_state, /* A */
+   rcu_try_flip_waitzero_state,/* Z */
+   rcu_try_flip_waitmb_state   /* M */
+};
  
  I thought the 4 flip states corresponded to the 4 GP stages, but now
  you confused me. It seems to indeed progress one stage for every 4 flip
  states.
 
 I'm still confused ;-)

If you do a synchronize_rcu() it might well have to wait through the
following sequence of states:

Stage 0: (might have to wait through part of this to get out of next queue)
rcu_try_flip_idle_state,/* I */
rcu_try_flip_waitack_state, /* A */
rcu_try_flip_waitzero_state,/* Z */
rcu_try_flip_waitmb_state   /* M */
Stage 1:
rcu_try_flip_idle_state,/* I */
rcu_try_flip_waitack_state, /* A */
rcu_try_flip_waitzero_state,/* Z */
rcu_try_flip_waitmb_state   /* M */
Stage 2:
rcu_try_flip_idle_state,/* I */
rcu_try_flip_waitack_state, /* A */
rcu_try_flip_waitzero_state,/* Z */
rcu_try_flip_waitmb_state   /* M */
Stage 3:
rcu_try_flip_idle_state,/* I */
rcu_try_flip_waitack_state, /* A */
rcu_try_flip_waitzero_state,/* Z */
rcu_try_flip_waitmb_state   /* M */
Stage 4:
rcu_try_flip_idle_state,/* I */
rcu_try_flip_waitack_state, /* A */
rcu_try_flip_waitzero_state,/* Z */
rcu_try_flip_waitmb_state   /* M */

So yes, grace periods do indeed have some latency.
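
Putting numbers on that: with GP_STAGES = 4, a synchronize_rcu() that just
misses a flip can see up to

    5 stages x 4 states per stage = 20 state-machine transitions

before its callback reaches the done list -- the same count of 20 that
Steven arrives at further down the thread.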

  Hmm, now I have to puzzle how these 4 stages are required by the lock
  and unlock magic.
  
+/*
+ * Return the number of RCU batches processed thus far.  Useful for 
debug
+ * and statistics.  The _bh variant is identical to straight RCU.
+ */
   
   If they are identical, then why the separation?
  
  I guess a smaller RCU domain makes for quicker grace periods.
 
 No, I mean that both the rcu_batches_completed and
 rcu_batches_completed_bh are identical. Perhaps we can just put in a
 
 #define rcu_batches_completed_bh rcu_batches_completed
 
 in rcupreempt.h.  In rcuclassic, they are different. But no need to have
 two identical functions in the preempt version. A macro should do.

Ah!!!  Good point, #define does make sense here.

+void __rcu_read_lock(void)
+{
+   int idx;
+   struct task_struct *me = current;
   
   Nitpick, but other places in the kernel usually use t or p as a
   variable to assign current to.  It's just that me thows me off a
   little while reviewing this.  But this is just a nitpick, so do as you
   will.
  
  struct task_struct *curr = current;
  
  is also not uncommon.
 
 True, but the me confused me. Since that task struct is not me ;-)

Well, who is it, then?  ;-)

+   int nesting;
+
+   nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
+   if (nesting != 0) {
+
+   /* An earlier rcu_read_lock() covers us, just count it. 
*/
+
+   me->rcu_read_lock_nesting = nesting + 1;
+
+   } else {
+   unsigned long oldirq;
   
+
+   /*
+* Disable local interrupts to prevent the grace-period
+* detection state machine from seeing us half-done.
+* NMIs can still occur, of course, and might themselves
+* contain rcu_read_lock().
+*/
+
+   local_irq_save(oldirq);
   
   Isn't the GP detection done via a tasklet/softirq. So wouldn't a
   local_bh_disable be sufficient here? You already cover NMIs, which would
   also handle normal interrupts.
  
  This is also my understanding, but I think this disable is an
  'optimization' in that it avoids the regular IRQs from jumping through
  these hoops outlined below.
 
 But isn't disabling irqs slower than doing a local_bh_disable? So the
 majority of times (where irqs will not happen) we have this overhead.

The current code absolutely must exclude the scheduling-clock hardirq
handler.

+   /*
+* Outermost nesting of rcu_read_lock(), so increment
+  

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 11:20:48AM -0400, Steven Rostedt wrote:
 On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
  +
  +/*
  + * PREEMPT_RCU data structures.
  + */
  +
  +#define GP_STAGES 4
  +struct rcu_data {
  +   spinlock_t  lock;   /* Protect rcu_data fields. */
  +   longcompleted;  /* Number of last completed batch. */
  +   int waitlistcount;
  +   struct tasklet_struct rcu_tasklet;
  +   struct rcu_head *nextlist;
  +   struct rcu_head **nexttail;
  +   struct rcu_head *waitlist[GP_STAGES];
  +   struct rcu_head **waittail[GP_STAGES];
  +   struct rcu_head *donelist;
  +   struct rcu_head **donetail;
  +#ifdef CONFIG_RCU_TRACE
  +   struct rcupreempt_trace trace;
  +#endif /* #ifdef CONFIG_RCU_TRACE */
  +};
  +struct rcu_ctrlblk {
  +   spinlock_t  fliplock;   /* Protect state-machine transitions. */
  +   longcompleted;  /* Number of last completed batch. */
  +};
  +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
  +static struct rcu_ctrlblk rcu_ctrlblk = {
  +   .fliplock = SPIN_LOCK_UNLOCKED,
  +   .completed = 0,
  +};
  +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
  +
  +/*
  + * States for rcu_try_flip() and friends.
  + */
  +
  +enum rcu_try_flip_states {
  +   rcu_try_flip_idle_state,/* I */
  +   rcu_try_flip_waitack_state, /* A */
  +   rcu_try_flip_waitzero_state,/* Z */
  +   rcu_try_flip_waitmb_state   /* M */
  +};
  +static enum rcu_try_flip_states rcu_try_flip_state = 
  rcu_try_flip_idle_state;
  +#ifdef CONFIG_RCU_TRACE
  +static char *rcu_try_flip_state_names[] =
  + { "idle", "waitack", "waitzero", "waitmb" };
  +#endif /* #ifdef CONFIG_RCU_TRACE */
 
 [snip]
 
  +/*
  + * If a global counter flip has occurred since the last time that we
  + * advanced callbacks, advance them.  Hardware interrupts must be
  + * disabled when calling this function.
  + */
  +static void __rcu_advance_callbacks(struct rcu_data *rdp)
  +{
  +   int cpu;
  +   int i;
  +   int wlc = 0;
  +
  +   if (rdp->completed != rcu_ctrlblk.completed) {
  +   if (rdp->waitlist[GP_STAGES - 1] != NULL) {
  +   *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
  +   rdp->donetail = rdp->waittail[GP_STAGES - 1];
  +   RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
  +   }
  +   for (i = GP_STAGES - 2; i >= 0; i--) {
  +   if (rdp->waitlist[i] != NULL) {
  +   rdp->waitlist[i + 1] = rdp->waitlist[i];
  +   rdp->waittail[i + 1] = rdp->waittail[i];
  +   wlc++;
  +   } else {
  +   rdp->waitlist[i + 1] = NULL;
  +   rdp->waittail[i + 1] =
  +   &rdp->waitlist[i + 1];
  +   }
  +   }
  +   if (rdp->nextlist != NULL) {
  +   rdp->waitlist[0] = rdp->nextlist;
  +   rdp->waittail[0] = rdp->nexttail;
  +   wlc++;
  +   rdp->nextlist = NULL;
  +   rdp->nexttail = &rdp->nextlist;
  +   RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
  +   } else {
  +   rdp->waitlist[0] = NULL;
  +   rdp->waittail[0] = &rdp->waitlist[0];
  +   }
  +   rdp->waitlistcount = wlc;
  +   rdp->completed = rcu_ctrlblk.completed;
  +   }
  +
  +   /*
  +* Check to see if this CPU needs to report that it has seen
  +* the most recent counter flip, thereby declaring that all
  +* subsequent rcu_read_lock() invocations will respect this flip.
  +*/
  +
  +   cpu = raw_smp_processor_id();
  +   if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
  +   smp_mb();  /* Subsequent counter accesses must see new value */
  +   per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
  +   smp_mb();  /* Subsequent RCU read-side critical sections */
  +  /*  seen -after- acknowledgement. */
  +   }
  +}
 
 [snip]
 
  +/*
  + * Attempt a single flip of the counters.  Remember, a single flip does
  + * -not- constitute a grace period.  Instead, the interval between
  + * at least three consecutive flips is a grace period.
  + *
  + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
  + * on a large SMP, they might want to use a hierarchical organization of
  + * the per-CPU-counter pairs.
  + */
  +static void rcu_try_flip(void)
  +{
  +   unsigned long oldirq;
  +
  +   RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
  +   if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
  +   RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
  +   return;
  +   }
  +
  +   /*
  +* Take the next transition(s) through the RCU grace-period
  +* flip-counter state machine.
  +*/
  +
  +   switch (rcu_try_flip_state) {
  +   case rcu_try_flip_idle_state:
  +  

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt


--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:


 If you do a synchronize_rcu() it might well have to wait through the
 following sequence of states:

 Stage 0: (might have to wait through part of this to get out of next queue)
   rcu_try_flip_idle_state,/* I */
   rcu_try_flip_waitack_state, /* A */
   rcu_try_flip_waitzero_state,/* Z */
   rcu_try_flip_waitmb_state   /* M */
 Stage 1:
   rcu_try_flip_idle_state,/* I */
   rcu_try_flip_waitack_state, /* A */
   rcu_try_flip_waitzero_state,/* Z */
   rcu_try_flip_waitmb_state   /* M */
 Stage 2:
   rcu_try_flip_idle_state,/* I */
   rcu_try_flip_waitack_state, /* A */
   rcu_try_flip_waitzero_state,/* Z */
   rcu_try_flip_waitmb_state   /* M */
 Stage 3:
   rcu_try_flip_idle_state,/* I */
   rcu_try_flip_waitack_state, /* A */
   rcu_try_flip_waitzero_state,/* Z */
   rcu_try_flip_waitmb_state   /* M */
 Stage 4:
   rcu_try_flip_idle_state,/* I */
   rcu_try_flip_waitack_state, /* A */
   rcu_try_flip_waitzero_state,/* Z */
   rcu_try_flip_waitmb_state   /* M */

 So yes, grace periods do indeed have some latency.

Yes they do. I'm now at the point that I'm just trusting you that you
understand that each of these stages are needed. My IQ level only lets me
understand next - wait - done, but not the extra 3 shifts in wait.

;-)

 
  True, but the me confused me. Since that task struct is not me ;-)

 Well, who is it, then?  ;-)

It's the app I watch sitting there waiting its turn for its callback to
run.

   
Isn't the GP detection done via a tasklet/softirq. So wouldn't a
local_bh_disable be sufficient here? You already cover NMIs, which would
also handle normal interrupts.
  
   This is also my understanding, but I think this disable is an
   'optimization' in that it avoids the regular IRQs from jumping through
   these hoops outlined below.
 
  But isn't disabling irqs slower than doing a local_bh_disable? So the
  majority of times (where irqs will not happen) we have this overhead.

 The current code absolutely must exclude the scheduling-clock hardirq
 handler.

ACKed,
The reasoning you gave in Peter's reply most certainly makes sense.


 + *
 + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU 
 implementation
   
Oh, come now! It's not nuts to use this ;-)
   
 + * on a large SMP, they might want to use a hierarchical 
 organization of
 + * the per-CPU-counter pairs.
 + */
  
   Its the large SMP case that's nuts, and on that I have to agree with
   Paul, its not really large SMP friendly.
 
  Hmm, that could be true. But on large SMP systems, you usually have a
  large amounts of memory, so hopefully a really long synchronize_rcu
  would not be a problem.

 Somewhere in the range from 64 to a few hundred CPUs, the global lock
 protecting the try_flip state machine would start sucking air pretty
 badly.  But the real problem is synchronize_sched(), which loops through
 all the CPUs --  this would likely cause problems at a few tens of
 CPUs, perhaps as early as 10-20.

hehe, From someone who's largest box is 4 CPUs, to me 16 CPUS is large.
But I can see hundreds, let alone thousands of CPUs would make a huge
grinding halt on things like synchronize_sched. God, imagine if all CPUs
did that approximately at the same time. The system would show a huge
jitter.

-- Steve



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 07:23:09PM -0400, Steven Rostedt wrote:
 --
 On Fri, 21 Sep 2007, Paul E. McKenney wrote:
 
  If you do a synchronize_rcu() it might well have to wait through the
  following sequence of states:
 
  Stage 0: (might have to wait through part of this to get out of next 
  queue)
  rcu_try_flip_idle_state,/* I */
  rcu_try_flip_waitack_state, /* A */
  rcu_try_flip_waitzero_state,/* Z */
  rcu_try_flip_waitmb_state   /* M */
  Stage 1:
  rcu_try_flip_idle_state,/* I */
  rcu_try_flip_waitack_state, /* A */
  rcu_try_flip_waitzero_state,/* Z */
  rcu_try_flip_waitmb_state   /* M */
  Stage 2:
  rcu_try_flip_idle_state,/* I */
  rcu_try_flip_waitack_state, /* A */
  rcu_try_flip_waitzero_state,/* Z */
  rcu_try_flip_waitmb_state   /* M */
  Stage 3:
  rcu_try_flip_idle_state,/* I */
  rcu_try_flip_waitack_state, /* A */
  rcu_try_flip_waitzero_state,/* Z */
  rcu_try_flip_waitmb_state   /* M */
  Stage 4:
  rcu_try_flip_idle_state,/* I */
  rcu_try_flip_waitack_state, /* A */
  rcu_try_flip_waitzero_state,/* Z */
  rcu_try_flip_waitmb_state   /* M */
 
  So yes, grace periods do indeed have some latency.
 
 Yes they do. I'm now at the point that I'm just trusting you that you
 understand that each of these stages are needed. My IQ level only lets me
 understand next - wait - done, but not the extra 3 shifts in wait.
 
 ;-)

In the spirit of full disclosure, I am not -absolutely- certain that
they are needed, only that they are sufficient.  Just color me paranoid.

   True, but the me confused me. Since that task struct is not me ;-)
 
  Well, who is it, then?  ;-)
 
 It's the app I watch sitting there waiting its turn for its callback to
 run.

:-)

 Isn't the GP detection done via a tasklet/softirq. So wouldn't a
 local_bh_disable be sufficient here? You already cover NMIs, which 
 would
 also handle normal interrupts.
   
This is also my understanding, but I think this disable is an
'optimization' in that it avoids the regular IRQs from jumping through
these hoops outlined below.
  
   But isn't disabling irqs slower than doing a local_bh_disable? So the
   majority of times (where irqs will not happen) we have this overhead.
 
  The current code absolutely must exclude the scheduling-clock hardirq
  handler.
 
 ACKed,
 The reasoning you gave in Peter's reply most certainly makes sense.
 
  + *
  + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU 
  implementation

 Oh, come now! It's not nuts to use this ;-)

  + * on a large SMP, they might want to use a hierarchical 
  organization of
  + * the per-CPU-counter pairs.
  + */
   
Its the large SMP case that's nuts, and on that I have to agree with
Paul, its not really large SMP friendly.
  
   Hmm, that could be true. But on large SMP systems, you usually have a
   large amounts of memory, so hopefully a really long synchronize_rcu
   would not be a problem.
 
  Somewhere in the range from 64 to a few hundred CPUs, the global lock
  protecting the try_flip state machine would start sucking air pretty
  badly.  But the real problem is synchronize_sched(), which loops through
  all the CPUs --  this would likely cause problems at a few tens of
  CPUs, perhaps as early as 10-20.
 
 hehe, From someone who's largest box is 4 CPUs, to me 16 CPUS is large.
 But I can see hundreds, let alone thousands of CPUs would make a huge
 grinding halt on things like synchronize_sched. God, imaging if all CPUs
 did that approximately at the same time. The system would should a huge
 jitter.

Well, the first time the SGI guys tried to boot a 1024-CPU Altix, I got
an email complaining about RCU overheads.  ;-)  Manfred Spraul fixed
things up for them, though.

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
 On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

Covering the pieces that weren't in Peter's reply.  ;-)

And thank you -very- much for the careful and thorough review!!!

   #endif /* __KERNEL__ */
   #endif /* __LINUX_RCUCLASSIC_H */
  diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
  linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
  --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 
  14:02:36.0 -0700
  +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h  2007-08-22 
  15:21:06.0 -0700
  @@ -52,7 +52,11 @@ struct rcu_head {
  void (*func)(struct rcu_head *head);
   };
   
  +#ifdef CONFIG_CLASSIC_RCU
   #include <linux/rcuclassic.h>
  +#else /* #ifdef CONFIG_CLASSIC_RCU */
  +#include <linux/rcupreempt.h>
  +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
 
 A bit extreme on the comments here.

My fingers do this without any help from the rest of me, but I suppose
it is a bit of overkill in this case.

   #define RCU_HEAD_INIT  { .next = NULL, .func = NULL }
   #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
  @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
   /* Exported common interfaces */
   extern void synchronize_rcu(void);
   extern void rcu_barrier(void);
  +extern long rcu_batches_completed(void);
  +extern long rcu_batches_completed_bh(void);
   
   /* Internal to kernel */
   extern void rcu_init(void);
   extern void rcu_check_callbacks(int cpu, int user);
  +extern int rcu_needs_cpu(int cpu);
   
   #endif /* __KERNEL__ */
   #endif /* __LINUX_RCUPDATE_H */
  diff -urpNa -X dontdiff 
  linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 
  linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
  --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h   1969-12-31 
  16:00:00.0 -0800
  +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h2007-08-22 
  15:21:06.0 -0700
  @@ -0,0 +1,78 @@
  +/*
  + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
  + *
  + * This program is free software; you can redistribute it and/or modify
  + * it under the terms of the GNU General Public License as published by
  + * the Free Software Foundation; either version 2 of the License, or
  + * (at your option) any later version.
  + *
  + * This program is distributed in the hope that it will be useful,
  + * but WITHOUT ANY WARRANTY; without even the implied warranty of
  + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  + * GNU General Public License for more details.
  + *
  + * You should have received a copy of the GNU General Public License
  + * along with this program; if not, write to the Free Software
  + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, 
  USA.
  + *
  + * Copyright (C) IBM Corporation, 2006
  + *
  + * Author:  Paul McKenney [EMAIL PROTECTED]
  + *
  + * Based on the original work by Paul McKenney [EMAIL PROTECTED]
  + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
  + * Papers:
  + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
  + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf 
  (OLS2001)
  + *
  + * For detailed explanation of Read-Copy Update mechanism see -
  + * Documentation/RCU
  + *
  + */
  +
  +#ifndef __LINUX_RCUPREEMPT_H
  +#define __LINUX_RCUPREEMPT_H
  +
  +#ifdef __KERNEL__
  +
  +#include <linux/cache.h>
  +#include <linux/spinlock.h>
  +#include <linux/threads.h>
  +#include <linux/percpu.h>
  +#include <linux/cpumask.h>
  +#include <linux/seqlock.h>
  +
  +#define rcu_qsctr_inc(cpu)
  +#define rcu_bh_qsctr_inc(cpu)
  +#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
  +
  +extern void __rcu_read_lock(void);
  +extern void __rcu_read_unlock(void);
  +extern int rcu_pending(int cpu);
  +extern int rcu_needs_cpu(int cpu);
  +
  +#define __rcu_read_lock_bh()   { rcu_read_lock(); local_bh_disable(); }
  +#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); 
  }
  +
  +#define __rcu_read_lock_nesting()  (current->rcu_read_lock_nesting)
  +
  +extern void __synchronize_sched(void);
  +
  +extern void __rcu_init(void);
  +extern void rcu_check_callbacks(int cpu, int user);
  +extern void rcu_restart_cpu(int cpu);
  +
  +#ifdef CONFIG_RCU_TRACE
  +struct rcupreempt_trace;
  +extern int *rcupreempt_flipctr(int cpu);
  +extern long rcupreempt_data_completed(void);
  +extern int rcupreempt_flip_flag(int cpu);
  +extern int rcupreempt_mb_flag(int cpu);
  +extern char *rcupreempt_try_flip_state_name(void);
  +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
  +#endif
  +
  +struct softirq_action;
  +
  +#endif /* __KERNEL__ */
  +#endif /* __LINUX_RCUPREEMPT_H */
  diff -urpNa -X dontdiff 
  linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h 
  

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 04:03:43PM -0700, Paul E. McKenney wrote:
 On Fri, Sep 21, 2007 at 11:20:48AM -0400, Steven Rostedt wrote:
  On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

  Paul,
  
  Looking further into this, I still think this is a bit of overkill. We
  go through 20 states from call_rcu to list-func().
  
  On call_rcu we put our stuff on the next list. Before we move stuff from
  next to wait, we need to go through 4 states. So we have
  
  next - 4 states - wait[0] - 4 states - wait[1] - 4 states -
  wait[2] - 4 states - wait[3] - 4 states - done.
  
  That's 20 states that we go through from the time we add our function to
  the list to the time it actually gets called. Do we really need the 4
  wait lists?
  
  Seems a bit overkill to me.
  
  What am I missing?
 
 Nothing kills like overkill!!!  ;-)
 
 Seriously, I do expect to be able to squeeze this down over time, but
 feel the need to be a bit on the cowardly side at the moment.
 
 In any case, I will be looking at the scenarios more carefully.  If
 it turns out that GP_STAGES can indeed be cranked down a bit, well,
 that is an easy change!  I just fired off a POWER run with GP_STAGES
 set to 3, will let you know how it goes.

The first attempt blew up during boot badly enough that ABAT was unable
to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
it again on a machine that ABAT has had a better record of reviving...

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

On Fri, 21 Sep 2007, Paul E. McKenney wrote:

 On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
  On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

 Covering the pieces that weren't in Peter's reply.  ;-)

 And thank you -very- much for the careful and thorough review!!!

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUCLASSIC_H */
   diff -urpNa -X dontdiff 
   linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
   linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
   --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h   2007-07-19 
   14:02:36.0 -0700
   +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h2007-08-22 
   15:21:06.0 -0700
   @@ -52,7 +52,11 @@ struct rcu_head {
 void (*func)(struct rcu_head *head);
};
  
   +#ifdef CONFIG_CLASSIC_RCU
 #include <linux/rcuclassic.h>
   +#else /* #ifdef CONFIG_CLASSIC_RCU */
    +#include <linux/rcupreempt.h>
   +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
 
  A bit extreme on the comments here.

 My fingers do this without any help from the rest of me, but I suppose
 it is a bit of overkill in this case.

Heck, why stop the overkill here, the whole patch is overkill ;-)

   +
   +#define GP_STAGES 4
 
  I take it that GP stand for grace period. Might want to state that
  here. /* Grace period stages */  When I was looking at this code at 1am,
  I kept asking myself what's this GP? (General Protection??). But
  that's what happens when looking at code like this after midnight ;-)

 Good point, will add a comment.  You did get it right, grace period.

Thanks, so many places in the kernel have acronyms that are just supposed
to be obvious. I hate them, because they make me feel so stupid when I
don't know what they are. After I find out, I usually slap my forehead and
say duh!. My mind is set on reading code, not deciphering TLAs.
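
For what it's worth, the agreed-upon fix could be as small as (wording is a
guess, not Paul's actual follow-up):

#define GP_STAGES 4    /* Number of grace-period (GP) stages: each callback */
                       /* waits on GP_STAGES wait lists before invocation.  */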


 
  Can you have a pointer somewhere that explains these states. And not a
  it's in this paper or directory. Either have a short discription here,
  or specify where exactly to find the information (perhaps a
  Documentation/RCU/preemptible_states.txt?).
 
  Trying to understand these states has caused me the most agony in
  reviewing these patches.

 Good point, perhaps a comment block above the enum giving a short
 description of the purpose of each state.  Maybe more detail in
 Documentation/RCU as well, as you suggest above.

That would be great.
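
Drawing on Paul's state descriptions earlier in the thread, such a block
might read roughly like this (a sketch, not the wording that eventually
went in):

/*
 * Grace-period state machine, advanced one step per pass through
 * rcu_try_flip():
 *
 *  "I" (idle):     nothing to do; flip the counter when work shows up.
 *  "A" (waitack):  wait for every CPU to acknowledge the flip, so the
 *                  old set of counters can no longer be incremented.
 *  "Z" (waitzero): wait for the sum of the old per-CPU counters to reach
 *                  zero (individual counters cannot be tested; only the
 *                  sum is meaningful).
 *  "M" (waitmb):   wait for every CPU to execute a memory barrier, so
 *                  its read-side accesses cannot leak past the end of
 *                  the grace period.
 */
enum rcu_try_flip_states {
    rcu_try_flip_idle_state,      /* "I" */
    rcu_try_flip_waitack_state,   /* "A" */
    rcu_try_flip_waitzero_state,  /* "Z" */
    rcu_try_flip_waitmb_state     /* "M" */
};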

   +
   +/*
   + * Return the number of RCU batches processed thus far.  Useful for debug
   + * and statistics.  The _bh variant is identical to straight RCU.
   + */
 
  If they are identical, then why the separation?

 I apologize for the repetition in this email.

 I apologize for the repetition in this email.

 I apologize for the repetition in this email.

 Yep, will fix with either #define or static inline, as you suggested
 in a later email.

you're starting to sound like me ;-)

   + struct task_struct *me = current;
 
  Nitpick, but other places in the kernel usually use t or p as a
  variable to assign current to.  It's just that me thows me off a
  little while reviewing this.  But this is just a nitpick, so do as you
  will.

 Fair enough, as discussed earlier.

Who's on first, What's on second, and I-dont-know is on third.

   + unsigned long oldirq;
 
  Nitpick, flags is usually used for saving irq state.

 A later patch in the series fixes these -- I believe I got all of them.
 (The priority-boost patch, IIRC.)

OK


   +
   + /*
   +  * Disable local interrupts to prevent the grace-period
   +  * detection state machine from seeing us half-done.
   +  * NMIs can still occur, of course, and might themselves
   +  * contain rcu_read_lock().
   +  */
   +
   + local_irq_save(oldirq);
 
  Isn't the GP detection done via a tasklet/softirq. So wouldn't a
  local_bh_disable be sufficient here? You already cover NMIs, which would
  also handle normal interrupts.

 We beat this into the ground in other email.

Nothing like kicking a dead horse on LKML ;-)

   +
   + /*
   +  * It is now safe to decrement this task's nesting count.
   +  * NMIs that occur after this statement will route their
   +  * rcu_read_lock() calls through this else clause, and
   +  * will thus start incrementing the per-CPU coutner on
 
  s/coutner/counter/

 wlli fxi!!!

snousd oogd

   +
   +/*
   + * Attempt a single flip of the counters.  Remember, a single flip does
   + * -not- constitute a grace period.  Instead, the interval between
   + * at least three consecutive flips is a grace period.
   + *
   + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
 
  Oh, come now! It's not nuts to use this ;-)

 ;-)

Although working with RCU may drive one nuts.

   + /*
   +  * Take the next transition(s) through the RCU grace-period
   +  * flip-counter state machine.
   +  */
   +
   + switch 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:
 
  In any case, I will be looking at the scenarios more carefully.  If
  it turns out that GP_STAGES can indeed be cranked down a bit, well,
  that is an easy change!  I just fired off a POWER run with GP_STAGES
  set to 3, will let you know how it goes.

 The first attempt blew up during boot badly enough that ABAT was unable
 to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
 it again on a machine that ABAT has had a better record of reviving...

This still frightens the hell out of me. Going through 15 states and
failing. Seems the CPU is holding off writes for a long long time. That
means we flipped the counter 4 times, and that still wasn't good enough?

Maybe I'll boot up my powerbook to see if it has the same issues.

Well, I'm still finishing up on moving into my new house, so I won't be
available this weekend.

Thanks,

-- Steve



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
 
 --
 On Fri, 21 Sep 2007, Paul E. McKenney wrote:
  
   In any case, I will be looking at the scenarios more carefully.  If
   it turns out that GP_STAGES can indeed be cranked down a bit, well,
   that is an easy change!  I just fired off a POWER run with GP_STAGES
   set to 3, will let you know how it goes.
 
  The first attempt blew up during boot badly enough that ABAT was unable
  to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
  it again on a machine that ABAT has had a better record of reviving...
 
 This still frightens the hell out of me. Going through 15 states and
 failing. Seems the CPU is holding off writes for a long long time. That
 means we flipped the counter 4 times, and that still wasn't good enough?

Might be that the other machine has its 2.6.22 version of .config messed
up.  I will try booting it on a stock 2.6.22 kernel when it comes back
to life -- not sure I ever did that before.  Besides, the other similar
machine seems to have gone down for the count, but without me torturing
it...

Also, keep in mind that various stages can record a memory misordering,
for example, by incrementing the wrong counter.

 Maybe I'll boot up my powerbook to see if it has the same issues.
 
 Well, I'm still finishing up on moving into my new house, so I won't be
 available this weekend.

The other machine not only booted, but has survived several minutes of
rcutorture thus far.  I am also trying POWER5 machine as well, as the
one currently running is a POWER4, which is a bit less aggressive about
memory reordering than is the POWER5.

Even if they pass, I refuse to reduce GP_STAGES until proven safe.
Trust me, you -don't- want to be unwittingly making use of a subtly
busted RCU implementation!!!

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
 On Fri, 21 Sep 2007, Paul E. McKenney wrote:
  On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
   On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

+   /*
+* Take the next transition(s) through the RCU grace-period
+* flip-counter state machine.
+*/
+
+   switch (rcu_try_flip_state) {
+   case rcu_try_flip_idle_state:
+   if (rcu_try_flip_idle())
+   rcu_try_flip_state = rcu_try_flip_waitack_state;
  
   Just trying to understand all this. Here at flip_idle, only a CPU with
   no pending RCU calls will flip it. Then all the cpus flags will be set
   to rcu_flipped, and the ctrl.completed counter is incremented.
 
  s/no pending RCU calls/at least one pending RCU call/, but otherwise
  spot on.
 
  So if the RCU grace-period machinery is idle, the first CPU to take
  a scheduling-clock interrupt after having posted an RCU callback will
  get things going.
 
 I said 'no' because of this:
 
 +rcu_try_flip_idle(void)
 +{
 +   int cpu;
 +
 +   RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
 +   if (!rcu_pending(smp_processor_id())) {
 +   RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
 +   return 0;
 +   }
 
 But now I'm a bit more confused. :-/
 
 Looking at the caller in kernel/timer.c I see
 
   if (rcu_pending(cpu))
   rcu_check_callbacks(cpu, user_tick);
 
 And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
 that we call this when we have a pending rcu, but if we have a pending
 rcu, we won't flip the counter ??

We don't enter unless there is something for RCU to do (might be a
pending callback, for example, but might also be needing to acknowledge
a counter flip).  If, by the time we get to rcu_try_flip_idle(), there
is no longer anything to do (!rcu_pending()), we bail.

So a given CPU kicks the state machine out of idle only if it -still-
has something to do once it gets to rcu_try_flip_idle(), right?
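
In sketch form, the double check is (the function names are the ones from
the patch, but these bodies are simplified, not the actual code):

	/* scheduling-clock path, kernel/timer.c: */
	if (rcu_pending(cpu))				/* anything for RCU to do?   */
		rcu_check_callbacks(cpu, user_tick);	/* ...which in turn calls    */
							/* rcu_try_flip()            */

	/* ...and then, at the bottom of that call chain: */
	static int rcu_try_flip_idle(void)
	{
		if (!rcu_pending(smp_processor_id()))
			return 0;	/* whatever was pending has been handled */
					/* along the way (e.g. acknowledging a   */
					/* flip), so leave the machine idle      */
		/* ...otherwise do the flip and advance to the waitack state... */
		return 1;
	}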

[ . . . ]

   Is there a chance that overflow of a counter (although probably very
   very unlikely) would cause any problems?
 
  The only way it could cause a problem would be if there was ever
  more than 4,294,967,296 outstanding rcu_read_lock() calls.  I believe
  that lockdep screams if it sees more than 30 nested locks within a
  single task, so for systems that support no more than 100M tasks, we
  should be OK.  It might sometime be necessary to make this be a long
  rather than an int.  Should we just do that now and be done with it?
 
 Sure, why not. More and more and more overkill!!!
 
 (rostedt hears in his head the Monty Python Spam song).

;-)  OK!
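
So the per-CPU counters become something like this (the field name is
illustrative; the surrounding structure is whatever the patch already
uses):

	long rcu_flipctr[2];	/* was int; on 64-bit, overflow now needs	*/
				/* ~2^63 outstanding rcu_read_lock() calls,	*/
				/* while on 32-bit long is no wider than int	*/
				/* anyway -- hence the "overkill" above.	*/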

   Also, all the CPUs have their check_mb set.
  
+   rcu_try_flip_state = rcu_try_flip_waitmb_state;
+   break;
+   case rcu_try_flip_waitmb_state:
+   if (rcu_try_flip_waitmb())
  
   I have to admit that this seems a bit of an overkill, but I guess you
   know what you are doing.  After going through three states, we still
   need to do a memory barrier on each CPU?
 
  Yep.  Because there are no memory barriers in rcu_read_unlock(), the
  CPU is free to reorder the contents of the RCU read-side critical section
  to follow the counter decrement.  This means that this CPU would still
  be referencing RCU-protected data after it had told the world that it
  was no longer doing so.  Forcing a memory barrier on each CPU guarantees
  that if we see the memory-barrier acknowledge, we also see any prior
  RCU read-side critical section.
 
 And this seem reasonable to me that this would be enough to satisfy a
 grace period. But the CPU moving around the rcu_read_(un)lock's around.
 
 Are we sure that adding all these grace periods stages is better than just
 biting the bullet and put in a memory barrier?

Good question.  I believe so, because the extra stages don't require
much additional processing, and because the ratio of rcu_read_lock()
calls to the number of grace periods is extremely high.  But, if I
can prove it is safe, I will certainly decrease GP_STAGES or otherwise
optimize the state machine.
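
To make the trade-off concrete, the memory-barrier stage amounts to
something like the following sketch.  rcu_try_flip_waitmb() is the
function named in the switch statement quoted above, but its body here is
only a sketch, and the per-CPU flag, its values, and the rcu_check_mb()
helper are illustrative names:

	enum rcu_mb_flag_values { rcu_mb_done, rcu_mb_needed };
	static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag);

	/* Grace-period state machine: has every CPU done its barrier yet? */
	static int rcu_try_flip_waitmb(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done)
				return 0;	/* keep waiting */
		return 1;			/* advance the state machine */
	}

	/* Each CPU, from its scheduling-clock path: */
	static void rcu_check_mb(int cpu)
	{
		if (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed) {
			smp_mb();	/* order this CPU's earlier read-side	*/
					/* accesses before its acknowledgement	*/
			per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
		}
	}

The smp_mb() is thus paid once per CPU per grace period rather than once
per rcu_read_unlock(), which is why I believe the extra stages win given
how many rcu_read_lock() calls there are per grace period.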

[ . . . ]

   OK, that's all I have on this patch (will take a bit of a break before
   reviewing your other patches).  But I will say that RCU has grown quite
   a bit, and is looking very good.
 
  Glad you like it, and thank you again for the careful and thorough review.
 
 I'm scared to do the preempt portion %^O

Ummm...  This -was- the preempt portion.  ;-)

    Basically, what I'm saying is "Great work, Paul!".  This is looking
   good. Seems that we just need a little bit better explanation for those
   that are not up at the IQ level of you.  I can write something up after
   this all gets finalized. Sort of a rcu-design.txt, that really tries to
    explain it to the simpletons like me ;-)
 
  I do 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

[ sneaks away from the family for a bit to answer emails ]

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

 On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
 
  --
  On Fri, 21 Sep 2007, Paul E. McKenney wrote:
   
In any case, I will be looking at the scenarios more carefully.  If
it turns out that GP_STAGES can indeed be cranked down a bit, well,
that is an easy change!  I just fired off a POWER run with GP_STAGES
set to 3, will let you know how it goes.
  
   The first attempt blew up during boot badly enough that ABAT was unable
   to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
   it again on a machine that ABAT has had a better record of reviving...
 
  This still frightens the hell out of me. Going through 15 states and
  failing. Seems the CPU is holding off writes for a long long time. That
  means we flipped the counter 4 times, and that still wasn't good enough?

 Might be that the other machine has its 2.6.22 version of .config messed
 up.  I will try booting it on a stock 2.6.22 kernel when it comes back
 to life -- not sure I ever did that before.  Besides, the other similar
 machine seems to have gone down for the count, but without me torturing
 it...

 Also, keep in mind that various stages can record a memory misordering,
 for example, by incrementing the wrong counter.

  Maybe I'll boot up my powerbook to see if it has the same issues.
 
  Well, I'm still finishing up on moving into my new house, so I won't be
  available this weekend.

 The other machine not only booted, but has survived several minutes of
 rcutorture thus far.  I am also trying POWER5 machine as well, as the
 one currently running is a POWER4, which is a bit less aggressive about
 memory reordering than is the POWER5.

 Even if they pass, I refuse to reduce GP_STAGES until proven safe.
 Trust me, you -don't- want to be unwittingly making use of a subtly
 busted RCU implementation!!!

I totally agree. This is the same reason I want to understand -why- it
fails with 3 stages. To make sure that adding a 4th stage really does fix
it, and doesn't just make the chances for the bug smaller.

I just have that funny feeling that we are band-aiding this for POWER with
extra stages and not really solving the bug.

I could be totally out in left field on this. But the more people have a
good understanding of what is happening (this includes why things fail)
the more people in general can trust this code.  Right now I'm thinking
you may be the only one that understands this code enough to trust it. I'm
just wanting you to help people like me to trust the code by understanding
and not just having faith in others.

If you ever decide to give up jogging, we need to make sure that there are
people here that can still fill those running shoes (-:


 -- Steve



Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Steven Rostedt

[took off Ingo, because he has my ISP blacklisted, and I'm tired of
getting those return mail messages. He can read LKML or you can re-CC
him. Sad since this is a topic he should be reading. ]

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

 On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
  On Fri, 21 Sep 2007, Paul E. McKenney wrote:
   On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

 [ . . . ]

 + /*
 +  * Take the next transition(s) through the RCU grace-period
 +  * flip-counter state machine.
 +  */
 +
 + switch (rcu_try_flip_state) {
 + case rcu_try_flip_idle_state:
 + if (rcu_try_flip_idle())
 + rcu_try_flip_state = rcu_try_flip_waitack_state;
   
Just trying to understand all this. Here at flip_idle, only a CPU with
no pending RCU calls will flip it. Then all the cpus flags will be set
to rcu_flipped, and the ctrl.completed counter is incremented.
  
   s/no pending RCU calls/at least one pending RCU call/, but otherwise
   spot on.
  
   So if the RCU grace-period machinery is idle, the first CPU to take
   a scheduling-clock interrupt after having posted an RCU callback will
   get things going.
 
  I said 'no' because of this:
 
  +rcu_try_flip_idle(void)
  +{
  +   int cpu;
  +
  +   RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
  +   if (!rcu_pending(smp_processor_id())) {
  +   RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
  +   return 0;
  +   }
 
  But now I'm a bit more confused. :-/
 
  Looking at the caller in kernel/timer.c I see
 
  if (rcu_pending(cpu))
  rcu_check_callbacks(cpu, user_tick);
 
  And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
  that we call this when we have a pending rcu, but if we have a pending
  rcu, we won't flip the counter ??

 We don't enter unless there is something for RCU to do (might be a
 pending callback, for example, but might also be needing to acknowledge
 a counter flip).  If, by the time we get to rcu_try_flip_idle(), there
 is no longer anything to do (!rcu_pending()), we bail.

 So a given CPU kicks the state machine out of idle only if it -still-
 has something to do once it gets to rcu_try_flip_idle(), right?


Now I can slap my forehead!  Duh, I wasn't seeing that ! in front of the
rcu_pending condition in the rcu_try_flip_idle.  We only flip if we do
indeed have something pending. I need some sleep. I also need to
re-evaluate some of my analysis of that code. But it doesn't change my
opinion of the stages.

 
  Are we sure that adding all these grace periods stages is better than just
  biting the bullet and put in a memory barrier?

 Good question.  I believe so, because the extra stages don't require
 much additional processing, and because the ratio of rcu_read_lock()
 calls to the number of grace periods is extremely high.  But, if I
 can prove it is safe, I will certainly decrease GP_STAGES or otherwise
 optimize the state machine.

But until others besides yourself understand that state machine (doesn't
really need to be me) I would be worried about applying it without
barriers.  The barriers may add a bit of overhead, but it adds some
confidence in the code.  I'm arguing that we have barriers in there until
there's a fine understanding of why we fail with 3 stages and not 4.
Perhaps you don't have a box with enough cpus to fail at 4.

I don't know how the higher ups in the kernel command line feel, but I
think that memory barriers on critical sections are justified. But if you
can show a proof that adding extra stages is sufficient to deal with
CPUS moving memory writes around, then so be it. But I'm still not
convinced that these extra stages are really solving the bug instead of
just making it much less likely to happen.

Ingo praised this code since it had several years of testing in the RT
tree. But that version has barriers, so this new version without the
barriers has not had that "run it through the grinder" feeling to it.


 [ . . . ]

OK, that's all I have on this patch (will take a bit of a break before
reviewing your other patches).  But I will say that RCU has grown quite
a bit, and is looking very good.
  
   Glad you like it, and thank you again for the careful and thorough review.
 
  I'm scared to do the preempt portion %^O

 Ummm...  This -was- the preempt portion.  ;-)

hehe, I do need sleep. I meant the boosting portion.


Basically, what I'm saying is "Great work, Paul!".  This is looking
good. Seems that we just need a little bit better explanation for those
that are not up at the IQ level of you.  I can write something up after
this all gets finalized. Sort of a rcu-design.txt, that really tries to
explain it to the simpletons like me ;-)
  
   I do greatly appreciate the compliments, especially coming 

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 11:15:42PM -0400, Steven Rostedt wrote:
 On Fri, 21 Sep 2007, Paul E. McKenney wrote:
  On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
   On Fri, 21 Sep 2007, Paul E. McKenney wrote:
On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
 On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

   Are we sure that adding all these grace periods stages is better than just
   biting the bullet and put in a memory barrier?
 
  Good question.  I believe so, because the extra stages don't require
  much additional processing, and because the ratio of rcu_read_lock()
  calls to the number of grace periods is extremely high.  But, if I
  can prove it is safe, I will certainly decrease GP_STAGES or otherwise
  optimize the state machine.
 
 But until others besides yourself understand that state machine (doesn't
 really need to be me) I would be worried about applying it without
 barriers.  The barriers may add a bit of overhead, but it adds some
 confidence in the code.  I'm arguing that we have barriers in there until
 there's a fine understanding of why we fail with 3 stages and not 4.
 Perhaps you don't have a box with enough cpus to fail at 4.
 
 I don't know how the higher ups in the kernel command line feel, but I
 think that memory barriers on critical sections are justified. But if you
 can show a proof that adding extra stages is sufficient to deal with
 CPUS moving memory writes around, then so be it. But I'm still not
 convinced that these extra stages are really solving the bug instead of
 just making it much less likely to happen.
 
 Ingo praised this code since it had several years of testing in the RT
 tree. But that version has barriers, so this new version without the
 barriers has not had that "run it through the grinder" feeling to it.

Fair point...  Though the -rt variant has its shortcomings as well,
such as being unusable from NMI/SMI handlers.

How about this:  I continue running the GP_STAGES=3 run on the pair of
POWER machines (which are both going strong), and I also get a document
together describing the new version (and of course apply the changes we
have discussed, and merge with recent CPU-hotplug changes -- Gautham
Shenoy is currently working this), work out a good answer to "how
big exactly does GP_STAGES need to be", test whatever that number is,
assuming it is neither 3 nor 4, and figure out why the gekko-lp1 machine
choked on GP_STAGES=3.

Then we can work out the best path forward from wherever that ends up
being.

[ . . . ]

Thanx, Paul
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-21 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 10:56:56PM -0400, Steven Rostedt wrote:
 
 [ sneaks away from the family for a bit to answer emails ]

[ same here, now that you mention it... ]

 --
 On Fri, 21 Sep 2007, Paul E. McKenney wrote:
 
  On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
  
   --
   On Fri, 21 Sep 2007, Paul E. McKenney wrote:

 In any case, I will be looking at the scenarios more carefully.  If
 it turns out that GP_STAGES can indeed be cranked down a bit, well,
 that is an easy change!  I just fired off a POWER run with GP_STAGES
 set to 3, will let you know how it goes.
   
The first attempt blew up during boot badly enough that ABAT was unable
to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
it again on a machine that ABAT has had a better record of reviving...
  
   This still frightens the hell out of me. Going through 15 states and
   failing. Seems the CPU is holding off writes for a long long time. That
   means we flipped the counter 4 times, and that still wasn't good enough?
 
  Might be that the other machine has its 2.6.22 version of .config messed
  up.  I will try booting it on a stock 2.6.22 kernel when it comes back
  to life -- not sure I ever did that before.  Besides, the other similar
  machine seems to have gone down for the count, but without me torturing
  it...
 
  Also, keep in mind that various stages can record a memory misordering,
  for example, by incrementing the wrong counter.
 
   Maybe I'll boot up my powerbook to see if it has the same issues.
  
    Well, I'm still finishing up on moving into my new house, so I won't be
   available this weekend.
 
  The other machine not only booted, but has survived several minutes of
  rcutorture thus far.  I am also trying POWER5 machine as well, as the
  one currently running is a POWER4, which is a bit less aggressive about
  memory reordering than is the POWER5.
 
  Even if they pass, I refuse to reduce GP_STAGES until proven safe.
  Trust me, you -don't- want to be unwittingly making use of a subtly
  busted RCU implementation!!!
 
 I totally agree. This is the same reason I want to understand -why- it
 fails with 3 stages. To make sure that adding a 4th stage really does fix
 it, and doesn't just make the chances for the bug smaller.

Or if it really does break, as opposed to my having happened upon a sick
or misconfigured machine.

 I just have that funny feeling that we are band-aiding this for POWER with
 extra stages and not really solving the bug.
 
 I could be totally out in left field on this. But the more people have a
 good understanding of what is happening (this includes why things fail)
 the more people in general can trust this code.  Right now I'm thinking
 you may be the only one that understands this code enough to trust it. I'm
 just wanting you to help people like me to trust the code by understanding
 and not just having faith in others.

Agreed.  Trusting me is grossly insufficient.  For one thing, the Linux
kernel has a reasonable chance of outliving me.

 If you ever decide to give up jogging, we need to make sure that there are
 people here that can still fill those running shoes (-:

Well, I certainly don't jog as fast or as far as I used to!  ;-)

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-20 Thread Dipankar Sarma
On Fri, Sep 21, 2007 at 12:17:21AM -0400, Steven Rostedt wrote:
> [ continued here from comment on patch 1]
> 
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> >  /* softirq mask and active fields moved to irq_cpustat_t in
> > diff -urpNa -X dontdiff 
> > linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 
> > linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h   2007-08-22 
> > 14:42:23.0 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h2007-08-22 
> > 15:21:06.0 -0700
> > @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
> >  #define RCU_HEAD_INIT  { .next = NULL, .func = NULL }
> >  #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> > @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
> >  /* Exported common interfaces */
> >  extern void synchronize_rcu(void);
> >  extern void rcu_barrier(void);
> > +extern long rcu_batches_completed(void);
> > +extern long rcu_batches_completed_bh(void);
> >
> 
> And here we put back rcu_batches_completed and rcu_batches_completed_bh
> from rcuclassic.h to rcupdate.h ;-)

Good questions :) I can't remember why I did this - probably because
I was breaking up into classic and preemptible RCU in incremental
patches with the goal that the break-up patch can be merged
before the rcu-preempt patches. IIRC, I had to make *batches_completed*()
a common RCU API later on.

Thanks
Dipankar


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-20 Thread Paul E. McKenney
On Fri, Sep 21, 2007 at 12:17:21AM -0400, Steven Rostedt wrote:
> [ continued here from comment on patch 1]
> 
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> >  /* softirq mask and active fields moved to irq_cpustat_t in
> > diff -urpNa -X dontdiff 
> > linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 
> > linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h   2007-08-22 
> > 14:42:23.0 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h2007-08-22 
> > 15:21:06.0 -0700
> > @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
> >  extern void __rcu_init(void);
> >  extern void rcu_check_callbacks(int cpu, int user);
> >  extern void rcu_restart_cpu(int cpu);
> > -extern long rcu_batches_completed(void);
> > -extern long rcu_batches_completed_bh(void);
> >  
> >  #endif /* __KERNEL__ */
> >  #endif /* __LINUX_RCUCLASSIC_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
> > linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 
> > 14:02:36.0 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h  2007-08-22 
> > 15:21:06.0 -0700
> > @@ -52,7 +52,11 @@ struct rcu_head {
> > void (*func)(struct rcu_head *head);
> >  };
> >  
> > +#ifdef CONFIG_CLASSIC_RCU
> >  #include <linux/rcuclassic.h>
> > +#else /* #ifdef CONFIG_CLASSIC_RCU */
> > +#include <linux/rcupreempt.h>
> > +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
> >  
> >  #define RCU_HEAD_INIT  { .next = NULL, .func = NULL }
> >  #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> > @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
> >  /* Exported common interfaces */
> >  extern void synchronize_rcu(void);
> >  extern void rcu_barrier(void);
> > +extern long rcu_batches_completed(void);
> > +extern long rcu_batches_completed_bh(void);
> >
> 
> And here we put back rcu_batches_completed and rcu_batches_completed_bh
> from rcuclassic.h to rcupdate.h ;-)

Hmmm...  Good point!!!  I guess it would be OK to just leave them
in rcupdate.h throughout.  ;-)

Will fix.  And good eyes!

Thanx, Paul


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-20 Thread Steven Rostedt
[ continued here from comment on patch 1]

On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>  /* softirq mask and active fields moved to irq_cpustat_t in
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 
> linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 2007-08-22 
> 14:42:23.0 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h  2007-08-22 
> 15:21:06.0 -0700
> @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
>  extern void __rcu_init(void);
>  extern void rcu_check_callbacks(int cpu, int user);
>  extern void rcu_restart_cpu(int cpu);
> -extern long rcu_batches_completed(void);
> -extern long rcu_batches_completed_bh(void);
>  
>  #endif /* __KERNEL__ */
>  #endif /* __LINUX_RCUCLASSIC_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
> linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h   2007-07-19 
> 14:02:36.0 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h2007-08-22 
> 15:21:06.0 -0700
> @@ -52,7 +52,11 @@ struct rcu_head {
>   void (*func)(struct rcu_head *head);
>  };
>  
> +#ifdef CONFIG_CLASSIC_RCU
>  #include <linux/rcuclassic.h>
> +#else /* #ifdef CONFIG_CLASSIC_RCU */
> +#include <linux/rcupreempt.h>
> +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
>  
>  #define RCU_HEAD_INIT{ .next = NULL, .func = NULL }
>  #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
>  /* Exported common interfaces */
>  extern void synchronize_rcu(void);
>  extern void rcu_barrier(void);
> +extern long rcu_batches_completed(void);
> +extern long rcu_batches_completed_bh(void);
>

And here we put back rcu_batches_completed and rcu_batches_completed_bh
from rcuclassic.h to rcupdate.h ;-)

-- Steve




[PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-10 Thread Paul E. McKenney
Work in progress, not for inclusion.

This patch implements a new version of RCU which allows its read-side
critical sections to be preempted. It uses a set of counter pairs
to keep track of the read-side critical sections and flips them
when all tasks have exited their read-side critical sections. The details
of this implementation can be found in this paper -

http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
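
Very roughly, the read side of the counter-pair scheme looks like the
following sketch (heavily simplified: the rcu_ctrlblk, rcu_flipctr and
rcu_flipctr_idx names are illustrative, and the real code also disables
interrupts around these updates and handles nesting and NMIs -- see
kernel/rcupreempt.c below for the actual implementation):

	/* Each CPU has a *pair* of counters; the low bit of the global
	 * "completed" count selects which half new readers use, and a
	 * "flip" is simply an increment of that count. */
	static DEFINE_PER_CPU(long, rcu_flipctr[2]);

	void __rcu_read_lock(void)
	{
		int idx = rcu_ctrlblk.completed & 0x1;

		__get_cpu_var(rcu_flipctr)[idx]++;	/* reader active under "idx" */
		current->rcu_flipctr_idx = idx;		/* remember which half       */
	}

	void __rcu_read_unlock(void)
	{
		__get_cpu_var(rcu_flipctr)[current->rcu_flipctr_idx]--;
	}

The counters are summed across all CPUs, so a task that migrates between
lock and unlock merely leaves one CPU's counter high and another's low;
the old half of a pair has drained once a flip has moved new readers to
the other half and the old half sums to zero across all CPUs.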

This patch was developed as a part of the -rt kernel development and
meant to provide better latencies when read-side critical sections of
RCU don't disable preemption.  As a consequence of keeping track of RCU
readers, the readers have a slight overhead (optimizations in the paper).
This implementation co-exists with the "classic" RCU implementations
and can be switched to at compile time.

Also includes RCU tracing summarized in debugfs and RCU_SOFTIRQ for
the preemptible variant of RCU.

Signed-off-by: Dipankar Sarma <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> (for RCU_SOFTIRQ)
Signed-off-by: Paul McKenney <[EMAIL PROTECTED]>
---

 include/linux/interrupt.h|1 
 include/linux/rcuclassic.h   |2 
 include/linux/rcupdate.h |7 
 include/linux/rcupreempt.h   |   78 +++
 include/linux/rcupreempt_trace.h |  100 +
 include/linux/sched.h|5 
 kernel/Kconfig.preempt   |   38 +
 kernel/Makefile  |7 
 kernel/fork.c|4 
 kernel/rcupreempt.c  |  767 +++
 kernel/rcupreempt_trace.c|  330 
 11 files changed, 1336 insertions(+), 3 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/interrupt.h 
linux-2.6.22-c-preemptrcu/include/linux/interrupt.h
--- linux-2.6.22-b-fixbarriers/include/linux/interrupt.h2007-07-08 
16:32:17.0 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/interrupt.h 2007-08-22 
15:21:06.0 -0700
@@ -269,6 +269,7 @@ enum
 #ifdef CONFIG_HIGH_RES_TIMERS
HRTIMER_SOFTIRQ,
 #endif
+   RCU_SOFTIRQ,/* Preferable RCU should always be the last softirq */
 };
 
 /* softirq mask and active fields moved to irq_cpustat_t in
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 
linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h   2007-08-22 
14:42:23.0 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h2007-08-22 
15:21:06.0 -0700
@@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
 extern void __rcu_init(void);
 extern void rcu_check_callbacks(int cpu, int user);
 extern void rcu_restart_cpu(int cpu);
-extern long rcu_batches_completed(void);
-extern long rcu_batches_completed_bh(void);
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUCLASSIC_H */
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 
linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 
14:02:36.0 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h  2007-08-22 
15:21:06.0 -0700
@@ -52,7 +52,11 @@ struct rcu_head {
void (*func)(struct rcu_head *head);
 };
 
+#ifdef CONFIG_CLASSIC_RCU
 #include <linux/rcuclassic.h>
+#else /* #ifdef CONFIG_CLASSIC_RCU */
+#include <linux/rcupreempt.h>
+#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
 
 #define RCU_HEAD_INIT  { .next = NULL, .func = NULL }
 #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
@@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct 
 /* Exported common interfaces */
 extern void synchronize_rcu(void);
 extern void rcu_barrier(void);
+extern long rcu_batches_completed(void);
+extern long rcu_batches_completed_bh(void);
 
 /* Internal to kernel */
 extern void rcu_init(void);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_needs_cpu(int cpu);
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUPDATE_H */
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 
linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h   1969-12-31 
16:00:00.0 -0800
+++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h2007-08-22 
15:21:06.0 -0700
@@ -0,0 +1,78 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion (RT implementation)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General 
