Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-03 Thread Paul E. McKenney
On Sat, Feb 03, 2007 at 10:24:04PM -0500, Alan Stern wrote:
> On Sat, 3 Feb 2007, Paul E. McKenney wrote:
> 
> > > And another note: this all assumes that STORE-MB-LOAD works "correctly", 
> > > yes?
> > > We have other code which relies on that, should not be a problem.
> > 
> > We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt of
> > University of Pennsylvania, and Vijay Saraswat of IBM Research towards
> > a "universal memory model" that accommodates all machines.  Currently,
> > it does in fact handle store-mb-load the way we all want, thankfully!
> 
> We should add that many places in the kernel do depend on proper behavior 
> for this data access pattern.  So whatever "universal memory model" we end 
> up with, it had better handle the pattern correctly if Linux is to support 
> it.

Agreed!

> It's interesting to note, however, that this does exclude simple MESI.

Yep!  And also a number of compiler optimizations, as it turns out.  ;-)

There is a tension between nice-to-software memory-barrier properties
on the one hand and easily understood code on the other.  But I guess
that this is true of pretty much any software tool.

Thanx, Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-03 Thread Alan Stern
On Sat, 3 Feb 2007, Paul E. McKenney wrote:

> > And another note: this all assumes that STORE-MB-LOAD works "correctly", 
> > yes?
> > We have other code which relies on that, should not be a problem.
> 
> We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt of
> University of Pennsylvania, and Vijay Saraswat of IBM Research towards
> a "universal memory model" that accommodates all machines.  Currently,
> it does in fact handle store-mb-load the way we all want, thankfully!

We should add that many places in the kernel do depend on proper behavior 
for this data access pattern.  So whatever "universal memory model" we end 
up with, it had better handle the pattern correctly if Linux is to support 
it.

It's interesting to note, however, that this does exclude simple MESI.

Alan Stern



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-03 Thread Paul E. McKenney
On Sat, Feb 03, 2007 at 07:38:50PM +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> > do what you want, as it acquires the lock unconditionally.  I am proposing
> > that synchronize_qrcu() change to something like the following:
> > 
> > 	void synchronize_qrcu(struct qrcu_struct *qp)
> > 	{
> > 		int idx;
> > 
> > 		smp_mb();
> > 
> > 		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > 			smp_rmb();
> > 			if (atomic_read(&qp->ctr[0]) +
> > 			    atomic_read(&qp->ctr[1]) <= 1)
> > 				goto out;
> > 		}
> > 
> > 		mutex_lock(&qp->mutex);
> > 		idx = qp->completed & 0x1;
> > 		atomic_inc(qp->ctr + (idx ^ 0x1));
> > 		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > 		smp_mb__after_atomic_inc();
> 
> I almost forgot. Currently this smp_mb__after_atomic_inc() is not strictly
> needed, and the comment is correct. However, it becomes mandatory with your
> optimization. Without this barrier, it is possible that both checks above
> mutex_lock() will see the result of atomic_dec(), but not the atomic_inc().
> 
> So, may I ask you to also update this comment?
> 
> 	/*
> 	 * Reduce the likelihood that qrcu_read_lock() will loop
> 	 *  AND
> 	 * make sure the second re-check above will see the result
> 	 * of atomic_inc() if it sees the result of atomic_dec()
> 	 */
> 
> Something like this, I hope you will make it better.

Good catch!!!  I will make sure that this is reflected.

Also, validation is proceeding nicely, if extremely hoggishly.
The validation program and its output thus far are attached in case someone
is interested.  The three-readers/three-updaters case has worked its
way up to 16% of the memory on a 48GB machine.  ;-)

If it overflows, I will see if I can get someone to convert it to
VHDL and run hardware validation tools on it.  This turned out to
be necessary for the -rt implementation of RCU...

> And another note: this all assumes that STORE-MB-LOAD works "correctly", yes?
> We have other code which relies on that, should not be a problem.

We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt of
University of Pennsylvania, and Vijay Saraswat of IBM Research towards
a "universal memory model" that accommodates all machines.  Currently,
it does in fact handle store-mb-load the way we all want, thankfully!
Actually, the other guys are doing most of the formalism, my main role
has been programming a very stupid checker based on their formalisms
and yelling at them when something bad happens.

See the  directory in the x10 project on SourceForge if you want more
info, but be warned that the checker's UI really sucks.  Its input and
output formats are abominations that could only have been dreamed up by
someone who started out on punchcards and early-70s BASIC.  Not pretty,
but it does a good job of letting people know what I think the formalism
is saying!

There are some people working on a Prolog program called jmmsolve that
has a much nicer input format, but I need to prototype memory barriers
before they will incorporate them (currently, each statement acts as
if it had smp_mb() before and after it).  Also, their output is as yet
incomprehensible to me.

Thanx, Paul

> (Alan Stern cc'ed).
> 
> Oleg.
> 


qrcuval.2007.02.03a.tgz
Description: qrcuval.2007.02.03a.tgz


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-03 Thread Oleg Nesterov
On 01/31, Paul E. McKenney wrote:
>
> QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> do what you want, as it acquires the lock unconditionally.  I am proposing
> that synchronize_qrcu() change to something like the following:
> 
> 	void synchronize_qrcu(struct qrcu_struct *qp)
> 	{
> 		int idx;
> 
> 		smp_mb();
> 
> 		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> 			smp_rmb();
> 			if (atomic_read(&qp->ctr[0]) +
> 			    atomic_read(&qp->ctr[1]) <= 1)
> 				goto out;
> 		}
> 
> 		mutex_lock(&qp->mutex);
> 		idx = qp->completed & 0x1;
> 		atomic_inc(qp->ctr + (idx ^ 0x1));
> 		/* Reduce the likelihood that qrcu_read_lock() will loop */
> 		smp_mb__after_atomic_inc();

I almost forgot. Currently this smp_mb__after_atomic_inc() is not strictly
needed, and the comment is correct. However, it becomes mandatory with your
optimization. Without this barrier, it is possible that both checks above
mutex_lock() will see the result of atomic_dec(), but not the atomic_inc().

So, may I ask you to also update this comment?

/*
 * Reduce the likelihood that qrcu_read_lock() will loop
 *  AND
 * make sure the second re-check above will see the result
 * of atomic_inc() if it sees the result of atomic_dec()
 */

Something like this, I hope you will make it better.

And another note: this all assumes that STORE-MB-LOAD works "correctly", yes?
We have other code which relies on that, should not be a problem.

(Alan Stern cc'ed).

Oleg.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-02 Thread Paul E. McKenney
On Fri, Feb 02, 2007 at 02:56:12PM +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> > > > 
> > > > 	void synchronize_qrcu(struct qrcu_struct *qp)
> > > > 	{
> > > > 		int idx;
> > > > 
> > > > 		smp_mb();
> > > > 
> > > > 		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > > 			smp_rmb();
> > > > 			if (atomic_read(&qp->ctr[0]) +
> > > > 			    atomic_read(&qp->ctr[1]) <= 1)
> > > > 				goto out;
> > > > 		}
> > > > 
> > > > 		mutex_lock(&qp->mutex);
> > > > 		idx = qp->completed & 0x1;
> > > > 		atomic_inc(qp->ctr + (idx ^ 0x1));
> > > > 		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > > 		smp_mb__after_atomic_inc();
> > > > 		qp->completed++;
> > > > 
> > > > 		atomic_dec(qp->ctr + idx);
> > > > 		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > > 		mutex_unlock(&qp->mutex);
> > > > 	out:
> > > > 		smp_mb();
> > > > 	}
> > > > 
> > > > For the first "if" to give a false positive, a concurrent switch had
> > > > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > > > was two at the time of the first atomic_read(), but then qp->completed
> > > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > > of the second atomic_read.  The only way the second "if" can give us a
> > > > false positive is if there was another change to qp->completed in the
> > > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > > holders must have gotten done, otherwise the second switch could not
> > > > have happened.  Yes, you do incur three memory barriers on the fast
> > > > path, but the best you could hope for with your approach was two of them
> > > > (unless I am confused about how you were using barrier_sync()).
> 
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive really
> means a concurrent switch. And we can check twice - excellent idea!

Well, if it ends up really working.  ;-)

This one needs more than just testing -- I will put together a Promela
model that does a full state-space search for races.

I do have one fall-back, namely putting both counters into a single
aligned long.  But I would like to avoid this, since there might be
architectures out there that cannot cleanly store into half-longs.
Such architectures would have to use atomic ops.  :-/
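A hedged sketch of that fallback (my illustration, not a proposed patch):
with both counters packed into a single aligned word, the summed snapshot
needs only one load, closing the window between the two atomic_read()s:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: both QRCU counters share one aligned 32-bit word,
 * so a single load snapshots them together. */
union qrcu_ctr_pair {
	uint32_t combined;	/* one aligned load reads both counters */
	uint16_t half[2];	/* half[0] is ctr[0], half[1] is ctr[1] */
};

/* Snapshot both counters with a single read and return their sum. */
static int readers_sum(const union qrcu_ctr_pair *pair)
{
	union qrcu_ctr_pair snap = { .combined = pair->combined };

	return snap.half[0] + snap.half[1];
}
```

A sum taken this way cannot be split across a counter flip, so the
double-check could collapse to a single check; the downside, as noted
above, is that architectures without clean sub-word stores would have to
fall back to atomic ops on the whole word.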

> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong. Good!
> > 
> > Me, I didn't want to worry about it unless someone needed it.  Which
> > it now appears they do.  ;-)
> 
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have enough
> brains! (of course, I hate you :)

Coming from you, that is high praise indeed!!!  ;-)

Now if it really does work...

> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > > was this needed for this optimization to work? I am asking because I can't
> > > understand how it can make any difference.
> > 
> > Before, we held the lock, so we could just check the single current
> > element.  Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
> 
> Ah, my question was different. The current version of qrcu does
> 
> 	mutex_lock(&qp->mutex);
> 
> 	idx = qp->completed & 0x1;
> 	if (atomic_read(qp->ctr + idx) == 1)	/* fast path */
> 		return;
> 
>   ...
> 
> and it seems to me that we can retain this fastpath even with your optimization,
> no? Surely, it is not so important, but it is nearly free.

Ah!  This does make sense, excellent point!!!
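
Putting the two ideas together -- the lockless double-check plus the
retained under-lock fastpath -- synchronize_qrcu() would look something
like the following (my sketch of where the thread seems headed, not a
tested patch):

```c
void synchronize_qrcu(struct qrcu_struct *qp)
{
	int idx;

	smp_mb();

	/* Lockless fastpath: sum <= 1 twice in a row means no readers. */
	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
		smp_rmb();
		if (atomic_read(&qp->ctr[0]) +
		    atomic_read(&qp->ctr[1]) <= 1)
			goto out;
	}

	mutex_lock(&qp->mutex);
	idx = qp->completed & 0x1;
	if (atomic_read(qp->ctr + idx) == 1)	/* retained locked fastpath */
		goto out_unlock;
	atomic_inc(qp->ctr + (idx ^ 0x1));
	/*
	 * Reduce the likelihood that qrcu_read_lock() will loop
	 *  AND
	 * make sure the second re-check above will see the result
	 * of atomic_inc() if it sees the result of atomic_dec().
	 */
	smp_mb__after_atomic_inc();
	qp->completed++;

	atomic_dec(qp->ctr + idx);
	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
out_unlock:
	mutex_unlock(&qp->mutex);
out:
	smp_mb();
}
```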

> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
> P-4 ht).

I will do the Promela model, and if that works, I will submit a patch.

Thanx, Paul

> Peter, do you think you can use qrcu?
> 
> Oleg.
> 


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-02 Thread Peter Zijlstra
On Fri, 2007-02-02 at 14:56 +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> > > > 
> > > > 	void synchronize_qrcu(struct qrcu_struct *qp)
> > > > 	{
> > > > 		int idx;
> > > > 
> > > > 		smp_mb();
> > > > 
> > > > 		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > > 			smp_rmb();
> > > > 			if (atomic_read(&qp->ctr[0]) +
> > > > 			    atomic_read(&qp->ctr[1]) <= 1)
> > > > 				goto out;
> > > > 		}
> > > > 
> > > > 		mutex_lock(&qp->mutex);
> > > > 		idx = qp->completed & 0x1;
> > > > 		atomic_inc(qp->ctr + (idx ^ 0x1));
> > > > 		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > > 		smp_mb__after_atomic_inc();
> > > > 		qp->completed++;
> > > > 
> > > > 		atomic_dec(qp->ctr + idx);
> > > > 		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > > 		mutex_unlock(&qp->mutex);
> > > > 	out:
> > > > 		smp_mb();
> > > > 	}
> > > > 
> > > > For the first "if" to give a false positive, a concurrent switch had
> > > > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > > > was two at the time of the first atomic_read(), but then qp->completed
> > > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > > of the second atomic_read.  The only way the second "if" can give us a
> > > > false positive is if there was another change to qp->completed in the
> > > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > > holders must have gotten done, otherwise the second switch could not
> > > > have happened.  Yes, you do incur three memory barriers on the fast
> > > > path, but the best you could hope for with your approach was two of them
> > > > (unless I am confused about how you were using barrier_sync()).
> 
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive really
> means a concurrent switch. And we can check twice - excellent idea!
> 
> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong. Good!
> > 
> > Me, I didn't want to worry about it unless someone needed it.  Which
> > it now appears they do.  ;-)
> 
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have enough
> brains! (of course, I hate you :)
> 
> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > > was this needed for this optimization to work? I am asking because I can't
> > > understand how it can make any difference.
> > 
> > Before, we held the lock, so we could just check the single current
> > element.  Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
> 
> Ah, my question was different. The current version of qrcu does
> 
> 	mutex_lock(&qp->mutex);
> 
> 	idx = qp->completed & 0x1;
> 	if (atomic_read(qp->ctr + idx) == 1)	/* fast path */
> 		return;
> 
>   ...
> 
> and it seems to me that we can retain this fastpath even with your optimization,
> no? Surely, it is not so important, but it is nearly free.
> 
> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
> P-4 ht).
> 
> Peter, do you think you can use qrcu?

Yes, this looks very much like what I need. Awesome work! 



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-02 Thread Oleg Nesterov
On 02/01, Paul E. McKenney wrote:
> > > 
> > > 	void synchronize_qrcu(struct qrcu_struct *qp)
> > > 	{
> > > 		int idx;
> > > 
> > > 		smp_mb();
> > > 
> > > 		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > 			smp_rmb();
> > > 			if (atomic_read(&qp->ctr[0]) +
> > > 			    atomic_read(&qp->ctr[1]) <= 1)
> > > 				goto out;
> > > 		}
> > > 
> > > 		mutex_lock(&qp->mutex);
> > > 		idx = qp->completed & 0x1;
> > > 		atomic_inc(qp->ctr + (idx ^ 0x1));
> > > 		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > 		smp_mb__after_atomic_inc();
> > > 		qp->completed++;
> > > 
> > > 		atomic_dec(qp->ctr + idx);
> > > 		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > 		mutex_unlock(&qp->mutex);
> > > 	out:
> > > 		smp_mb();
> > > 	}
> > > 
> > > For the first "if" to give a false positive, a concurrent switch had
> > > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > > was two at the time of the first atomic_read(), but then qp->completed
> > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > of the second atomic_read.  The only way the second "if" can give us a
> > > false positive is if there was another change to qp->completed in the
> > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > holders must have gotten done, otherwise the second switch could not
> > > have happened.  Yes, you do incur three memory barriers on the fast
> > > path, but the best you could hope for with your approach was two of them
> > > (unless I am confused about how you were using barrier_sync()).

Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
another >= 1. == 1 means we have no active readers. So the false positive really
means a concurrent switch. And we can check twice - excellent idea!

> > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > qp->mutex. Now I think I was wrong. Good!
> 
> Me, I didn't want to worry about it unless someone needed it.  Which
> it now appears they do.  ;-)

No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
So I decided it is not possible. And now you show that I just don't have enough
brains! (of course, I hate you :)

> > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under 
> > ->mutex,
> > was this needed for this optimization to work? I am asking because I can't
> > understand how it can make any difference.
> 
> Before, we held the lock, so we could just check the single current
> element.  Now we don't hold the lock, so we need to check both elements.
> So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> nested "if" statements that test both elements.

Ah, my question was different. The current version of qrcu does

	mutex_lock(&qp->mutex);

	idx = qp->completed & 0x1;
	if (atomic_read(qp->ctr + idx) == 1)	/* fast path */
		return;

...

and it seems to me that we can retain this fastpath even with your optimization,
no? Surely, it is not so important, but it is nearly free.

Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
P-4 ht).

Peter, do you think you can use qrcu?

Oleg.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-02 Thread Oleg Nesterov
On 02/01, Paul E. McKenney wrote:
   
 void synchronize_qrcu(struct qrcu_struct *qp)
 {
 int idx;
 
 smp_mb();
 
 if (atomic_read(qp-ctr[0]) + atomic_read(qp-ctr[1]) = 1) {
 smp_rmb();
 if (atomic_read(qp-ctr[0]) +
 atomic_read(qp-ctr[1]) = 1)
 goto out;
 }
 
 mutex_lock(qp-mutex);
 idx = qp-completed  0x1;
 atomic_inc(qp-ctr + (idx ^ 0x1));
 /* Reduce the likelihood that qrcu_read_lock() will loop */
 smp_mb__after_atomic_inc();
 qp-completed++;
 
 atomic_dec(qp-ctr + idx);
 __wait_event(qp-wq, !atomic_read(qp-ctr + idx));
 mutex_unlock(qp-mutex);
 out:
 smp_mb();
 }
   
   For the first if to give a false positive, a concurrent switch had
   to have happened.  For example, qp-ctr[0] was zero and qp-ctr[1]
   was two at the time of the first atomic_read(), but then qp-completed
   switched so that both qp-ctr[0] and qp-ctr[1] were one at the time
   of the second atomic_read.  The only way the second if can give us a
   false positive is if there was another change to qp-completed in the
   meantime -- but that means that all of the pre-existing qrcu_read_lock()
   holders must have gotten done, otherwise the second switch could not
   have happened.  Yes, you do incur three memory barriers on the fast
   path, but the best you could hope for with your approach was two of them
   (unless I am confused about how you were using barrier_sync()).

Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
another = 1. == 1 means we have no active readers. So the false positive really
means a concurrent switch. And we can check twice - excellent idea!

  While doing qrcu, somehow I convinced myself we can't optimize out taking
  qp-mutex. Now I think I was wrong. Good!
 
 Me, I didn't want to worry about it unless someone needed it.  Which
 it now appears they do.  ;-)

No. I do remember I tried hard to optimize out taking qp-mutex, but failed.
So I decided it is not possible. And now you show that I just don't have enough
brains! (of course, I hate you :)

  Q: you deleted if (atomic_read(qp-ctr + idx) == 1) fastpath under 
  -mutex,
  was this needed for this optimization to work? I am asking because I can't
  understand how it can make any difference.
 
 Before, we held the lock, so we could just check the single current
 element.  Now we don't hold the lock, so we need to check both elements.
 So I replaced the if (atomic_read(qp-ctr + idx) == 1) with the
 nested if statements that test both elements.

Ah, my question was different. The current version of qrcu does

mutex_lock(qp-mutex);

idx = qp-completed  0x1;
if (atomic_read(qp-ctr + idx) == 1)// fast path
return;

...

and it seems to me that we can retain this fastpath even with your optimization,
no? Surely, it is not so important, but it is nearly free.

Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
P-4 ht).

Peter, do you think you can use qrcu?

Oleg.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-02 Thread Peter Zijlstra
On Fri, 2007-02-02 at 14:56 +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> >
> > void synchronize_qrcu(struct qrcu_struct *qp)
> > {
> >         int idx;
> >
> >         smp_mb();
> >
> >         if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> >                 smp_rmb();
> >                 if (atomic_read(&qp->ctr[0]) +
> >                     atomic_read(&qp->ctr[1]) <= 1)
> >                         goto out;
> >         }
> >
> >         mutex_lock(&qp->mutex);
> >         idx = qp->completed & 0x1;
> >         atomic_inc(qp->ctr + (idx ^ 0x1));
> >         /* Reduce the likelihood that qrcu_read_lock() will loop */
> >         smp_mb__after_atomic_inc();
> >         qp->completed++;
> >
> >         atomic_dec(qp->ctr + idx);
> >         __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> >         mutex_unlock(&qp->mutex);
> > out:
> >         smp_mb();
> > }
> >
> > For the first "if" to give a false positive, a concurrent switch had
> > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > was two at the time of the first atomic_read(), but then qp->completed
> > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > of the second atomic_read.  The only way the second "if" can give us a
> > false positive is if there was another change to qp->completed in the
> > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > holders must have gotten done, otherwise the second switch could not
> > have happened.  Yes, you do incur three memory barriers on the fast
> > path, but the best you could hope for with your approach was two of them
> > (unless I am confused about how you were using barrier_sync()).
> 
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive really
> means a concurrent switch. And we can check twice - excellent idea!
> 
> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong. Good!
> > 
> > Me, I didn't want to worry about it unless someone needed it.  Which
> > it now appears they do.  ;-)
> 
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have enough
> brains! (of course, I hate you :)
> 
> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > > was this needed for this optimization to work? I am asking because I can't
> > > understand how it can make any difference.
> > 
> > Before, we held the lock, so we could just check the single current
> > element.  Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
> 
> Ah, my question was different. The current version of qrcu does
> 
>         mutex_lock(&qp->mutex);
> 
>         idx = qp->completed & 0x1;
>         if (atomic_read(qp->ctr + idx) == 1)    // fast path
>                 return;
> 
>         ...
> 
> and it seems to me that we can retain this fastpath even with your optimization,
> no? Surely, it is not so important, but it is nearly free.
> 
> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
> P-4 ht).
> 
> Peter, do you think you can use qrcu?

Yes, this looks very much like what I need. Awesome work! 



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-02 Thread Paul E. McKenney
On Fri, Feb 02, 2007 at 02:56:12PM +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> >
> > void synchronize_qrcu(struct qrcu_struct *qp)
> > {
> >         int idx;
> >
> >         smp_mb();
> >
> >         if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> >                 smp_rmb();
> >                 if (atomic_read(&qp->ctr[0]) +
> >                     atomic_read(&qp->ctr[1]) <= 1)
> >                         goto out;
> >         }
> >
> >         mutex_lock(&qp->mutex);
> >         idx = qp->completed & 0x1;
> >         atomic_inc(qp->ctr + (idx ^ 0x1));
> >         /* Reduce the likelihood that qrcu_read_lock() will loop */
> >         smp_mb__after_atomic_inc();
> >         qp->completed++;
> >
> >         atomic_dec(qp->ctr + idx);
> >         __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> >         mutex_unlock(&qp->mutex);
> > out:
> >         smp_mb();
> > }
> >
> > For the first "if" to give a false positive, a concurrent switch had
> > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > was two at the time of the first atomic_read(), but then qp->completed
> > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > of the second atomic_read.  The only way the second "if" can give us a
> > false positive is if there was another change to qp->completed in the
> > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > holders must have gotten done, otherwise the second switch could not
> > have happened.  Yes, you do incur three memory barriers on the fast
> > path, but the best you could hope for with your approach was two of them
> > (unless I am confused about how you were using barrier_sync()).
> 
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive really
> means a concurrent switch. And we can check twice - excellent idea!

Well, if it ends up really working.  ;-)

This one needs more than just testing -- I will put together a Promela
model that does a full state-space search for races.

I do have one fall-back, namely putting both counters into a single
aligned long.  But I would like to avoid this, since there might be
architectures out there that cannot cleanly store into half-longs.
Such architectures would have to use atomic ops.  :-/

> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong. Good!
> > 
> > Me, I didn't want to worry about it unless someone needed it.  Which
> > it now appears they do.  ;-)
> 
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have enough
> brains! (of course, I hate you :)

Coming from you, that is high praise indeed!!!  ;-)

Now if it really does work...

> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > > was this needed for this optimization to work? I am asking because I can't
> > > understand how it can make any difference.
> > 
> > Before, we held the lock, so we could just check the single current
> > element.  Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
> 
> Ah, my question was different. The current version of qrcu does
> 
>         mutex_lock(&qp->mutex);
> 
>         idx = qp->completed & 0x1;
>         if (atomic_read(qp->ctr + idx) == 1)    // fast path
>                 return;
> 
>         ...
> 
> and it seems to me that we can retain this fastpath even with your optimization,
> no? Surely, it is not so important, but it is nearly free.

Ah!  This does make sense, excellent point!!!

> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
> P-4 ht).

I will do the Promela model, and if that works, I will submit a patch.

Thanx, Paul

> Peter, do you think you can use qrcu?
> 
> Oleg.
 


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-01 Thread Paul E. McKenney
On Thu, Feb 01, 2007 at 07:00:10PM +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> > 
> > QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> > do what you want, as it acquires the lock unconditionally.  I am proposing
> > that synchronize_qrcu() change to something like the following:
> > 
> > void synchronize_qrcu(struct qrcu_struct *qp)
> > {
> >         int idx;
> > 
> >         smp_mb();
> > 
> >         if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> >                 smp_rmb();
> >                 if (atomic_read(&qp->ctr[0]) +
> >                     atomic_read(&qp->ctr[1]) <= 1)
> >                         goto out;
> >         }
> > 
> >         mutex_lock(&qp->mutex);
> >         idx = qp->completed & 0x1;
> >         atomic_inc(qp->ctr + (idx ^ 0x1));
> >         /* Reduce the likelihood that qrcu_read_lock() will loop */
> >         smp_mb__after_atomic_inc();
> >         qp->completed++;
> > 
> >         atomic_dec(qp->ctr + idx);
> >         __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> >         mutex_unlock(&qp->mutex);
> > out:
> >         smp_mb();
> > }
> > 
> > For the first "if" to give a false positive, a concurrent switch had
> > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > was two at the time of the first atomic_read(), but then qp->completed
> > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > of the second atomic_read.  The only way the second "if" can give us a
> > false positive is if there was another change to qp->completed in the
> > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > holders must have gotten done, otherwise the second switch could not
> > have happened.  Yes, you do incur three memory barriers on the fast
> > path, but the best you could hope for with your approach was two of them
> > (unless I am confused about how you were using barrier_sync()).
> 
> While doing qrcu, somehow I convinced myself we can't optimize out taking
> qp->mutex. Now I think I was wrong. Good!

Me, I didn't want to worry about it unless someone needed it.  Which
it now appears they do.  ;-)

> Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> was this needed for this optimization to work? I am asking because I can't
> understand how it can make any difference.

Before, we held the lock, so we could just check the single current
element.  Now we don't hold the lock, so we need to check both elements.
So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
nested "if" statements that test both elements.

> > Oleg, does this look safe?
> 
> Yes. But let me think more about this later, I've got a fever, and can't
> think properly today :)

Get well!!!  ;-)

And yes, the concurrency on the fastpath is nontrivial...

Thanx, Paul


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-02-01 Thread Oleg Nesterov
On 01/31, Paul E. McKenney wrote:
> 
> QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> do what you want, as it acquires the lock unconditionally.  I am proposing
> that synchronize_qrcu() change to something like the following:
> 
>         void synchronize_qrcu(struct qrcu_struct *qp)
>         {
>                 int idx;
> 
>                 smp_mb();
> 
>                 if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
>                         smp_rmb();
>                         if (atomic_read(&qp->ctr[0]) +
>                             atomic_read(&qp->ctr[1]) <= 1)
>                                 goto out;
>                 }
> 
>                 mutex_lock(&qp->mutex);
>                 idx = qp->completed & 0x1;
>                 atomic_inc(qp->ctr + (idx ^ 0x1));
>                 /* Reduce the likelihood that qrcu_read_lock() will loop */
>                 smp_mb__after_atomic_inc();
>                 qp->completed++;
> 
>                 atomic_dec(qp->ctr + idx);
>                 __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
>                 mutex_unlock(&qp->mutex);
>         out:
>                 smp_mb();
>         }
> 
> For the first "if" to give a false positive, a concurrent switch had
> to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> was two at the time of the first atomic_read(), but then qp->completed
> switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> of the second atomic_read.  The only way the second "if" can give us a
> false positive is if there was another change to qp->completed in the
> meantime -- but that means that all of the pre-existing qrcu_read_lock()
> holders must have gotten done, otherwise the second switch could not
> have happened.  Yes, you do incur three memory barriers on the fast
> path, but the best you could hope for with your approach was two of them
> (unless I am confused about how you were using barrier_sync()).

While doing qrcu, somehow I convinced myself we can't optimize out taking
qp->mutex. Now I think I was wrong. Good!

Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
was this needed for this optimization to work? I am asking because I can't
understand how it can make any difference.

> Oleg, does this look safe?

Yes. But let me think more about this later, I've got a fever, and can't
think properly today :)

Oleg.





Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Paul E. McKenney
On Thu, Feb 01, 2007 at 01:03:09AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:
> 
> > The wakeup in barrier_sync() would mean that the counter was zero
> > at some point in the past.  The counter would then be rechecked, and
> > if it were still zero, barrier_sync() would invoke finish_wait() and
> > then return -- but the counter might well become non-zero in the
> > meantime, right?
> > 
> > So given that barrier_sync() is permitted to return after the counter
> > becomes non-zero, why can't it just rely on the fact that barrier_unlock()
> > saw it as zero not long in the past?
> > 
> > > >  It looks like barrier_sync() is more a
> > > > rw semaphore biased to readers.
> > > 
> > > Indeed, the locked sections are designed to be the rare case.
> > 
> > OK -- but barrier_sync() just waits for readers, it doesn't exclude them.
> > 
> > If all barrier_sync() needs to do is to wait until all pre-existing
> > barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
> > be compatible with qrcu's semantics.
> > 
> > So what am I missing here?
> 
> I might be the one missing stuff, I'll have a hard look at qrcu.
> 
> The intent was that barrier_sync() would not write to memory when there
> are no active locked sections, so that the cacheline can stay shared,
> thus keeping it fast.
> 
> If qrcu does exactly this, then yes we have a match.

QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
do what you want, as it acquires the lock unconditionally.  I am proposing
that synchronize_qrcu() change to something like the following:

void synchronize_qrcu(struct qrcu_struct *qp)
{
        int idx;

        smp_mb();

        if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
                smp_rmb();
                if (atomic_read(&qp->ctr[0]) +
                    atomic_read(&qp->ctr[1]) <= 1)
                        goto out;
        }

        mutex_lock(&qp->mutex);
        idx = qp->completed & 0x1;
        atomic_inc(qp->ctr + (idx ^ 0x1));
        /* Reduce the likelihood that qrcu_read_lock() will loop */
        smp_mb__after_atomic_inc();
        qp->completed++;

        atomic_dec(qp->ctr + idx);
        __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
        mutex_unlock(&qp->mutex);
out:
        smp_mb();
}

For the first "if" to give a false positive, a concurrent switch had
to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
was two at the time of the first atomic_read(), but then qp->completed
switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
of the second atomic_read.  The only way the second "if" can give us a
false positive is if there was another change to qp->completed in the
meantime -- but that means that all of the pre-existing qrcu_read_lock()
holders must have gotten done, otherwise the second switch could not
have happened.  Yes, you do incur three memory barriers on the fast
path, but the best you could hope for with your approach was two of them
(unless I am confused about how you were using barrier_sync()).

Oleg, does this look safe?

Ugly at best, I know, but I do very much sympathize with Christoph's
desire to keep the number of synchronization primitives down to a
dull roar.  ;-)

Thanx, Paul


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Peter Zijlstra
On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:

> The wakeup in barrier_sync() would mean that the counter was zero
> at some point in the past.  The counter would then be rechecked, and
> if it were still zero, barrier_sync() would invoke finish_wait() and
> then return -- but the counter might well become non-zero in the
> meantime, right?
> 
> So given that barrier_sync() is permitted to return after the counter
> becomes non-zero, why can't it just rely on the fact that barrier_unlock()
> saw it as zero not long in the past?
> 
> > >  It looks like barrier_sync() is more a
> > > rw semaphore biased to readers.
> > 
> > Indeed, the locked sections are designed to be the rare case.
> 
> OK -- but barrier_sync() just waits for readers, it doesn't exclude them.
> 
> If all barrier_sync() needs to do is to wait until all pre-existing
> barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
> be compatible with qrcu's semantics.
> 
> So what am I missing here?

I might be the one missing stuff, I'll have a hard look at qrcu.

The intent was that barrier_sync() would not write to memory when there
are no active locked sections, so that the cacheline can stay shared,
thus keeping it fast.

If qrcu does exactly this, then yes we have a match.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Paul E. McKenney
On Wed, Jan 31, 2007 at 10:48:21PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> > On 01/31, Paul E. McKenney wrote:
> > > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > > This barrier thing is constructed so that it will not write in the 
> > > > > > sync() condition (the hot path) when there are no active lock 
> > > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure 
> > > > > > how 
> > > > > > this will work out in relation to PI. We might track those in the 
> > > > > > barrier scope and boost those by the max prio of the blockers.
> > > > > 
> > > > > Is this really needed?  We seem to grow new funky locking algorithms 
> > > > > exponentially, while people already have a hard time understanding 
> > > > > the 
> > > > > existing ones.
> > > > 
> > > > yes, it's needed.
> > > 
> > > Would it be possible to come up with something common between this 
> > > primitive
> > > and the one that Oleg Nesterov put together for Jens Axboe?
> > > 
> > >   http://lkml.org/lkml/2006/11/29/330
> > > 
> > > Oleg's approach acquires a lock on the update side, which Peter would
> > > not want in the uncontended case -- but perhaps there is some way to
> > > make Oleg's approach be able to safely test both counters so as to
> > > avoid acquiring the lock if there are no readers.
> > > 
> > > Oleg, any chance of this working?  I believe it does, but have not
> > > thought it through fully.
> > 
> > I think no. From the quick reading, barrier_sync() and qrcu/srcu are
> > quite different. Consider:
> > 
> > barrier_lock()
> > 
> > barrier_sync();
> > 
> > barrier_unlock();
> > ... wake up ...
> > barrier_lock();
> > 
> > schedule again
> > 
> > The last "schedule again" would be a BUG for qrcu/srcu, but probably
> > it is ok for barrier_sync().
> 
> Yes, that would be ok.

The wakeup in barrier_sync() would mean that the counter was zero
at some point in the past.  The counter would then be rechecked, and
if it were still zero, barrier_sync() would invoke finish_wait() and
then return -- but the counter might well become non-zero in the
meantime, right?

So given that barrier_sync() is permitted to return after the counter
becomes non-zero, why can't it just rely on the fact that barrier_unlock()
saw it as zero not long in the past?

> >  It looks like barrier_sync() is more a
> > rw semaphore biased to readers.
> 
> Indeed, the locked sections are designed to be the rare case.

OK -- but barrier_sync() just waits for readers, it doesn't exclude them.

If all barrier_sync() needs to do is to wait until all pre-existing
barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
be compatible with qrcu's semantics.

So what am I missing here?

Thanx, Paul

> > A couple of minor off-topic notes,
> > 
> > +static inline void barrier_unlock(struct barrier *b)
> > +{
> > +       smp_wmb();
> > +       if (atomic_dec_and_test(&b->count))
> > +               __wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> > 
> > This is wake_up_all(&b->wait), yes? I don't understand why key == b, it
> > could be NULL.
> > 
> > +static inline void barrier_sync(struct barrier *b)
> > +{
> > +       might_sleep();
> > +
> > +       if (unlikely(atomic_read(&b->count))) {
> > +               DEFINE_WAIT(wait);
> > +               prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +               while (atomic_read(&b->count))
> > +                       schedule();
> > +               finish_wait(&b->wait, &wait);
> > +       }
> > +}
> > 
> > This should be open-coded wait_event(), but wrong! With the scenario above this
> > can hang forever! because the first wake_up removes the task from the b->wait.
> 
> This would be me struggling with the waitqueue API, its all a tad
> confusing at first look.
> 


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Peter Zijlstra
On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > 
> > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > This barrier thing is constructed so that it will not write in the 
> > > > > sync() condition (the hot path) when there are no active lock 
> > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > > > > this will work out in relation to PI. We might track those in the 
> > > > > barrier scope and boost those by the max prio of the blockers.
> > > > 
> > > > Is this really needed?  We seem to grow new funky locking algorithms 
> > > > exponentially, while people already have a hard time understanding the 
> > > > existing ones.
> > > 
> > > yes, it's needed.
> > 
> > Would it be possible to come up with something common between this primitive
> > and the one that Oleg Nesterov put together for Jens Axboe?
> > 
> > http://lkml.org/lkml/2006/11/29/330
> > 
> > Oleg's approach acquires a lock on the update side, which Peter would
> > not want in the uncontended case -- but perhaps there is some way to
> > make Oleg's approach be able to safely test both counters so as to
> > avoid acquiring the lock if there are no readers.
> > 
> > Oleg, any chance of this working?  I believe it does, but have not
> > thought it through fully.
> 
> I think no. From the quick reading, barrier_sync() and qrcu/srcu are
> quite different. Consider:
> 
> barrier_lock()
> 
>   barrier_sync();
> 
> barrier_unlock();
>   ... wake up ...
>   barrier_lock();
> 
>   schedule again
> 
> The last "schedule again" would be a BUG for qrcu/srcu, but probably
> it is ok for barrier_sync().

Yes, that would be ok.

>  It looks like barrier_sync() is more a
> rw semaphore biased to readers.

Indeed, the locked sections are designed to be the rare case.

> A couple of minor off-topic notes,
> 
> +static inline void barrier_unlock(struct barrier *b)
> +{
> +       smp_wmb();
> +       if (atomic_dec_and_test(&b->count))
> +               __wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> 
> This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could
> be NULL.
> 
> +static inline void barrier_sync(struct barrier *b)
> +{
> +       might_sleep();
> +
> +       if (unlikely(atomic_read(&b->count))) {
> +               DEFINE_WAIT(wait);
> +               prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +               while (atomic_read(&b->count))
> +                       schedule();
> +               finish_wait(&b->wait, &wait);
> +       }
> +}
> 
> This should be open-coded wait_event(), but wrong! With the scenario above this
> can hang forever! because the first wake_up removes the task from the b->wait.

This would be me struggling with the waitqueue API, its all a tad
confusing at first look.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Oleg Nesterov
On 02/01, Oleg Nesterov wrote:
> 
> +static inline void barrier_sync(struct barrier *b)
> +{
> +       might_sleep();
> +
> +       if (unlikely(atomic_read(&b->count))) {
> +               DEFINE_WAIT(wait);
> +               prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +               while (atomic_read(&b->count))
> +                       schedule();
> +               finish_wait(&b->wait, &wait);
> +       }
> +}
> 
> This should be open-coded wait_event(), but wrong! With the scenario above this
> can hang forever! because the first wake_up removes the task from the b->wait.

I am afraid I was not clear (as usual :). prepare_to_wait means
autoremove_wake_function. So, if barrier_unlock() wakes up the caller of
barrier_sync(), it will be removed from b->wait. If it goes to schedule()
because another barrier_lock() incremented b->count, we can't wake it via
__wake_up(&b->wait, ...).

Oleg.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Oleg Nesterov
On 01/31, Paul E. McKenney wrote:
>
> On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > 
> > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > 
> > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > This barrier thing is constructed so that it will not write in the 
> > > > sync() condition (the hot path) when there are no active lock 
> > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > > > this will work out in relation to PI. We might track those in the 
> > > > barrier scope and boost those by the max prio of the blockers.
> > > 
> > > Is this really needed?  We seem to grow new funky locking algorithms 
> > > exponentially, while people already have a hard time understanding the 
> > > existing ones.
> > 
> > yes, it's needed.
> 
> Would it be possible to come up with something common between this primitive
> and the one that Oleg Nesterov put together for Jens Axboe?
> 
>   http://lkml.org/lkml/2006/11/29/330
> 
> Oleg's approach acquires a lock on the update side, which Peter would
> not want in the uncontended case -- but perhaps there is some way to
> make Oleg's approach be able to safely test both counters so as to
> avoid acquiring the lock if there are no readers.
> 
> Oleg, any chance of this working?  I believe it does, but have not
> thought it through fully.

I think no. From the quick reading, barrier_sync() and qrcu/srcu are
quite different. Consider:

barrier_lock()

barrier_sync();

barrier_unlock();
... wake up ...
barrier_lock();

schedule again

The last "schedule again" would be a BUG for qrcu/srcu, but probably
it is ok for barrier_sync(). It looks like barrier_sync() is more a
rw semaphore biased to readers.

A couple of minor off-topic notes,

+static inline void barrier_unlock(struct barrier *b)
+{
+	smp_wmb();
+	if (atomic_dec_and_test(&b->count))
+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);

This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could be
NULL.

+static inline void barrier_sync(struct barrier *b)
+{
+	might_sleep();
+
+	if (unlikely(atomic_read(&b->count))) {
+		DEFINE_WAIT(wait);
+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
+		while (atomic_read(&b->count))
+			schedule();
+		finish_wait(&b->wait, &wait);
+	}
+}

This looks like an open-coded wait_event(), but it is wrong! With the scenario
above this can hang forever, because the first wake_up removes the task from
b->wait.

Oleg.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Paul E. McKenney
On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> 
> * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > This barrier thing is constructed so that it will not write in the 
> > > sync() condition (the hot path) when there are no active lock 
> > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > > this will work out in relation to PI. We might track those in the 
> > > barrier scope and boost those by the max prio of the blockers.
> > 
> > Is this really needed?  We seem to grow new funky locking algorithms 
> > exponentially, while people already have a hard time understanding the 
> > existing ones.
> 
> yes, it's needed.

Would it be possible to come up with something common between this primitive
and the one that Oleg Nesterov put together for Jens Axboe?

http://lkml.org/lkml/2006/11/29/330

Oleg's approach acquires a lock on the update side, which Peter would
not want in the uncontended case -- but perhaps there is some way to
make Oleg's approach be able to safely test both counters so as to
avoid acquiring the lock if there are no readers.

Oleg, any chance of this working?  I believe it does, but have not
thought it through fully.

If it does turn out that we cannot converge these, I believe that
Peter's implementation needs an smp_mb() at both the beginning
and the end of barrier_sync().  Without the first smp_mb(), the
test in barrier_sync() might precede the prior change, and without
the second smp_mb() the barrier_sync() might slide after the following
cleanup code.  (But I could easily be misunderstanding the code
using barrier_sync().)

Thanx, Paul
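[Editor's sketch, not posted in the thread.] Concretely, the placement Paul suggests would land roughly like this, reusing barrier_sync() from patch 3/7:

```c
/* Sketch only: barrier_sync() with the two barriers Paul proposes. */
static inline void barrier_sync(struct barrier *b)
{
	might_sleep();

	smp_mb();	/* order prior changes before the b->count test */

	if (unlikely(atomic_read(&b->count))) {
		DEFINE_WAIT(wait);
		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
		while (atomic_read(&b->count))
			schedule();
		finish_wait(&b->wait, &wait);
	}

	smp_mb();	/* order the test before the following cleanup code */
}
```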


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Oleg Nesterov
On 02/01, Oleg Nesterov wrote:
> 
> +static inline void barrier_sync(struct barrier *b)
> +{
> +	might_sleep();
> +
> +	if (unlikely(atomic_read(&b->count))) {
> +		DEFINE_WAIT(wait);
> +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +		while (atomic_read(&b->count))
> +			schedule();
> +		finish_wait(&b->wait, &wait);
> +	}
> +}
> 
> This looks like an open-coded wait_event(), but it is wrong! With the
> scenario above this can hang forever, because the first wake_up removes
> the task from b->wait.

I am afraid I was not clear (as usual :). prepare_to_wait uses
autoremove_wake_function. So, if barrier_unlock() wakes up the caller of
barrier_sync(), it will be removed from b->wait. If it then goes to
schedule() because another barrier_lock() incremented b->count, we can't
wake it via __wake_up(&b->wait, ...).

Oleg.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Peter Zijlstra
On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> > 
> > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > 
> > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > This barrier thing is constructed so that it will not write in the 
> > > > > sync() condition (the hot path) when there are no active lock 
> > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > > > > this will work out in relation to PI. We might track those in the 
> > > > > barrier scope and boost those by the max prio of the blockers.
> > > > 
> > > > Is this really needed?  We seem to grow new funky locking algorithms 
> > > > exponentially, while people already have a hard time understanding the 
> > > > existing ones.
> > > 
> > > yes, it's needed.
> > 
> > Would it be possible to come up with something common between this primitive
> > and the one that Oleg Nesterov put together for Jens Axboe?
> > 
> >   http://lkml.org/lkml/2006/11/29/330
> > 
> > Oleg's approach acquires a lock on the update side, which Peter would
> > not want in the uncontended case -- but perhaps there is some way to
> > make Oleg's approach be able to safely test both counters so as to
> > avoid acquiring the lock if there are no readers.
> > 
> > Oleg, any chance of this working?  I believe it does, but have not
> > thought it through fully.
> 
> I think no. From the quick reading, barrier_sync() and qrcu/srcu are
> quite different. Consider:
> 
> barrier_lock()
> 
> 	barrier_sync();
> 
> barrier_unlock();
> 	... wake up ...
> 	barrier_lock();
> 
> 	schedule again
> 
> The last "schedule again" would be a BUG for qrcu/srcu, but probably
> it is ok for barrier_sync().

Yes, that would be ok.

> It looks like barrier_sync() is more a
> rw semaphore biased to readers.

Indeed, the locked sections are designed to be the rare case.

> A couple of minor off-topic notes,
> 
> +static inline void barrier_unlock(struct barrier *b)
> +{
> +	smp_wmb();
> +	if (atomic_dec_and_test(&b->count))
> +		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> 
> This is wake_up_all(&b->wait), yes? I don't understand why key == b, it
> could be NULL.
> 
> +static inline void barrier_sync(struct barrier *b)
> +{
> +	might_sleep();
> +
> +	if (unlikely(atomic_read(&b->count))) {
> +		DEFINE_WAIT(wait);
> +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +		while (atomic_read(&b->count))
> +			schedule();
> +		finish_wait(&b->wait, &wait);
> +	}
> +}
> 
> This looks like an open-coded wait_event(), but it is wrong! With the
> scenario above this can hang forever, because the first wake_up removes
> the task from b->wait.

This would be me struggling with the waitqueue API, it's all a tad
confusing at first look.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Paul E. McKenney
On Wed, Jan 31, 2007 at 10:48:21PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> > On 01/31, Paul E. McKenney wrote:
> > > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > > This barrier thing is constructed so that it will not write in the 
> > > > > > sync() condition (the hot path) when there are no active lock 
> > > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > > > > > this will work out in relation to PI. We might track those in the 
> > > > > > barrier scope and boost those by the max prio of the blockers.
> > > > > 
> > > > > Is this really needed?  We seem to grow new funky locking algorithms 
> > > > > exponentially, while people already have a hard time understanding the 
> > > > > existing ones.
> > > > 
> > > > yes, it's needed.
> > > 
> > > Would it be possible to come up with something common between this primitive
> > > and the one that Oleg Nesterov put together for Jens Axboe?
> > > 
> > >   http://lkml.org/lkml/2006/11/29/330
> > > 
> > > Oleg's approach acquires a lock on the update side, which Peter would
> > > not want in the uncontended case -- but perhaps there is some way to
> > > make Oleg's approach be able to safely test both counters so as to
> > > avoid acquiring the lock if there are no readers.
> > > 
> > > Oleg, any chance of this working?  I believe it does, but have not
> > > thought it through fully.
> > 
> > I think no. From the quick reading, barrier_sync() and qrcu/srcu are
> > quite different. Consider:
> > 
> > barrier_lock()
> > 
> > 	barrier_sync();
> > 
> > barrier_unlock();
> > 	... wake up ...
> > 	barrier_lock();
> > 
> > 	schedule again
> > 
> > The last "schedule again" would be a BUG for qrcu/srcu, but probably
> > it is ok for barrier_sync().
> 
> Yes, that would be ok.

The wakeup in barrier_sync() would mean that the counter was zero
at some point in the past.  The counter would then be rechecked, and
if it were still zero, barrier_sync() would invoke finish_wait() and
then return -- but the counter might well become non-zero in the
meantime, right?

So given that barrier_sync() is permitted to return after the counter
becomes non-zero, why can't it just rely on the fact that barrier_unlock()
saw it as zero not long in the past?

> > It looks like barrier_sync() is more a
> > rw semaphore biased to readers.
> 
> Indeed, the locked sections are designed to be the rare case.

OK -- but barrier_sync() just waits for readers, it doesn't exclude them.

If all barrier_sync() needs to do is to wait until all pre-existing
barrier_lock()/barrier_unlock() pairs complete, it seems to me to
be compatible with qrcu's semantics.

So what am I missing here?

Thanx, Paul

> > A couple of minor off-topic notes,
> > 
> > +static inline void barrier_unlock(struct barrier *b)
> > +{
> > +	smp_wmb();
> > +	if (atomic_dec_and_test(&b->count))
> > +		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> > 
> > This is wake_up_all(&b->wait), yes? I don't understand why key == b, it
> > could be NULL.
> > 
> > +static inline void barrier_sync(struct barrier *b)
> > +{
> > +	might_sleep();
> > +
> > +	if (unlikely(atomic_read(&b->count))) {
> > +		DEFINE_WAIT(wait);
> > +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +		while (atomic_read(&b->count))
> > +			schedule();
> > +		finish_wait(&b->wait, &wait);
> > +	}
> > +}
> > 
> > This looks like an open-coded wait_event(), but it is wrong! With the
> > scenario above this can hang forever, because the first wake_up removes
> > the task from b->wait.
> 
> This would be me struggling with the waitqueue API, it's all a tad
> confusing at first look.
 


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Peter Zijlstra
On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:

> The wakeup in barrier_sync() would mean that the counter was zero
> at some point in the past.  The counter would then be rechecked, and
> if it were still zero, barrier_sync() would invoke finish_wait() and
> then return -- but the counter might well become non-zero in the
> meantime, right?
> 
> So given that barrier_sync() is permitted to return after the counter
> becomes non-zero, why can't it just rely on the fact that barrier_unlock()
> saw it as zero not long in the past?
> 
> > > It looks like barrier_sync() is more a
> > > rw semaphore biased to readers.
> > 
> > Indeed, the locked sections are designed to be the rare case.
> 
> OK -- but barrier_sync() just waits for readers, it doesn't exclude them.
> 
> If all barrier_sync() needs to do is to wait until all pre-existing
> barrier_lock()/barrier_unlock() pairs complete, it seems to me to
> be compatible with qrcu's semantics.
> 
> So what am I missing here?

I might be the one missing stuff, I'll have a hard look at qrcu.

The intent was that barrier_sync() would not write to memory when there
are no active locked sections, so that the cacheline can stay shared,
thus keeping it fast.

If qrcu does exactly this, then yes we have a match.



Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-31 Thread Paul E. McKenney
On Thu, Feb 01, 2007 at 01:03:09AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:
> 
> > The wakeup in barrier_sync() would mean that the counter was zero
> > at some point in the past.  The counter would then be rechecked, and
> > if it were still zero, barrier_sync() would invoke finish_wait() and
> > then return -- but the counter might well become non-zero in the
> > meantime, right?
> > 
> > So given that barrier_sync() is permitted to return after the counter
> > becomes non-zero, why can't it just rely on the fact that barrier_unlock()
> > saw it as zero not long in the past?
> > 
> > > > It looks like barrier_sync() is more a
> > > > rw semaphore biased to readers.
> > > 
> > > Indeed, the locked sections are designed to be the rare case.
> > 
> > OK -- but barrier_sync() just waits for readers, it doesn't exclude them.
> > 
> > If all barrier_sync() needs to do is to wait until all pre-existing
> > barrier_lock()/barrier_unlock() pairs complete, it seems to me to
> > be compatible with qrcu's semantics.
> > 
> > So what am I missing here?
> 
> I might be the one missing stuff, I'll have a hard look at qrcu.
> 
> The intent was that barrier_sync() would not write to memory when there
> are no active locked sections, so that the cacheline can stay shared,
> thus keeping it fast.
> 
> If qrcu does exactly this, then yes we have a match.

QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
do what you want, as it acquires the lock unconditionally.  I am proposing
that synchronize_qrcu() change to something like the following:

void synchronize_qrcu(struct qrcu_struct *qp)
{
	int idx;

	smp_mb();

	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
		smp_rmb();
		if (atomic_read(&qp->ctr[0]) +
		    atomic_read(&qp->ctr[1]) <= 1)
			goto out;
	}

	mutex_lock(&qp->mutex);
	idx = qp->completed & 0x1;
	atomic_inc(qp->ctr + (idx ^ 0x1));
	/* Reduce the likelihood that qrcu_read_lock() will loop */
	smp_mb__after_atomic_inc();
	qp->completed++;

	atomic_dec(qp->ctr + idx);
	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
	mutex_unlock(&qp->mutex);
out:
	smp_mb();
}

For the first "if" to give a false positive, a concurrent switch had
to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
was two at the time of the first atomic_read(), but then qp->completed
switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
of the second atomic_read().  The only way the second "if" can give us a
false positive is if there was another change to qp->completed in the
meantime -- but that means that all of the pre-existing qrcu_read_lock()
holders must have gotten done, otherwise the second switch could not
have happened.  Yes, you do incur three memory barriers on the fast
path, but the best you could hope for with your approach was two of them
(unless I am confused about how you were using barrier_sync()).

Oleg, does this look safe?

Ugly at best, I know, but I do very much sympathize with Christoph's
desire to keep the number of synchronization primitives down to a
dull roar.  ;-)

Thanx, Paul


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-28 Thread Christoph Hellwig
On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > This barrier thing is constructed so that it will not write in the 
> > > sync() condition (the hot path) when there are no active lock 
> > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > > this will work out in relation to PI. We might track those in the 
> > > barrier scope and boost those by the max prio of the blockers.
> > 
> > Is this really needed?  We seem to grow new funky locking algorithms 
> > exponentially, while people already have a hard time understanding the 
> > existing ones.
> 
> yes, it's needed.

Thanks for the wonderful and indepth explanation </sarcasm>


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-28 Thread Ingo Molnar

* Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > This barrier thing is constructed so that it will not write in the 
> > sync() condition (the hot path) when there are no active lock 
> > sections; thus avoiding cacheline bouncing. -- I'm just not sure how 
> > this will work out in relation to PI. We might track those in the 
> > barrier scope and boost those by the max prio of the blockers.
> 
> Is this really needed?  We seem to grow new funky locking algorithms 
> exponentially, while people already have a hard time understanding the 
> existing ones.

yes, it's needed.

Ingo


Re: [PATCH 3/7] barrier: a scalable synchonisation barrier

2007-01-28 Thread Christoph Hellwig
On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> This barrier thing is constructed so that it will not write in the sync()
> condition (the hot path) when there are no active lock sections; thus avoiding
> cacheline bouncing. -- I'm just not sure how this will work out in relation to
> PI. We might track those in the barrier scope and boost those by the max prio
> of the blockers.

Is this really needed?  We seem to grow new funky locking algorithms
exponentially, while people already have a hard time understanding
the existing ones.


