Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On Sat, Feb 03, 2007 at 10:24:04PM -0500, Alan Stern wrote:
> On Sat, 3 Feb 2007, Paul E. McKenney wrote:
> > > And another note: this all assumes that STORE-MB-LOAD works "correctly",
> > > yes?
> > > We have other code which relies on that, should not be a problem.
> >
> > We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt
> > of the University of Pennsylvania, and Vijay Saraswat of IBM Research
> > towards a "universal memory model" that accommodates all machines.
> > Currently, it does in fact handle store-mb-load the way we all want,
> > thankfully!
>
> We should add that many places in the kernel do depend on proper behavior
> for this data access pattern.  So whatever "universal memory model" we end
> up with, it had better handle the pattern correctly if Linux is to
> support it.

Agreed!

> It's interesting to note, however, that this does exclude simple MESI.

Yep!  And also a number of compiler optimizations, as it turns out.  ;-)

There is a tension between nice-to-software memory-barrier properties on
the one hand and easily understood code on the other.  But I guess that
this is true of pretty much any software tool.

						Thanx, Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On Sat, 3 Feb 2007, Paul E. McKenney wrote:
> > And another note: this all assumes that STORE-MB-LOAD works "correctly",
> > yes?
> > We have other code which relies on that, should not be a problem.
>
> We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt
> of the University of Pennsylvania, and Vijay Saraswat of IBM Research
> towards a "universal memory model" that accommodates all machines.
> Currently, it does in fact handle store-mb-load the way we all want,
> thankfully!

We should add that many places in the kernel do depend on proper behavior
for this data access pattern.  So whatever "universal memory model" we end
up with, it had better handle the pattern correctly if Linux is to support
it.

It's interesting to note, however, that this does exclude simple MESI.

Alan Stern
Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On Sat, Feb 03, 2007 at 07:38:50PM +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> > do what you want, as it acquires the lock unconditionally.  I am proposing
> > that synchronize_qrcu() change to something like the following:
> >
> >	void synchronize_qrcu(struct qrcu_struct *qp)
> >	{
> >		int idx;
> >
> >		smp_mb();
> >
> >		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> >			smp_rmb();
> >			if (atomic_read(&qp->ctr[0]) +
> >			    atomic_read(&qp->ctr[1]) <= 1)
> >				goto out;
> >		}
> >
> >		mutex_lock(&qp->mutex);
> >		idx = qp->completed & 0x1;
> >		atomic_inc(qp->ctr + (idx ^ 0x1));
> >		/* Reduce the likelihood that qrcu_read_lock() will loop */
> >		smp_mb__after_atomic_inc();
>
> I almost forgot. Currently this smp_mb__after_atomic_inc() is not strictly
> needed, and the comment is correct. However, it becomes mandatory with your
> optimization. Without this barrier, it is possible that both checks above
> mutex_lock() will see the result of atomic_dec(), but not the atomic_inc().
>
> So, may I ask you to also update this comment?
>
>	/*
>	 * Reduce the likelihood that qrcu_read_lock() will loop
>	 * AND
>	 * make sure the second re-check above will see the result
>	 * of atomic_inc() if it sees the result of atomic_dec()
>	 */
>
> Something like this, I hope you will make it better.

Good catch!!!  I will make sure that this is reflected.

Also, validation is proceeding nicely, if extremely hoggishly.  The
validation program and output thus far are attached in case someone is
interested.  The three-readers/three-updaters case has worked its way up
to 16% of the memory on a 48GB machine.  ;-)  If it overflows, I will
see if I can get someone to convert it to VHDL and run hardware
validation tools on it.  This turned out to be necessary for the -rt
implementation of RCU...

> And another note: this all assumes that STORE-MB-LOAD works "correctly",
> yes?
> We have other code which relies on that, should not be a problem.

We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt
of the University of Pennsylvania, and Vijay Saraswat of IBM Research
towards a "universal memory model" that accommodates all machines.
Currently, it does in fact handle store-mb-load the way we all want,
thankfully!

Actually, the other guys are doing most of the formalism; my main role
has been programming a very stupid checker based on their formalisms and
yelling at them when something bad happens.  See the directory in the
x10 project on SourceForge if you want more info, but be warned that the
checker's UI really sucks.  Its input and output formats are abominations
that could only have been dreamed up by someone who started out on
punchcards and early-70s BASIC.  Not pretty, but it does a good job of
letting people know what I think the formalism is saying!

There are some people working on a Prolog program called jmmsolve that
has a much nicer input format, but I need to prototype memory barriers
before they will incorporate them (currently, each statement acts as if
it had smp_mb() before and after it).  Also, their output is as yet
incomprehensible to me.

						Thanx, Paul

> (Alan Stern cc'ed).
>
> Oleg.

qrcuval.2007.02.03a.tgz
Description: qrcuval.2007.02.03a.tgz
Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On 01/31, Paul E. McKenney wrote:
>
> QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> do what you want, as it acquires the lock unconditionally.  I am proposing
> that synchronize_qrcu() change to something like the following:
>
>	void synchronize_qrcu(struct qrcu_struct *qp)
>	{
>		int idx;
>
>		smp_mb();
>
>		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
>			smp_rmb();
>			if (atomic_read(&qp->ctr[0]) +
>			    atomic_read(&qp->ctr[1]) <= 1)
>				goto out;
>		}
>
>		mutex_lock(&qp->mutex);
>		idx = qp->completed & 0x1;
>		atomic_inc(qp->ctr + (idx ^ 0x1));
>		/* Reduce the likelihood that qrcu_read_lock() will loop */
>		smp_mb__after_atomic_inc();

I almost forgot. Currently this smp_mb__after_atomic_inc() is not strictly
needed, and the comment is correct. However, it becomes mandatory with your
optimization. Without this barrier, it is possible that both checks above
mutex_lock() will see the result of atomic_dec(), but not the atomic_inc().

So, may I ask you to also update this comment?

	/*
	 * Reduce the likelihood that qrcu_read_lock() will loop
	 * AND
	 * make sure the second re-check above will see the result
	 * of atomic_inc() if it sees the result of atomic_dec()
	 */

Something like this, I hope you will make it better.

And another note: this all assumes that STORE-MB-LOAD works "correctly",
yes?  We have other code which relies on that, should not be a problem.

(Alan Stern cc'ed).

Oleg.
Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On Fri, Feb 02, 2007 at 02:56:12PM +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> >
> > > >	void synchronize_qrcu(struct qrcu_struct *qp)
> > > >	{
> > > >		int idx;
> > > >
> > > >		smp_mb();
> > > >
> > > >		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > >			smp_rmb();
> > > >			if (atomic_read(&qp->ctr[0]) +
> > > >			    atomic_read(&qp->ctr[1]) <= 1)
> > > >				goto out;
> > > >		}
> > > >
> > > >		mutex_lock(&qp->mutex);
> > > >		idx = qp->completed & 0x1;
> > > >		atomic_inc(qp->ctr + (idx ^ 0x1));
> > > >		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > >		smp_mb__after_atomic_inc();
> > > >		qp->completed++;
> > > >
> > > >		atomic_dec(qp->ctr + idx);
> > > >		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > >		mutex_unlock(&qp->mutex);
> > > >	out:
> > > >		smp_mb();
> > > >	}
> > > >
> > > > For the first "if" to give a false positive, a concurrent switch had
> > > > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > > > was two at the time of the first atomic_read(), but then qp->completed
> > > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > > of the second atomic_read.  The only way the second "if" can give us a
> > > > false positive is if there was another change to qp->completed in the
> > > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > > holders must have gotten done, otherwise the second switch could not
> > > > have happened.  Yes, you do incur three memory barriers on the fast
> > > > path, but the best you could hope for with your approach was two of them
> > > > (unless I am confused about how you were using barrier_sync()).
>
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive
> really means a concurrent switch. And we can check twice - excellent idea!

Well, if it ends up really working.  ;-)  This one needs more than just
testing -- I will put together a Promela model that does a full
state-space search for races.

I do have one fall-back, namely putting both counters into a single
aligned long.  But I would like to avoid this, since there might be
architectures out there that cannot cleanly store into half-longs.
Such architectures would have to use atomic ops.  :-/

> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong.
> >
> > Good!  Me, I didn't want to worry about it unless someone needed it.
> > Which it now appears they do.  ;-)
>
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have
> enough brains! (of course, I hate you :)

Coming from you, that is high praise indeed!!!  ;-)  Now if it really
does work...

> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under
> > > ->mutex, was this needed for this optimization to work? I am asking
> > > because I can't understand how it can make any difference.
> >
> > Before, we held the lock, so we could just check the single current
> > element.  Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
>
> Ah, my question was different. The current version of qrcu does
>
>	mutex_lock(&qp->mutex);
>
>	idx = qp->completed & 0x1;
>	if (atomic_read(qp->ctr + idx) == 1)	// fast path
>		return;
>
>	...
>
> and it seems to me that we can retain this fastpath even with your
> optimization, no? Surely, it is not so important, but it is nearly free.

Ah!  This does make sense, excellent point!!!

> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only
> have P-4 ht).

I will do the Promela model, and if that works, I will submit a patch.

						Thanx, Paul

> Peter, do you think you can use qrcu?
>
> Oleg.
Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On Fri, 2007-02-02 at 14:56 +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> >
> > > >	void synchronize_qrcu(struct qrcu_struct *qp)
> > > >	{
> > > >		int idx;
> > > >
> > > >		smp_mb();
> > > >
> > > >		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > >			smp_rmb();
> > > >			if (atomic_read(&qp->ctr[0]) +
> > > >			    atomic_read(&qp->ctr[1]) <= 1)
> > > >				goto out;
> > > >		}
> > > >
> > > >		mutex_lock(&qp->mutex);
> > > >		idx = qp->completed & 0x1;
> > > >		atomic_inc(qp->ctr + (idx ^ 0x1));
> > > >		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > >		smp_mb__after_atomic_inc();
> > > >		qp->completed++;
> > > >
> > > >		atomic_dec(qp->ctr + idx);
> > > >		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > >		mutex_unlock(&qp->mutex);
> > > >	out:
> > > >		smp_mb();
> > > >	}
> > > >
> > > > For the first "if" to give a false positive, a concurrent switch had
> > > > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > > > was two at the time of the first atomic_read(), but then qp->completed
> > > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > > of the second atomic_read.  The only way the second "if" can give us a
> > > > false positive is if there was another change to qp->completed in the
> > > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > > holders must have gotten done, otherwise the second switch could not
> > > > have happened.  Yes, you do incur three memory barriers on the fast
> > > > path, but the best you could hope for with your approach was two of them
> > > > (unless I am confused about how you were using barrier_sync()).
>
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive
> really means a concurrent switch. And we can check twice - excellent idea!
>
> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong.
> >
> > Good!  Me, I didn't want to worry about it unless someone needed it.
> > Which it now appears they do.  ;-)
>
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have
> enough brains! (of course, I hate you :)
>
> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under
> > > ->mutex, was this needed for this optimization to work? I am asking
> > > because I can't understand how it can make any difference.
> >
> > Before, we held the lock, so we could just check the single current
> > element.  Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
>
> Ah, my question was different. The current version of qrcu does
>
>	mutex_lock(&qp->mutex);
>
>	idx = qp->completed & 0x1;
>	if (atomic_read(qp->ctr + idx) == 1)	// fast path
>		return;
>
>	...
>
> and it seems to me that we can retain this fastpath even with your
> optimization, no? Surely, it is not so important, but it is nearly free.
>
> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only
> have P-4 ht).
>
> Peter, do you think you can use qrcu?

Yes, this looks very much like what I need. Awesome work!
Re: [PATCH 3/7] barrier: a scalable synchronisation barrier
On 02/01, Paul E. McKenney wrote:
>
> > >	void synchronize_qrcu(struct qrcu_struct *qp)
> > >	{
> > >		int idx;
> > >
> > >		smp_mb();
> > >
> > >		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > >			smp_rmb();
> > >			if (atomic_read(&qp->ctr[0]) +
> > >			    atomic_read(&qp->ctr[1]) <= 1)
> > >				goto out;
> > >		}
> > >
> > >		mutex_lock(&qp->mutex);
> > >		idx = qp->completed & 0x1;
> > >		atomic_inc(qp->ctr + (idx ^ 0x1));
> > >		/* Reduce the likelihood that qrcu_read_lock() will loop */
> > >		smp_mb__after_atomic_inc();
> > >		qp->completed++;
> > >
> > >		atomic_dec(qp->ctr + idx);
> > >		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > >		mutex_unlock(&qp->mutex);
> > >	out:
> > >		smp_mb();
> > >	}
> > >
> > > For the first "if" to give a false positive, a concurrent switch had
> > > to have happened.  For example, qp->ctr[0] was zero and qp->ctr[1]
> > > was two at the time of the first atomic_read(), but then qp->completed
> > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > of the second atomic_read.  The only way the second "if" can give us a
> > > false positive is if there was another change to qp->completed in the
> > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > holders must have gotten done, otherwise the second switch could not
> > > have happened.  Yes, you do incur three memory barriers on the fast
> > > path, but the best you could hope for with your approach was two of them
> > > (unless I am confused about how you were using barrier_sync()).

Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
another >= 1. == 1 means we have no active readers. So the false positive
really means a concurrent switch. And we can check twice - excellent idea!

> > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > qp->mutex. Now I think I was wrong.
>
> Good!  Me, I didn't want to worry about it unless someone needed it.
> Which it now appears they do.  ;-)

No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
So I decided it is not possible. And now you show that I just don't have
enough brains! (of course, I hate you :)

> > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under
> > ->mutex, was this needed for this optimization to work? I am asking
> > because I can't understand how it can make any difference.
>
> Before, we held the lock, so we could just check the single current
> element.  Now we don't hold the lock, so we need to check both elements.
> So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> nested "if" statements that test both elements.

Ah, my question was different. The current version of qrcu does

	mutex_lock(&qp->mutex);

	idx = qp->completed & 0x1;
	if (atomic_read(qp->ctr + idx) == 1)	// fast path
		return;

	...

and it seems to me that we can retain this fastpath even with your
optimization, no? Surely, it is not so important, but it is nearly free.

Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only
have P-4 ht).

Peter, do you think you can use qrcu?

Oleg.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Fri, Feb 02, 2007 at 02:56:12PM +0300, Oleg Nesterov wrote: On 02/01, Paul E. McKenney wrote:

	void synchronize_qrcu(struct qrcu_struct *qp)
	{
		int idx;

		smp_mb();

		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
			smp_rmb();
			if (atomic_read(&qp->ctr[0]) +
			    atomic_read(&qp->ctr[1]) <= 1)
				goto out;
		}

		mutex_lock(&qp->mutex);
		idx = qp->completed & 0x1;
		atomic_inc(qp->ctr + (idx ^ 0x1));
		/* Reduce the likelihood that qrcu_read_lock() will loop */
		smp_mb__after_atomic_inc();
		qp->completed++;

		atomic_dec(qp->ctr + idx);
		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
		mutex_unlock(&qp->mutex);
	out:
		smp_mb();
	}

For the first "if" to give a false positive, a concurrent switch had to have happened. For example, qp->ctr[0] was zero and qp->ctr[1] was two at the time of the first atomic_read(), but then qp->completed switched so that both qp->ctr[0] and qp->ctr[1] were one at the time of the second atomic_read(). The only way the second "if" can give us a false positive is if there was another change to qp->completed in the meantime -- but that means that all of the pre-existing qrcu_read_lock() holders must have gotten done, otherwise the second switch could not have happened. Yes, you do incur three memory barriers on the fast path, but the best you could hope for with your approach was two of them (unless I am confused about how you were using barrier_sync()).

Yes. Without synchronize_qrcu() in between, one of the counters should be == 0, the other >= 1. A sum == 1 means we have no active readers. So the false positive really means a concurrent switch. And we can check twice - excellent idea!

Well, if it ends up really working. ;-) This one needs more than just testing -- I will put together a Promela model that does a full state-space search for races.

I do have one fall-back, namely putting both counters into a single aligned long. But I would like to avoid this, since there might be architectures out there that cannot cleanly store into half-longs. Such architectures would have to use atomic ops.
:-/

While doing qrcu, somehow I convinced myself we can't optimize out taking qp->mutex. Now I think I was wrong.

Good! Me, I didn't want to worry about it unless someone needed it. Which it now appears they do. ;-)

No. I do remember I tried hard to optimize out taking qp->mutex, but failed. So I decided it is not possible. And now you show that I just don't have enough brains! (of course, I hate you :)

Coming from you, that is high praise indeed!!! ;-) Now if it really does work...

Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex, was this needed for this optimization to work? I am asking because I can't understand how it can make any difference.

Before, we held the lock, so we could just check the single current element. Now we don't hold the lock, so we need to check both elements. So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the nested "if" statements that test both elements.

Ah, my question was different. The current version of qrcu does

	mutex_lock(&qp->mutex);
	idx = qp->completed & 0x1;
	if (atomic_read(qp->ctr + idx) == 1)	// fast path
		return;
	...

and it seems to me that we can retain this fastpath even with your optimization, no? Surely, it is not so important, but it is nearly free.

Ah! This does make sense, excellent point!!!

Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have P-4 ht).

I will do the Promela model, and if that works, I will submit a patch.

						Thanx, Paul

Peter, do you think you can use qrcu?

Oleg.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Thu, Feb 01, 2007 at 07:00:10PM +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> > do what you want, as it acquires the lock unconditionally. I am proposing
> > that synchronize_qrcu() change to something like the following:
> >
> > void synchronize_qrcu(struct qrcu_struct *qp)
> > {
> > 	int idx;
> >
> > 	smp_mb();
> >
> > 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > 		smp_rmb();
> > 		if (atomic_read(&qp->ctr[0]) +
> > 		    atomic_read(&qp->ctr[1]) <= 1)
> > 			goto out;
> > 	}
> >
> > 	mutex_lock(&qp->mutex);
> > 	idx = qp->completed & 0x1;
> > 	atomic_inc(qp->ctr + (idx ^ 0x1));
> > 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> > 	smp_mb__after_atomic_inc();
> > 	qp->completed++;
> >
> > 	atomic_dec(qp->ctr + idx);
> > 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > 	mutex_unlock(&qp->mutex);
> > out:
> > 	smp_mb();
> > }
> >
> > For the first "if" to give a false positive, a concurrent switch had
> > to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> > was two at the time of the first atomic_read(), but then qp->completed
> > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > of the second atomic_read. The only way the second "if" can give us a
> > false positive is if there was another change to qp->completed in the
> > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > holders must have gotten done, otherwise the second switch could not
> > have happened. Yes, you do incur three memory barriers on the fast
> > path, but the best you could hope for with your approach was two of them
> > (unless I am confused about how you were using barrier_sync()).
>
> While doing qrcu, somehow I convinced myself we can't optimize out taking
> qp->mutex. Now I think I was wrong.

Good! Me, I didn't want to worry about it unless someone needed it. Which it now appears they do.
;-) > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex, > was this needed for this optimization to work? I am asking because I can't > understand how it can make any difference. Before, we held the lock, so we could just check the single current element. Now we don't hold the lock, so we need to check both elements. So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the nested "if" statements that test both elements. > > Oleg, does this look safe? > > Yes. But let me think more about this later, I've got a fever, and can't > think properly today :) Get well!!! ;-) And yes, the concurrency on the fastpath is nontrivial... Thanx, Paul
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On 01/31, Paul E. McKenney wrote:
>
> QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> do what you want, as it acquires the lock unconditionally. I am proposing
> that synchronize_qrcu() change to something like the following:
>
> void synchronize_qrcu(struct qrcu_struct *qp)
> {
> 	int idx;
>
> 	smp_mb();
>
> 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> 		smp_rmb();
> 		if (atomic_read(&qp->ctr[0]) +
> 		    atomic_read(&qp->ctr[1]) <= 1)
> 			goto out;
> 	}
>
> 	mutex_lock(&qp->mutex);
> 	idx = qp->completed & 0x1;
> 	atomic_inc(qp->ctr + (idx ^ 0x1));
> 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> 	smp_mb__after_atomic_inc();
> 	qp->completed++;
>
> 	atomic_dec(qp->ctr + idx);
> 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> 	mutex_unlock(&qp->mutex);
> out:
> 	smp_mb();
> }
>
> For the first "if" to give a false positive, a concurrent switch had
> to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> was two at the time of the first atomic_read(), but then qp->completed
> switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> of the second atomic_read. The only way the second "if" can give us a
> false positive is if there was another change to qp->completed in the
> meantime -- but that means that all of the pre-existing qrcu_read_lock()
> holders must have gotten done, otherwise the second switch could not
> have happened. Yes, you do incur three memory barriers on the fast
> path, but the best you could hope for with your approach was two of them
> (unless I am confused about how you were using barrier_sync()).

While doing qrcu, somehow I convinced myself we can't optimize out taking qp->mutex. Now I think I was wrong. Good!

Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex, was this needed for this optimization to work? I am asking because I can't understand how it can make any difference.

> Oleg, does this look safe?

Yes.
But let me think more about this later, I've got a fever, and can't think properly today :) Oleg.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Thu, Feb 01, 2007 at 01:03:09AM +0100, Peter Zijlstra wrote: > On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote: > > > The wakeup in barrier_sync() would mean that the counter was zero > > at some point in the past. The counter would then be rechecked, and > > if it were still zero, barrier_sync() would invoke finish_wait() and > > then return -- but the counter might well become non-zero in the > > meantime, right? > > > > So given that barrier_sync() is permitted to return after the counter > > becomes non-zero, why can't it just rely on the fact that barrier_unlock() > > saw it as zero not long in the past? > > > > > > It looks like barrier_sync() is more a > > > > rw semaphore biased to readers. > > > > > > Indeed, the locked sections are designed to be the rare case. > > > > OK -- but barrier_sync() just waits for readers, it doesn't exclude them. > > > > If all barrier_sync() needs to do is to wait for all pre-existing > > barrier_lock()/barrier_unlock() pairs to complete, it seems to me to > > be compatible with qrcu's semantics. > > > > So what am I missing here? > > I might be the one missing stuff, I'll have a hard look at qrcu. > > The intent was that barrier_sync() would not write to memory when there > are no active locked sections, so that the cacheline can stay shared, > thus keeping it fast. > > If qrcu does exactly this, then yes we have a match. QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't do what you want, as it acquires the lock unconditionally.
I am proposing that synchronize_qrcu() change to something like the following:

	void synchronize_qrcu(struct qrcu_struct *qp)
	{
		int idx;

		smp_mb();

		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
			smp_rmb();
			if (atomic_read(&qp->ctr[0]) +
			    atomic_read(&qp->ctr[1]) <= 1)
				goto out;
		}

		mutex_lock(&qp->mutex);
		idx = qp->completed & 0x1;
		atomic_inc(qp->ctr + (idx ^ 0x1));
		/* Reduce the likelihood that qrcu_read_lock() will loop */
		smp_mb__after_atomic_inc();
		qp->completed++;

		atomic_dec(qp->ctr + idx);
		__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
		mutex_unlock(&qp->mutex);
	out:
		smp_mb();
	}

For the first "if" to give a false positive, a concurrent switch had to have happened. For example, qp->ctr[0] was zero and qp->ctr[1] was two at the time of the first atomic_read(), but then qp->completed switched so that both qp->ctr[0] and qp->ctr[1] were one at the time of the second atomic_read. The only way the second "if" can give us a false positive is if there was another change to qp->completed in the meantime -- but that means that all of the pre-existing qrcu_read_lock() holders must have gotten done, otherwise the second switch could not have happened. Yes, you do incur three memory barriers on the fast path, but the best you could hope for with your approach was two of them (unless I am confused about how you were using barrier_sync()).

Oleg, does this look safe?

Ugly at best, I know, but I do very much sympathize with Christoph's desire to keep the number of synchronization primitives down to a dull roar. ;-)

						Thanx, Paul
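The lockless double-check above can be modeled in ordinary userspace C11 (a sketch only -- the function name is invented, and `atomic_thread_fence()` stands in for the kernel's `smp_mb()`/`smp_rmb()`):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the two qrcu counter elements: the current element
 * carries a baseline of 1, so an idle structure sums to exactly 1. */
static atomic_int ctr[2] = {1, 0};

/* Lockless fastpath of the proposed synchronize_qrcu(): if two fenced
 * reads both see a combined count <= 1, there can be no pre-existing
 * readers.  A false positive on the first read requires a concurrent
 * counter switch; a false positive on the second requires yet another
 * switch, which itself implies all prior readers finished. */
static bool grace_period_already_done(void)
{
    atomic_thread_fence(memory_order_seq_cst);          /* smp_mb() */
    if (atomic_load(&ctr[0]) + atomic_load(&ctr[1]) <= 1) {
        atomic_thread_fence(memory_order_acquire);      /* smp_rmb() */
        if (atomic_load(&ctr[0]) + atomic_load(&ctr[1]) <= 1) {
            atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */
            return true;    /* fastpath: skip the mutex entirely */
        }
    }
    return false;           /* readers present: fall back to the lock */
}
```

A reader entering via qrcu_read_lock() would increment the current element, pushing the sum to 2 and forcing the slow path; once it leaves, the sum drops back to 1 and the fastpath applies again.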
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote: > The wakeup in barrier_sync() would mean that the counter was zero > at some point in the past. The counter would then be rechecked, and > if it were still zero, barrier_sync() would invoke finish_wait() and > then return -- but the counter might well become non-zero in the > meantime, right? > > So given that barrier_sync() is permitted to return after the counter > becomes non-zero, why can't it just rely on the fact that barrier_unlock() > saw it as zero not long in the past? > > > > It looks like barrier_sync() is more a > > > rw semaphore biased to readers. > > > > Indeed, the locked sections are designed to be the rare case. > > OK -- but barrier_sync() just waits for readers, it doesn't exclude them. > > If all barrier_sync() needs to do is to wait for all pre-existing > barrier_lock()/barrier_unlock() pairs to complete, it seems to me to > be compatible with qrcu's semantics. > > So what am I missing here? I might be the one missing stuff, I'll have a hard look at qrcu. The intent was that barrier_sync() would not write to memory when there are no active locked sections, so that the cacheline can stay shared, thus keeping it fast. If qrcu does exactly this, then yes we have a match.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Wed, Jan 31, 2007 at 10:48:21PM +0100, Peter Zijlstra wrote: > On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote: > > On 01/31, Paul E. McKenney wrote: > > > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote: > > > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote: > > > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote: > > > > > > This barrier thing is constructed so that it will not write in the > > > > > > sync() condition (the hot path) when there are no active lock > > > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure > > > > > > how > > > > > > this will work out in relation to PI. We might track those in the > > > > > > barrier scope and boost those by the max prio of the blockers. > > > > > > > > > > Is this really needed? We seem to grow new funky locking algorithms > > > > > exponentially, while people already have a hard time understanding > > > > > the > > > > > existing ones. > > > > > > > > yes, it's needed. > > > > > > Would it be possible to come up with something common between this > > > primitive > > > and the one that Oleg Nesterov put together for Jens Axboe? > > > > > > http://lkml.org/lkml/2006/11/29/330 > > > > > > Oleg's approach acquires a lock on the update side, which Peter would > > > not want in the uncontended case -- but perhaps there is some way to > > > make Oleg's approach be able to safely test both counters so as to > > > avoid acquiring the lock if there are no readers. > > > > > > Oleg, any chance of this working? I believe it does, but have not > > > thought it through fully. > > > > I think no. From the quick reading, barrier_sync() and qrcu/srcu are > > quite different. Consider: > > > > barrier_lock() > > > > barrier_sync(); > > > > barrier_unlock(); > > ... wake up ... > > barrier_lock(); > > > > schedule again > > > > The last "schedule again" would be a BUG for qrcu/srcu, but probably > > it is ok for barrier_sync(). > > Yes, that would be ok. 
The wakeup in barrier_sync() would mean that the counter was zero at some point in the past. The counter would then be rechecked, and if it were still zero, barrier_sync() would invoke finish_wait() and then return -- but the counter might well become non-zero in the meantime, right?

So given that barrier_sync() is permitted to return after the counter becomes non-zero, why can't it just rely on the fact that barrier_unlock() saw it as zero not long in the past?

> > It looks like barrier_sync() is more a
> > rw semaphore biased to readers.
>
> Indeed, the locked sections are designed to be the rare case.

OK -- but barrier_sync() just waits for readers, it doesn't exclude them.

If all barrier_sync() needs to do is to wait for all pre-existing barrier_lock()/barrier_unlock() pairs to complete, it seems to me to be compatible with qrcu's semantics.

So what am I missing here?

						Thanx, Paul

> > A couple of minor off-topic notes,
> >
> > +static inline void barrier_unlock(struct barrier *b)
> > +{
> > +	smp_wmb();
> > +	if (atomic_dec_and_test(&b->count))
> > +		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> >
> > This is wake_up_all(&b->wait), yes? I don't understand why key == b, it
> > could be NULL.
> >
> > +static inline void barrier_sync(struct barrier *b)
> > +{
> > +	might_sleep();
> > +
> > +	if (unlikely(atomic_read(&b->count))) {
> > +		DEFINE_WAIT(wait);
> > +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +		while (atomic_read(&b->count))
> > +			schedule();
> > +		finish_wait(&b->wait, &wait);
> > +	}
> > +}
> >
> > This should be open-coded wait_event(), but it is wrong! With the scenario
> > above this can hang forever, because the first wake_up removes the task
> > from b->wait.

> This would be me struggling with the waitqueue API, it's all a tad
> confusing at first look.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote: > On 01/31, Paul E. McKenney wrote: > > > > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote: > > > > > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote: > > > > > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote: > > > > > This barrier thing is constructed so that it will not write in the > > > > > sync() condition (the hot path) when there are no active lock > > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how > > > > > this will work out in relation to PI. We might track those in the > > > > > barrier scope and boost those by the max prio of the blockers. > > > > > > > > Is this really needed? We seem to grow new funky locking algorithms > > > > exponentially, while people already have a hard time understanding the > > > > existing ones. > > > > > > yes, it's needed. > > > > Would it be possible to come up with something common between this primitive > > and the one that Oleg Nesterov put together for Jens Axboe? > > > > http://lkml.org/lkml/2006/11/29/330 > > > > Oleg's approach acquires a lock on the update side, which Peter would > > not want in the uncontended case -- but perhaps there is some way to > > make Oleg's approach be able to safely test both counters so as to > > avoid acquiring the lock if there are no readers. > > > > Oleg, any chance of this working? I believe it does, but have not > > thought it through fully. > > I think no. From the quick reading, barrier_sync() and qrcu/srcu are > quite different. Consider: > > barrier_lock() > > barrier_sync(); > > barrier_unlock(); > ... wake up ... > barrier_lock(); > > schedule again > > The last "schedule again" would be a BUG for qrcu/srcu, but probably > it is ok for barrier_sync(). Yes, that would be ok. > It looks like barrier_sync() is more a > rw semaphore biased to readers. Indeed, the locked sections are designed to be the rare case. 
> A couple of minor off-topic notes,
>
> +static inline void barrier_unlock(struct barrier *b)
> +{
> +	smp_wmb();
> +	if (atomic_dec_and_test(&b->count))
> +		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
>
> This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could
> be NULL.
>
> +static inline void barrier_sync(struct barrier *b)
> +{
> +	might_sleep();
> +
> +	if (unlikely(atomic_read(&b->count))) {
> +		DEFINE_WAIT(wait);
> +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +		while (atomic_read(&b->count))
> +			schedule();
> +		finish_wait(&b->wait, &wait);
> +	}
> +}
>
> This should be open-coded wait_event(), but it is wrong! With the scenario
> above this can hang forever, because the first wake_up removes the task
> from b->wait.

This would be me struggling with the waitqueue API, it's all a tad confusing at first look.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On 02/01, Oleg Nesterov wrote:
>
> +static inline void barrier_sync(struct barrier *b)
> +{
> +	might_sleep();
> +
> +	if (unlikely(atomic_read(&b->count))) {
> +		DEFINE_WAIT(wait);
> +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +		while (atomic_read(&b->count))
> +			schedule();
> +		finish_wait(&b->wait, &wait);
> +	}
> +}
>
> This should be open-coded wait_event(), but it is wrong! With the scenario
> above this can hang forever, because the first wake_up removes the task
> from b->wait.

I am afraid I was not clear (as usual :). prepare_to_wait means autoremove_wake_function. So, if barrier_unlock() wakes up the caller of barrier_sync(), it will be removed from b->wait. If it goes to schedule() because another barrier_lock() incremented b->count, we can't wake it via __wake_up(&b->wait, ...).

Oleg.
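The hang Oleg describes can be shown with a tiny single-threaded model (all names hypothetical -- this mimics only the autoremove semantics, not the real waitqueue API): the task is queued once, the first wakeup autoremoves it, and an open-coded loop that sleeps again without re-queuing itself can never be woken.

```c
#include <stdbool.h>
#include <stddef.h>

/* One-slot model of a wait queue whose wake function removes the
 * waiter, as autoremove_wake_function() does. */
struct toy_waitqueue {
    void *waiter;           /* NULL when nobody is queued */
};

static void toy_prepare_to_wait(struct toy_waitqueue *q, void *task)
{
    q->waiter = task;       /* enqueue ourselves */
}

/* Returns true if somebody was actually woken. */
static bool toy_wake_up(struct toy_waitqueue *q)
{
    if (q->waiter == NULL)
        return false;       /* wakeup lost: the sleeper would hang */
    q->waiter = NULL;       /* autoremove: waiter leaves the queue */
    return true;
}
```

barrier_sync()'s `while (atomic_read(&b->count)) schedule();` corresponds to sleeping again after the first toy_wake_up() without calling toy_prepare_to_wait() again; wait_event() avoids the bug by re-adding the task on every loop iteration.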
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On 01/31, Paul E. McKenney wrote:
>
> On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> >
> > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> >
> > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > This barrier thing is constructed so that it will not write in the
> > > > sync() condition (the hot path) when there are no active lock
> > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > > this will work out in relation to PI. We might track those in the
> > > > barrier scope and boost those by the max prio of the blockers.
> > >
> > > Is this really needed? We seem to grow new funky locking algorithms
> > > exponentially, while people already have a hard time understanding the
> > > existing ones.
> >
> > yes, it's needed.
>
> Would it be possible to come up with something common between this primitive
> and the one that Oleg Nesterov put together for Jens Axboe?
>
> http://lkml.org/lkml/2006/11/29/330
>
> Oleg's approach acquires a lock on the update side, which Peter would
> not want in the uncontended case -- but perhaps there is some way to
> make Oleg's approach be able to safely test both counters so as to
> avoid acquiring the lock if there are no readers.
>
> Oleg, any chance of this working? I believe it does, but have not
> thought it through fully.

I think no. From the quick reading, barrier_sync() and qrcu/srcu are quite different. Consider:

	barrier_lock()
		barrier_sync();
	barrier_unlock();
		... wake up ...
	barrier_lock();
		schedule again

The last "schedule again" would be a BUG for qrcu/srcu, but probably it is ok for barrier_sync(). It looks like barrier_sync() is more a rw semaphore biased to readers.

A couple of minor off-topic notes,

+static inline void barrier_unlock(struct barrier *b)
+{
+	smp_wmb();
+	if (atomic_dec_and_test(&b->count))
+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);

This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could be NULL.
+static inline void barrier_sync(struct barrier *b)
+{
+	might_sleep();
+
+	if (unlikely(atomic_read(&b->count))) {
+		DEFINE_WAIT(wait);
+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
+		while (atomic_read(&b->count))
+			schedule();
+		finish_wait(&b->wait, &wait);
+	}
+}

This should be open-coded wait_event(), but it is wrong! With the scenario above this can hang forever, because the first wake_up removes the task from b->wait.

Oleg.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote: > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote: > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote: > > > This barrier thing is constructed so that it will not write in the > > > sync() condition (the hot path) when there are no active lock > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how > > > this will work out in relation to PI. We might track those in the > > > barrier scope and boost those by the max prio of the blockers. > > > > Is this really needed? We seem to grow new funky locking algorithms > > exponentially, while people already have a hard time understanding the > > existing ones. > > yes, it's needed. Would it be possible to come up with something common between this primitive and the one that Oleg Nesterov put together for Jens Axboe? http://lkml.org/lkml/2006/11/29/330 Oleg's approach acquires a lock on the update side, which Peter would not want in the uncontended case -- but perhaps there is some way to make Oleg's approach be able to safely test both counters so as to avoid acquiring the lock if there are no readers. Oleg, any chance of this working? I believe it does, but have not thought it through fully. If it does turn out that we cannot converge these, I believe that Peter's implementation needs an smp_mb() at both the beginning and the end of barrier_sync(). Without the first smp_mb(), the test in barrier_sync() might precede the prior change, and without the second smp_mb() the barrier_sync() might slide after the following cleanup code. (But I could easily be misunderstanding the code using barrier_sync().) Thanx, Paul
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote: * Christoph Hellwig [EMAIL PROTECTED] wrote: On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote: This barrier thing is constructed so that it will not write in the sync() condition (the hot path) when there are no active lock sections; thus avoiding cacheline bouncing. -- I'm just not sure how this will work out in relation to PI. We might track those in the barrier scope and boost those by the max prio of the blockers. Is this really needed? We seem to grow new funky locking algorithms exponentially, while people already have a hard time understanding the existing ones. yes, it's needed. Would it be possible to come up with something common between this primitive and the one that Oleg Nesterov put together for Jens Axboe? http://lkml.org/lkml/2006/11/29/330 Oleg's approach acquires a lock on the update side, which Peter would not want in the uncontended case -- but perhaps there is some way to make Oleg's approach be able to safely test both counters so as to avoid acquiring the lock if there are no readers. Oleg, any chance of this working? I believe it does, but have not thought it through fully. If it does turn out that we cannot converge these, I believe that Peter's implementation needs an smp_mb() at both the beginning and the end of barrier_sync(). Without the first smp_mb(), the test in barrier_sync() might precede the prior change, and without the second smp_mb() the barrier_sync() might slide after the following cleanup code. (But I could easily be misunderstanding the code using barrier_sync().) Thanx, Paul - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On 01/31, Paul E. McKenney wrote:
> On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > This barrier thing is constructed so that it will not write in the
> > > > sync() condition (the hot path) when there are no active lock
> > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > > this will work out in relation to PI. We might track those in the
> > > > barrier scope and boost those by the max prio of the blockers.
> > >
> > > Is this really needed?  We seem to grow new funky locking algorithms
> > > exponentially, while people already have a hard time understanding the
> > > existing ones.
> >
> > yes, it's needed.
>
> Would it be possible to come up with something common between this
> primitive and the one that Oleg Nesterov put together for Jens Axboe?
>
> 	http://lkml.org/lkml/2006/11/29/330
>
> Oleg's approach acquires a lock on the update side, which Peter would
> not want in the uncontended case -- but perhaps there is some way to
> make Oleg's approach be able to safely test both counters so as to
> avoid acquiring the lock if there are no readers.  Oleg, any chance of
> this working?  I believe it does, but have not thought it through
> fully.

I think no. From a quick reading, barrier_sync() and qrcu/srcu are
quite different. Consider:

	barrier_lock()
				barrier_sync();
	barrier_unlock();
				... wake up ...
	barrier_lock();
				schedule again

The last "schedule again" would be a BUG for qrcu/srcu, but probably it
is ok for barrier_sync(). It looks like barrier_sync() is more a rw
semaphore biased to readers.

A couple of minor off-topic notes,

	+static inline void barrier_unlock(struct barrier *b)
	+{
	+	smp_wmb();
	+	if (atomic_dec_and_test(&b->count))
	+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);

This is wake_up_all(&b->wait), yes? I don't understand why key == b, it
could be NULL.

	+static inline void barrier_sync(struct barrier *b)
	+{
	+	might_sleep();
	+
	+	if (unlikely(atomic_read(&b->count))) {
	+		DEFINE_WAIT(wait);
	+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
	+		while (atomic_read(&b->count))
	+			schedule();
	+		finish_wait(&b->wait, &wait);
	+	}
	+}

This should be an open-coded wait_event(), but it is wrong! With the
scenario above this can hang forever, because the first wake_up removes
the task from b->wait.

Oleg.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On 02/01, Oleg Nesterov wrote:
> 	+static inline void barrier_sync(struct barrier *b)
> 	+{
> 	+	might_sleep();
> 	+
> 	+	if (unlikely(atomic_read(&b->count))) {
> 	+		DEFINE_WAIT(wait);
> 	+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> 	+		while (atomic_read(&b->count))
> 	+			schedule();
> 	+		finish_wait(&b->wait, &wait);
> 	+	}
> 	+}
>
> This should be an open-coded wait_event(), but it is wrong! With the
> scenario above this can hang forever, because the first wake_up
> removes the task from b->wait.

I am afraid I was not clear (as usual :).

prepare_to_wait means autoremove_wake_function. So, if barrier_unlock()
wakes up the caller of barrier_sync(), it will be removed from b->wait.
If it then goes to schedule() because another barrier_lock() incremented
b->count, we can't wake it via __wake_up(&b->wait, ...).

Oleg.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > This barrier thing is constructed so that it will not write in
> > > > > the sync() condition (the hot path) when there are no active
> > > > > lock sections; thus avoiding cacheline bouncing. -- I'm just
> > > > > not sure how this will work out in relation to PI. We might
> > > > > track those in the barrier scope and boost those by the max
> > > > > prio of the blockers.
> > > >
> > > > Is this really needed?  We seem to grow new funky locking
> > > > algorithms exponentially, while people already have a hard time
> > > > understanding the existing ones.
> > >
> > > yes, it's needed.
> >
> > Would it be possible to come up with something common between this
> > primitive and the one that Oleg Nesterov put together for Jens Axboe?
> >
> > 	http://lkml.org/lkml/2006/11/29/330
> >
> > Oleg's approach acquires a lock on the update side, which Peter would
> > not want in the uncontended case -- but perhaps there is some way to
> > make Oleg's approach be able to safely test both counters so as to
> > avoid acquiring the lock if there are no readers.  Oleg, any chance
> > of this working?  I believe it does, but have not thought it through
> > fully.
>
> I think no. From a quick reading, barrier_sync() and qrcu/srcu are
> quite different. Consider:
>
> 	barrier_lock()
> 				barrier_sync();
> 	barrier_unlock();
> 				... wake up ...
> 	barrier_lock();
> 				schedule again
>
> The last "schedule again" would be a BUG for qrcu/srcu, but probably
> it is ok for barrier_sync().

Yes, that would be ok.

> It looks like barrier_sync() is more a rw semaphore biased to readers.

Indeed, the locked sections are designed to be the rare case.

> A couple of minor off-topic notes,
>
> 	+static inline void barrier_unlock(struct barrier *b)
> 	+{
> 	+	smp_wmb();
> 	+	if (atomic_dec_and_test(&b->count))
> 	+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
>
> This is wake_up_all(&b->wait), yes? I don't understand why key == b,
> it could be NULL.
>
> 	+static inline void barrier_sync(struct barrier *b)
> 	+{
> 	+	might_sleep();
> 	+
> 	+	if (unlikely(atomic_read(&b->count))) {
> 	+		DEFINE_WAIT(wait);
> 	+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> 	+		while (atomic_read(&b->count))
> 	+			schedule();
> 	+		finish_wait(&b->wait, &wait);
> 	+	}
> 	+}
>
> This should be an open-coded wait_event(), but it is wrong! With the
> scenario above this can hang forever, because the first wake_up
> removes the task from b->wait.

This would be me struggling with the waitqueue API, it's all a tad
confusing at first look.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Wed, Jan 31, 2007 at 10:48:21PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> > On 01/31, Paul E. McKenney wrote:
> > > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > > * Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > > This barrier thing is constructed so that it will not write
> > > > > > in the sync() condition (the hot path) when there are no
> > > > > > active lock sections; thus avoiding cacheline bouncing. --
> > > > > > I'm just not sure how this will work out in relation to PI.
> > > > > > We might track those in the barrier scope and boost those by
> > > > > > the max prio of the blockers.
> > > > >
> > > > > Is this really needed?  We seem to grow new funky locking
> > > > > algorithms exponentially, while people already have a hard
> > > > > time understanding the existing ones.
> > > >
> > > > yes, it's needed.
> > >
> > > Would it be possible to come up with something common between this
> > > primitive and the one that Oleg Nesterov put together for Jens
> > > Axboe?
> > >
> > > 	http://lkml.org/lkml/2006/11/29/330
> > >
> > > Oleg's approach acquires a lock on the update side, which Peter
> > > would not want in the uncontended case -- but perhaps there is
> > > some way to make Oleg's approach be able to safely test both
> > > counters so as to avoid acquiring the lock if there are no
> > > readers.  Oleg, any chance of this working?  I believe it does,
> > > but have not thought it through fully.
> >
> > I think no. From a quick reading, barrier_sync() and qrcu/srcu are
> > quite different. Consider:
> >
> > 	barrier_lock()
> > 				barrier_sync();
> > 	barrier_unlock();
> > 				... wake up ...
> > 	barrier_lock();
> > 				schedule again
> >
> > The last "schedule again" would be a BUG for qrcu/srcu, but probably
> > it is ok for barrier_sync().
>
> Yes, that would be ok.

The wakeup in barrier_sync() would mean that the counter was zero at
some point in the past.  The counter would then be rechecked, and if it
were still zero, barrier_sync() would invoke finish_wait() and then
return -- but the counter might well become non-zero in the meantime,
right?

So given that barrier_sync() is permitted to return after the counter
becomes non-zero, why can't it just rely on the fact that
barrier_unlock() saw it as zero not long in the past?

> > It looks like barrier_sync() is more a rw semaphore biased to
> > readers.
>
> Indeed, the locked sections are designed to be the rare case.

OK -- but barrier_sync() just waits for readers, it doesn't exclude
them.  If all barrier_sync() needs to do is wait for all pre-existing
barrier_lock()/barrier_unlock() pairs to complete, it seems to me to be
compatible with qrcu's semantics.  So what am I missing here?

						Thanx, Paul

> > A couple of minor off-topic notes,
> >
> > 	+static inline void barrier_unlock(struct barrier *b)
> > 	+{
> > 	+	smp_wmb();
> > 	+	if (atomic_dec_and_test(&b->count))
> > 	+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> >
> > This is wake_up_all(&b->wait), yes? I don't understand why key == b,
> > it could be NULL.
> >
> > 	+static inline void barrier_sync(struct barrier *b)
> > 	+{
> > 	+	might_sleep();
> > 	+
> > 	+	if (unlikely(atomic_read(&b->count))) {
> > 	+		DEFINE_WAIT(wait);
> > 	+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> > 	+		while (atomic_read(&b->count))
> > 	+			schedule();
> > 	+		finish_wait(&b->wait, &wait);
> > 	+	}
> > 	+}
> >
> > This should be an open-coded wait_event(), but it is wrong! With the
> > scenario above this can hang forever, because the first wake_up
> > removes the task from b->wait.
>
> This would be me struggling with the waitqueue API, it's all a tad
> confusing at first look.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:
> The wakeup in barrier_sync() would mean that the counter was zero at
> some point in the past.  The counter would then be rechecked, and if
> it were still zero, barrier_sync() would invoke finish_wait() and then
> return -- but the counter might well become non-zero in the meantime,
> right?
>
> So given that barrier_sync() is permitted to return after the counter
> becomes non-zero, why can't it just rely on the fact that
> barrier_unlock() saw it as zero not long in the past?
>
> > > It looks like barrier_sync() is more a rw semaphore biased to
> > > readers.
> >
> > Indeed, the locked sections are designed to be the rare case.
>
> OK -- but barrier_sync() just waits for readers, it doesn't exclude
> them.  If all barrier_sync() needs to do is wait for all pre-existing
> barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
> be compatible with qrcu's semantics.  So what am I missing here?

I might be the one missing stuff, I'll have a hard look at qrcu.

The intent was that barrier_sync() would not write to memory when there
are no active locked sections, so that the cacheline can stay shared,
thus keeping it fast.

If qrcu does exactly this, then yes we have a match.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Thu, Feb 01, 2007 at 01:03:09AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:
> > The wakeup in barrier_sync() would mean that the counter was zero at
> > some point in the past.  The counter would then be rechecked, and if
> > it were still zero, barrier_sync() would invoke finish_wait() and
> > then return -- but the counter might well become non-zero in the
> > meantime, right?
> >
> > So given that barrier_sync() is permitted to return after the
> > counter becomes non-zero, why can't it just rely on the fact that
> > barrier_unlock() saw it as zero not long in the past?
> >
> > > > It looks like barrier_sync() is more a rw semaphore biased to
> > > > readers.
> > >
> > > Indeed, the locked sections are designed to be the rare case.
> >
> > OK -- but barrier_sync() just waits for readers, it doesn't exclude
> > them.  If all barrier_sync() needs to do is wait for all
> > pre-existing barrier_lock()/barrier_unlock() pairs to complete, it
> > seems to me to be compatible with qrcu's semantics.  So what am I
> > missing here?
>
> I might be the one missing stuff, I'll have a hard look at qrcu.
>
> The intent was that barrier_sync() would not write to memory when
> there are no active locked sections, so that the cacheline can stay
> shared, thus keeping it fast.
>
> If qrcu does exactly this, then yes we have a match.

QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
do what you want, as it acquires the lock unconditionally.  I am
proposing that synchronize_qrcu() change to something like the
following:

	void synchronize_qrcu(struct qrcu_struct *qp)
	{
		int idx;

		smp_mb();

		if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
			smp_rmb();
			if (atomic_read(&qp->ctr[0]) +
			    atomic_read(&qp->ctr[1]) <= 1)
				goto out;
		}

		mutex_lock(&qp->mutex);
		idx = qp->completed & 0x1;
		atomic_inc(&qp->ctr[idx ^ 0x1]);
		/* Reduce the likelihood that qrcu_read_lock() will loop */
		smp_mb__after_atomic_inc();
		qp->completed++;
		atomic_dec(&qp->ctr[idx]);
		__wait_event(qp->wq, !atomic_read(&qp->ctr[idx]));
		mutex_unlock(&qp->mutex);
	out:
		smp_mb();
	}

For the first "if" to give a false positive, a concurrent switch had to
have happened.  For example, qp->ctr[0] was zero and qp->ctr[1] was two
at the time of the first atomic_read(), but then qp->completed switched
so that both qp->ctr[0] and qp->ctr[1] were one at the time of the
second atomic_read().  The only way the second "if" can give us a false
positive is if there was another change to qp->completed in the
meantime -- but that means that all of the pre-existing
qrcu_read_lock() holders must have gotten done, otherwise the second
switch could not have happened.

Yes, you do incur three memory barriers on the fast path, but the best
you could hope for with your approach was two of them (unless I am
confused about how you were using barrier_sync()).

Oleg, does this look safe?

Ugly at best, I know, but I do very much sympathize with Christoph's
desire to keep the number of synchronization primitives down to a dull
roar.  ;-)

						Thanx, Paul
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > This barrier thing is constructed so that it will not write in the
> > > sync() condition (the hot path) when there are no active lock
> > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > this will work out in relation to PI. We might track those in the
> > > barrier scope and boost those by the max prio of the blockers.
> >
> > Is this really needed?  We seem to grow new funky locking algorithms
> > exponentially, while people already have a hard time understanding the
> > existing ones.
>
> yes, it's needed.

Thanks for the wonderful and in-depth explanation.
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
* Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > This barrier thing is constructed so that it will not write in the
> > sync() condition (the hot path) when there are no active lock
> > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > this will work out in relation to PI. We might track those in the
> > barrier scope and boost those by the max prio of the blockers.
>
> Is this really needed?  We seem to grow new funky locking algorithms
> exponentially, while people already have a hard time understanding the
> existing ones.

yes, it's needed.

	Ingo
Re: [PATCH 3/7] barrier: a scalable synchonisation barrier
On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> This barrier thing is constructed so that it will not write in the
> sync() condition (the hot path) when there are no active lock
> sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> this will work out in relation to PI. We might track those in the
> barrier scope and boost those by the max prio of the blockers.

Is this really needed?  We seem to grow new funky locking algorithms
exponentially, while people already have a hard time understanding the
existing ones.