Re: Next round: revised futex(2) man page for review
On Wed, Oct 07, 2015 at 10:34:19AM +0100, Michael Kerrisk (man-pages) wrote: > On 08/19/2015 03:40 PM, Thomas Gleixner wrote: > > On Wed, 5 Aug 2015, Darren Hart wrote: > >> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) > >> wrote: > >>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text = > >>> .\" The following text is drawn from the Hart/Guniguntala paper > >>> .\" (listed in SEE ALSO), but I have reworded some pieces > >>> .\" significantly. Please check it. > >>> > >>>The PI futex operations described below differ from the other > >>>futex operations in that they impose policy on the use of the > >>>value of the futex word: > >>> > >>>* If the lock is not acquired, the futex word's value shall be > >>> 0. > >>> > >>>* If the lock is acquired, the futex word's value shall be the > >>> thread ID (TID; see gettid(2)) of the owning thread. > >>> > >>>* If the lock is owned and there are threads contending for the > >>> lock, then the FUTEX_WAITERS bit shall be set in the futex > >>> word's value; in other words, this value is: > >>> > >>> FUTEX_WAITERS | TID > >>> > >>> > >>>Note that a PI futex word never just has the value FUTEX_WAITERS, > >>>which is a permissible state for non-PI futexes. > >> > >> The second clause is inappropriate. I don't know if that was yours or > >> mine, but non-PI futexes do not have a kernel defined value policy, so > >> ==FUTEX_WAITERS cannot be a "permissible state" as any value is > >> permissible for non-PI futexes, and none have a kernel defined state. > > > > Depends. If the regular futex is configured as robust, then we have a > > kernel defined value policy as well. > Right. > Okay -- so do we need a change to the text here? Hrm. We probably need a way to indicate that kernel-defined futex word value policy only applies to PI and or ROBUST futexes. > > >>> .\" FIXME I'm not quite clear on the meaning of the following sentence. 
> >>> .\" Is this trying to say that while blocked in a > >>> .\" FUTEX_WAIT_REQUEUE_PI, it could happen that another > >>> .\" task does a FUTEX_WAKE on uaddr that simply causes > >>> .\" a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI > >>> .\" does not complete? What happens then to the > >>> FUTEX_WAIT_REQUEUE_PI > >>> .\" opertion? Does it remain blocked, or does it unblock > >>> .\" In which case, what does user space see? > >>> > >>> The > >>> waiter can be removed from the wait on uaddr via > >>> FUTEX_WAKE without requeueing on uaddr2. > >> > >> Userspace should see the task wake and continue executing. This would > >> effectively be a cancelation operation - which I didn't think was > >> supported. Thomas? > > > > We probably never intended to support it, but looking at the code it > > works (did not try it though). It returns to user space with > > -EWOULDBLOCK. So it basically behaves like any other spurious wakeup. > > Again, I assume no changes are required to the man page(?). I'd rather not document this as supported or intended behavior. FUTEX_WAIT_REQUEUE_PI is documented as being paired with and only with FUTEX_CMP_REQUEUE_PI. Anything else is undefined behavior. If we want to support a cancelation, it should be deliberate - and we should probably test it ;-) -- Darren Hart Intel Open Source Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Next round: revised futex(2) man page for review
On Wed, Oct 07, 2015 at 09:30:46AM +0100, Michael Kerrisk (man-pages) wrote: > Hello Thomas, > > Thanks for the follow up! > > Some open questions below are marked with the string ###. A couple of comments from me below, although I suspect you have this much covered already. > > On 08/19/2015 04:17 PM, Thomas Gleixner wrote: > > On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote: > FUTEX_CMP_REQUEUE (since Linux 2.6.7) > This operation first checks whether the location uaddr > still contains the value val3. If not, the operation > fails with the error EAGAIN. Otherwise, the operation > wakes up a maximum of val waiters that are waiting on the > futex at uaddr. If there are more than val waiters, then > the remaining waiters are removed from the wait queue of > the source futex at uaddr and added to the wait queue of > the target futex at uaddr2. The val2 argument specifies > an upper limit on the number of waiters that are requeued > to the futex at uaddr2. > > .\" FIXME(Torvald) Is the following correct? Or is just the decision > .\" which threads to wake or requeue part of the atomic operation? > > The load from uaddr is an atomic memory access (i.e., > using atomic machine instructions of the respective archi‐ > tecture). This load, the comparison with val3, and the > requeueing of any waiters are performed atomically and > totally ordered with respect to other operations on the > same futex word. > >>> > >>> It's atomic as the other atomic operations on the futex word. It's > >>> always performed with the proper lock(s) held in the kernel. That > >>> means any concurrent operation will serialize on that lock(s). User > >>> space has to make sure, that depending on the observed value no > >>> concurrent operations happen, but that's something the kernel cannot > >>> control. > >> > >> ??? > >> Sorry, I'm not clear here. Is the current text correct then? Or is some > >> change needed. 
> > > > I think we need some change here because the meaning of atomic is > > unclear. The basic rules of futexes are: > > > > - All modifying operations on the futex value have to be done with > >atomic instructions, usually cmpxchg. That applies to both kernel > >and user space. > > > >That's the atomicity at the futex value level. > > > > - In the kernel we have to create/modify/destroy state in order to > >provide the blocking/requeueing etc. > > > >This state needs protection as well. So all operations related to > >the kernel internal state are serialized on the hash bucket > >locks. The hash buckets are a scalability mechanism to avoid > >contention on a single lock protecting all kernel internal > >state. For simplicity reasons you can just think of a global lock > >protecting all kernel internal state. > > > >If the kernel creates/modifies state then it can be necessary to > >either reread the futex value or modify it. That happens under the > >locks as well. > > > >So in the case of requeue, we take the proper locks and perform the > >comparison with val3 and the requeueing with the locks held. > > > >So that lock protection makes these operations 'atomic'. The > >correct expression is 'serialized'. > > ### > So, here, i think I need some specific pointers on the precise text > changes that are required. Let's talk about this f2f. For convenience, > here's the relevant text once again quoted: Not speaking for tglx, but I think the point here is to distinguish between atomic (as in cmpxchg comparison tests performed on the futex word) and serialized (as in the management of futex hashbuckets and task states). > >FUTEX_CMP_REQUEUE (since Linux 2.6.7) > This operation first checks whether the location uaddr > still contains the value val3. If not, the operation > fails with the error EAGAIN. 
Otherwise, the operation Here you might explain the _CMP_ qualifier and note the atomicity of the operation: The _CMP_ refers to the verification of the userspace state as specified through the arguments. This operation first atomically compares the value at uaddr with the value val3 ... > wakes up a maximum of val waiters that are waiting on the > futex at uaddr. If there are more than val waiters, then > the remaining waiters are removed from the wait queue of > the source futex at uaddr and added to the wait queue of > the target futex at uaddr2. The val2 argument specifies > an upper limit on the number
Re: Next round: revised futex(2) man page for review
On 08/19/2015 03:40 PM, Thomas Gleixner wrote: > On Wed, 5 Aug 2015, Darren Hart wrote: >> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote: >>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text = >>> .\" The following text is drawn from the Hart/Guniguntala paper >>> .\" (listed in SEE ALSO), but I have reworded some pieces >>> .\" significantly. Please check it. >>> >>>The PI futex operations described below differ from the other >>>futex operations in that they impose policy on the use of the >>>value of the futex word: >>> >>>* If the lock is not acquired, the futex word's value shall be >>> 0. >>> >>>* If the lock is acquired, the futex word's value shall be the >>> thread ID (TID; see gettid(2)) of the owning thread. >>> >>>* If the lock is owned and there are threads contending for the >>> lock, then the FUTEX_WAITERS bit shall be set in the futex >>> word's value; in other words, this value is: >>> >>> FUTEX_WAITERS | TID >>> >>> >>>Note that a PI futex word never just has the value FUTEX_WAITERS, >>>which is a permissible state for non-PI futexes. >> >> The second clause is inappropriate. I don't know if that was yours or >> mine, but non-PI futexes do not have a kernel defined value policy, so >> ==FUTEX_WAITERS cannot be a "permissible state" as any value is >> permissible for non-PI futexes, and none have a kernel defined state. > > Depends. If the regular futex is configured as robust, then we have a > kernel defined value policy as well. Okay -- so do we need a change to the text here? >>> .\" FIXME I'm not quite clear on the meaning of the following sentence. >>> .\" Is this trying to say that while blocked in a >>> .\" FUTEX_WAIT_REQUEUE_PI, it could happen that another >>> .\" task does a FUTEX_WAKE on uaddr that simply causes >>> .\" a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI >>> .\" does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI >>> .\" opertion? 
Does it remain blocked, or does it unblock >>> .\" In which case, what does user space see? >>> >>> The >>> waiter can be removed from the wait on uaddr via >>> FUTEX_WAKE without requeueing on uaddr2. >> >> Userspace should see the task wake and continue executing. This would >> effectively be a cancelation operation - which I didn't think was >> supported. Thomas? > > We probably never intended to support it, but looking at the code it > works (did not try it though). It returns to user space with > -EWOULDBLOCK. So it basically behaves like any other spurious wakeup. Again, I assume no changes are required to the man page(?). Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
Re: Next round: revised futex(2) man page for review
Hello Thomas, Thanks for the follow up! Some open questions below are marked with the string ###. On 08/19/2015 04:17 PM, Thomas Gleixner wrote: > On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote: FUTEX_CMP_REQUEUE (since Linux 2.6.7) This operation first checks whether the location uaddr still contains the value val3. If not, the operation fails with the error EAGAIN. Otherwise, the operation wakes up a maximum of val waiters that are waiting on the futex at uaddr. If there are more than val waiters, then the remaining waiters are removed from the wait queue of the source futex at uaddr and added to the wait queue of the target futex at uaddr2. The val2 argument specifies an upper limit on the number of waiters that are requeued to the futex at uaddr2. .\" FIXME(Torvald) Is the following correct? Or is just the decision .\" which threads to wake or requeue part of the atomic operation? The load from uaddr is an atomic memory access (i.e., using atomic machine instructions of the respective archi‐ tecture). This load, the comparison with val3, and the requeueing of any waiters are performed atomically and totally ordered with respect to other operations on the same futex word. >>> >>> It's atomic as the other atomic operations on the futex word. It's >>> always performed with the proper lock(s) held in the kernel. That >>> means any concurrent operation will serialize on that lock(s). User >>> space has to make sure, that depending on the observed value no >>> concurrent operations happen, but that's something the kernel cannot >>> control. >> >> ??? >> Sorry, I'm not clear here. Is the current text correct then? Or is some >> change needed. > > I think we need some change here because the meaning of atomic is > unclear. The basic rules of futexes are: > > - All modifying operations on the futex value have to be done with >atomic instructions, usually cmpxchg. That applies to both kernel >and user space. > >That's the atomicity at the futex value level. 
> > - In the kernel we have to create/modify/destroy state in order to >provide the blocking/requeueing etc. > >This state needs protection as well. So all operations related to >the kernel internal state are serialized on the hash bucket >locks. The hash buckets are a scalability mechanism to avoid >contention on a single lock protecting all kernel internal >state. For simplicity reasons you can just think of a global lock >protecting all kernel internal state. > >If the kernel creates/modifies state then it can be necessary to >either reread the futex value or modify it. That happens under the >locks as well. > >So in the case of requeue, we take the proper locks and perform the >comparison with val3 and the requeueing with the locks held. > >So that lock protection makes these operations 'atomic'. The >correct expression is 'serialized'. ### So, here, i think I need some specific pointers on the precise text changes that are required. Let's talk about this f2f. For convenience, here's the relevant text once again quoted: FUTEX_CMP_REQUEUE (since Linux 2.6.7) This operation first checks whether the location uaddr still contains the value val3. If not, the operation fails with the error EAGAIN. Otherwise, the operation wakes up a maximum of val waiters that are waiting on the futex at uaddr. If there are more than val waiters, then the remaining waiters are removed from the wait queue of the source futex at uaddr and added to the wait queue of the target futex at uaddr2. The val2 argument specifies an upper limit on the number of waiters that are requeued to the futex at uaddr2. The load from uaddr is an atomic memory access (i.e., using atomic machine instructions of the respective archi‐ tecture). This load, the comparison with val3, and the requeueing of any waiters are performed atomically and totally ordered with respect to other operations on the same futex word. 
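As an editorial aid, the FUTEX_CMP_REQUEUE steps quoted above (compare with val3, wake at most val waiters, requeue at most val2 of the rest) can be sketched as a small user-space model. This is only a sketch of the described semantics, not kernel code: the function `cmp_requeue_model` and the type `struct requeue_result` are hypothetical names introduced here, and the model ignores the locking that the thread is discussing.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct requeue_result {
    int err;       /* 0 on success, EAGAIN if the word differs from val3 */
    int woken;     /* waiters woken from the source queue (uaddr) */
    int requeued;  /* waiters moved to the target queue (uaddr2) */
};

/*
 * User-space model of the FUTEX_CMP_REQUEUE description quoted above.
 *   word:    current value of the futex word at uaddr
 *   waiters: number of tasks currently waiting on uaddr
 *   val:     maximum number of waiters to wake
 *   val2:    maximum number of waiters to requeue to uaddr2
 *   val3:    value the futex word is expected to contain
 */
static struct requeue_result
cmp_requeue_model(uint32_t word, int waiters, int val, int val2, uint32_t val3)
{
    struct requeue_result r = { 0, 0, 0 };

    assert(waiters >= 0 && val >= 0 && val2 >= 0);

    /* Step 1: the _CMP_ part. A mismatch fails with EAGAIN. */
    if (word != val3) {
        r.err = EAGAIN;
        return r;
    }

    /* Step 2: wake at most val waiters. */
    r.woken = waiters < val ? waiters : val;

    /* Step 3: requeue at most val2 of the remaining waiters. */
    int remaining = waiters - r.woken;
    r.requeued = remaining < val2 ? remaining : val2;

    return r;
}
```

In the real operation all three steps run with the kernel's hash bucket locks held, which is the "serialized" property the thread is trying to word correctly; the model above only captures the arithmetic.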
.\" FIXME We need some explanation in the following paragraph of *why* .\" it is important to note that "the kernel will update the .\" futex word's value prior to returning to user space". Can someone explain? It is important to note that the kernel will update the futex word's value prior
Re: Next round: revised futex(2) man page for review
On Thu, Aug 20, 2015 at 01:17:03AM +0200, Thomas Gleixner wrote: ... > > >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart > > >> .\" made the observation that "EINVAL is returned if the non-pi > > >> .\" to pi or op pairing semantics are violated." > > >> .\" Probably there needs to be a general statement about this > > >> .\" requirement, probably located at about this point in the page. > > >> .\" Darren (or someone else), care to take a shot at this? > > > > > > Well, that's hard to describe because the kernel only has a limited > > > way of detecting such mismatches. It only can detect it when there are > > > non PI waiters on a futex and a PI function is called or vice versa. > > > > Hmmm. Okay, I filed your comments away for reference, but > > hopefully someone can help with some actual text. > > I let Darren come up with something sensible :) Heh, right, no pressure then... I responded to Michael on this recently, copied here for reference: FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match (meaning it's the same futex word - shared memory, inode, etc.). This can't happen if the stated policy of requeueing from non-pi to pi is followed as the same word cannot be both non-pi and pi at the same time, requiring them to be unique futex words. FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex word. Also, if nr_wake != 1. But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must requeue uaddr to the same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call. FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it, and they must agree on uaddr and uaddr2. Michael, are you still looking for something more from me, or is this FIXME now complete? 
-- Darren Hart Intel Open Source Technology Center
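The argument rules Darren lists above (distinct uaddr and uaddr2, nr_wake equal to 1, and both calls agreeing on the address pair) can be sketched as a user-space pre-check. This is an illustration under stated assumptions, not the kernel's logic: `check_requeue_pi_pairing` is a hypothetical helper, it compares plain addresses whereas the kernel compares futex keys (which can match for shared futexes even when addresses differ), and returning -EINVAL for a disagreeing pair is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space pre-check; not a kernel or libc function.
 * Returns 0 if the arguments respect the rules described in the thread,
 * or -EINVAL otherwise.
 *
 *   wait_uaddr, wait_uaddr2: pair passed to FUTEX_WAIT_REQUEUE_PI
 *   req_uaddr,  req_uaddr2:  pair passed to FUTEX_CMP_REQUEUE_PI
 *   nr_wake:                 val argument of FUTEX_CMP_REQUEUE_PI
 */
static int
check_requeue_pi_pairing(const uint32_t *wait_uaddr, const uint32_t *wait_uaddr2,
                         const uint32_t *req_uaddr, const uint32_t *req_uaddr2,
                         int nr_wake)
{
    assert(wait_uaddr != NULL && wait_uaddr2 != NULL);
    assert(req_uaddr != NULL && req_uaddr2 != NULL);

    /* The non-PI source and the PI target must be distinct futex words. */
    if (wait_uaddr == wait_uaddr2 || req_uaddr == req_uaddr2)
        return -EINVAL;

    /* FUTEX_CMP_REQUEUE_PI is described as failing if nr_wake != 1. */
    if (nr_wake != 1)
        return -EINVAL;

    /* Sketch-only choice: treat a disagreeing address pair as an error. */
    if (wait_uaddr != req_uaddr || wait_uaddr2 != req_uaddr2)
        return -EINVAL;

    return 0;
}
```

A caller could run such a check in debug builds before issuing the real operations; it cannot replace the kernel's own validation.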
Re: Next round: revised futex(2) man page for review
On Sat, Aug 08, 2015 at 08:57:35AM +0200, Michael Kerrisk (man-pages) wrote: ... > >> .\" FIXME = End of adapted Hart/Guniguntala text = > >> > >> > >> > >> .\" FIXME We need some explanation in the following paragraph of *why* > >> .\" it is important to note that "the kernel will update the > >> .\" futex word's value prior > >>It is important to note to returning to user space" . Can someone > >>explain? that the kernel will update the futex word's value > >>prior to returning to user space. Unlike the other futex opera‐ > >>tions described above, the PI futex operations are designed for > >>the implementation of very specific IPC mechanisms. > > > > If the kernel didn't perform the update prior to returning to userspace, > > we could end up in an invalid state. Such as having an owner, but the > > value being 0. Or having waiters, but not having FUTEX_WAITERS set. > > So I've now reworked this passage to read: > >It is important to note that the kernel will update the futex >word's value prior to returning to user space. (This prevents >the possibility of the futex word's value ending up in an invalid >state, such as having an owner but the value being 0, or having >waiters but not having the FUTEX_WAITERS bit set.) > > Okay? Yes. > > >> .\" > >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart > >> .\" made the observation that "EINVAL is returned if the non-pi > >> .\" to pi or op pairing semantics are violated." > >> .\" Probably there needs to be a general statement about this > >> .\" requirement, probably located at about this point in the page. > >> .\" Darren (or someone else), care to take a shot at this? > > > > We can probably borrow from either the futex.c comments or the > > futex-requeue-pi.txt in Documentation. Also, it is important to note > > that the PI requeue operations require two distinct uadders (although > > that is implied by requiring "non-pi to pi" as a futex cannot be both. > > > > Or... 
perhaps something like: > > > > Due to the kernel imposed futex word value policy, PI futex > > operations have additional usage requirements: > > > > FUTEX_WAIT_REQUEUE_PI must be paired with FUTEX_CMP_REQUEUE_PI > > and be performed from a non-pi futex to a distinct pi futex. > > Failing to do so will return EINVAL. > > For which operation does the EINVAL occur: FUTEX_WAIT_REQUEUE_PI or > FUTEX_CMP_REQUEUE_PI? FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match (meaning it's the same futex word - shared memory, inode, etc.). This can't happen if the stated policy of requeueing from non-pi to pi is followed as the same word cannot be both non-pi and pi at the same time, requiring them to be unique futex words. FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex word. Also, if nr_wake != 1. But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must requeue uaddr to the same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call. FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it, and they must agree on uaddr and uaddr2. ... > > And their PRIVATE counterparts of course (which is assumed as it is a > > flag to the opcode). > > Yes. But I don't think that needs to be called out explicitly here (?). Agreed. > > >> .\" FIXME XXX = Start of adapted Hart/Guniguntala text = > >> .\" The following text is drawn from the Hart/Guniguntala paper > >> .\" (listed in SEE ALSO), but I have reworded some pieces > >> .\" significantly. Please check it. > >> > >>The PI futex operations described below differ from the other > >>futex operations in that they impose policy on the use of the > >>value of the futex word: > >> > >>* If the lock is not acquired, the futex word's value shall be > >> 0.
> >> > >>* If the lock is acquired, the futex word's value shall be the > >> thread ID (TID; see gettid(2)) of the owning thread. > >> > >>* If the lock is owned and there are threads contending for the > >> lock, then the FUTEX_WAITERS bit shall be set in the futex > >> word's value; in other words, this value is: > >> > >> FUTEX_WAITERS | TID > >> > >> > >>Note that a PI futex word never just has the value FUTEX_WAITERS, > >>which is a permissible state for non-PI futexes. > > > > The second clause is inappropriate. I don't know if that was yours or > > mine, but non-PI futexes do not have a kernel defined value policy, so > > ==FUTEX_WAITERS cannot be a "permissible state" as any value is > > permissible for non-PI futexes, and none have a kernel defined state. > > >
Re: Next round: revised futex(2) man page for review
On Thu, Aug 20, 2015 at 12:40:46AM +0200, Thomas Gleixner wrote: > On Wed, 5 Aug 2015, Darren Hart wrote: > > On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote: > > > .\" FIXME XXX = Start of adapted Hart/Guniguntala text = > > > .\" The following text is drawn from the Hart/Guniguntala paper > > > .\" (listed in SEE ALSO), but I have reworded some pieces > > > .\" significantly. Please check it. > > > > > >The PI futex operations described below differ from the other > > >futex operations in that they impose policy on the use of the > > >value of the futex word: > > > > > >* If the lock is not acquired, the futex word's value shall be > > > 0. > > > > > >* If the lock is acquired, the futex word's value shall be the > > > thread ID (TID; see gettid(2)) of the owning thread. > > > > > >* If the lock is owned and there are threads contending for the > > > lock, then the FUTEX_WAITERS bit shall be set in the futex > > > word's value; in other words, this value is: > > > > > > FUTEX_WAITERS | TID > > > > > > > > >Note that a PI futex word never just has the value FUTEX_WAITERS, > > >which is a permissible state for non-PI futexes. > > > > The second clause is inappropriate. I don't know if that was yours or > > mine, but non-PI futexes do not have a kernel defined value policy, so > > ==FUTEX_WAITERS cannot be a "permissible state" as any value is > > permissible for non-PI futexes, and none have a kernel defined state. > > Depends. If the regular futex is configured as robust, then we have a > kernel defined value policy as well. Indeed, thanks for catching that. -- Darren Hart Intel Open Source Technology Center
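For readers following the value policy quoted in this message, the three permitted PI futex word states (0, TID, and FUTEX_WAITERS | TID) can be sketched as a few decoding helpers. The helper names are hypothetical and exist only for this illustration. The constants are defined locally with the values used by <linux/futex.h> so that the sketch does not depend on kernel headers. The FUTEX_OWNER_DIED bit that comes into play with robust futexes is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as in <linux/futex.h>; defined here to keep the sketch portable. */
#ifndef FUTEX_WAITERS
#define FUTEX_WAITERS  0x80000000u
#endif
#ifndef FUTEX_TID_MASK
#define FUTEX_TID_MASK 0x3fffffffu
#endif

/* TID of the owning thread, or 0 if the lock is not acquired. */
static uint32_t pi_word_owner(uint32_t word)
{
    return word & FUTEX_TID_MASK;
}

/* True if the FUTEX_WAITERS bit is set. */
static bool pi_word_has_waiters(uint32_t word)
{
    return (word & FUTEX_WAITERS) != 0;
}

/* Word value for "owned by tid, with contending threads". */
static uint32_t pi_word_contended(uint32_t tid)
{
    assert(tid != 0 && (tid & ~FUTEX_TID_MASK) == 0);
    return FUTEX_WAITERS | tid;
}

/*
 * True if the word is in one of the three states listed in the quoted
 * text: 0, TID, or FUTEX_WAITERS | TID. A word equal to just
 * FUTEX_WAITERS is rejected, as is any word with other bits set
 * (robust-futex state is outside the scope of this sketch).
 */
static bool pi_word_matches_policy(uint32_t word)
{
    if (word == 0)
        return true;
    if (pi_word_owner(word) == 0)
        return false;
    return (word & ~(FUTEX_WAITERS | FUTEX_TID_MASK)) == 0;
}
```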
Re: Next round: revised futex(2) man page for review
On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote: > >>FUTEX_CMP_REQUEUE (since Linux 2.6.7) > >> This operation first checks whether the location uaddr > >> still contains the value val3. If not, the operation > >> fails with the error EAGAIN. Otherwise, the operation > >> wakes up a maximum of val waiters that are waiting on the > >> futex at uaddr. If there are more than val waiters, then > >> the remaining waiters are removed from the wait queue of > >> the source futex at uaddr and added to the wait queue of > >> the target futex at uaddr2. The val2 argument specifies > >> an upper limit on the number of waiters that are requeued > >> to the futex at uaddr2. > >> > >> .\" FIXME(Torvald) Is the following correct? Or is just the decision > >> .\" which threads to wake or requeue part of the atomic operation? > >> > >> The load from uaddr is an atomic memory access (i.e., > >> using atomic machine instructions of the respective archi‐ > >> tecture). This load, the comparison with val3, and the > >> requeueing of any waiters are performed atomically and > >> totally ordered with respect to other operations on the > >> same futex word. > > > > It's atomic as the other atomic operations on the futex word. It's > > always performed with the proper lock(s) held in the kernel. That > > means any concurrent operation will serialize on that lock(s). User > > space has to make sure, that depending on the observed value no > > concurrent operations happen, but that's something the kernel cannot > > control. > > ??? > Sorry, I'm not clear here. Is the current text correct then? Or is some > change needed. I think we need some change here because the meaning of atomic is unclear. The basic rules of futexes are: - All modifying operations on the futex value have to be done with atomic instructions, usually cmpxchg. That applies to both kernel and user space. That's the atomicity at the futex value level. 
- In the kernel we have to create/modify/destroy state in order to provide the blocking/requeueing etc. This state needs protection as well. So all operations related to the kernel internal state are serialized on the hash bucket locks. The hash buckets are a scalability mechanism to avoid contention on a single lock protecting all kernel internal state. For simplicity reasons you can just think of a global lock protecting all kernel internal state. If the kernel creates/modifies state then it can be necessary to either reread the futex value or modify it. That happens under the locks as well. So in the case of requeue, we take the proper locks and perform the comparison with val3 and the requeueing with the locks held. So that lock protection makes these operations 'atomic'. The correct expression is 'serialized'. > >> .\" FIXME We need some explanation in the following paragraph of *why* > >> .\" it is important to note that "the kernel will update the > >> .\" futex word's value prior > >>It is important to note to returning to user space" . Can someone > >>explain? that the kernel will update the futex word's value > >>prior to returning to user space. Unlike the other futex opera‐ > >>tions described above, the PI futex operations are designed for > >>the implementation of very specific IPC mechanisms. > > > > If there are multiple waiters on a pi futex then a wake pi operation > > will wake the first waiter and hand over the lock to this waiter. This > > includes handing over the rtmutex which represents the futex in the > > kernel. The strict requirement is that the futex owner and the rtmutex > > owner must be the same, except for the update period which is > > serialized by the futex internal locking. That means the kernel must > > update the user space value prior to returning to user space. And as noted above: It must update while holding the proper locks. 
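The first rule above (all modifications of the futex value use atomic instructions, usually cmpxchg, in both kernel and user space) can be illustrated from the user-space side with C11 atomics. This is a minimal sketch assuming a PI-style word where 0 means unlocked and the owner's TID means locked; `pi_lock_fast` and `pi_unlock_fast` are hypothetical names, and the slow paths (FUTEX_LOCK_PI and FUTEX_UNLOCK_PI, where the kernel's serialized state comes in) are intentionally left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Fast-path acquire: atomically change the word from 0 to tid.
 * Returns false if the lock is owned; a real implementation would
 * then fall back to the FUTEX_LOCK_PI operation.
 */
static bool pi_lock_fast(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;

    assert(tid != 0);
    return atomic_compare_exchange_strong(word, &expected, tid);
}

/*
 * Fast-path release: atomically change the word from tid to 0.
 * Returns false if the word is anything other than the bare TID, for
 * example because the FUTEX_WAITERS bit is set; a real implementation
 * would then fall back to the FUTEX_UNLOCK_PI operation.
 */
static bool pi_unlock_fast(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;

    return atomic_compare_exchange_strong(word, &expected, 0);
}
```

The compare-and-exchange gives atomicity at the futex-value level only. Everything the kernel does to its own wait-queue state happens under the hash bucket locks, which is why "serialized" is the better word for that part.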
> >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart > >> .\" made the observation that "EINVAL is returned if the non-pi > >> .\" to pi or op pairing semantics are violated." > >> .\" Probably there needs to be a general statement about this > >> .\" requirement, probably located at about this point in the page. > >> .\" Darren (or someone else), care to take a shot at this? > > > > Well, that's hard to describe because the kernel only has a limited > > way of detecting such mismatches. It only can detect it when there are > > non PI waiters on a futex and a PI function is called or vice versa. > > Hmmm. Okay, I filed your comments away for reference, but > hopefully someone can help with som
Re: Next round: revised futex(2) man page for review
On Wed, 5 Aug 2015, Darren Hart wrote: > On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote: > > .\" FIXME XXX = Start of adapted Hart/Guniguntala text = > > .\" The following text is drawn from the Hart/Guniguntala paper > > .\" (listed in SEE ALSO), but I have reworded some pieces > > .\" significantly. Please check it. > > > >The PI futex operations described below differ from the other > >futex operations in that they impose policy on the use of the > >value of the futex word: > > > >* If the lock is not acquired, the futex word's value shall be > > 0. > > > >* If the lock is acquired, the futex word's value shall be the > > thread ID (TID; see gettid(2)) of the owning thread. > > > >* If the lock is owned and there are threads contending for the > > lock, then the FUTEX_WAITERS bit shall be set in the futex > > word's value; in other words, this value is: > > > > FUTEX_WAITERS | TID > > > > > >Note that a PI futex word never just has the value FUTEX_WAITERS, > >which is a permissible state for non-PI futexes. > > The second clause is inappropriate. I don't know if that was yours or > mine, but non-PI futexes do not have a kernel defined value policy, so > ==FUTEX_WAITERS cannot be a "permissible state" as any value is > permissible for non-PI futexes, and none have a kernel defined state. Depends. If the regular futex is configured as robust, then we have a kernel defined value policy as well. > > .\" FIXME I'm not quite clear on the meaning of the following sentence. > > .\" Is this trying to say that while blocked in a > > .\" FUTEX_WAIT_REQUEUE_PI, it could happen that another > > .\" task does a FUTEX_WAKE on uaddr that simply causes > > .\" a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI > > .\" does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI > > .\" opertion? Does it remain blocked, or does it unblock > > .\" In which case, what does user space see? 
> > > > The > > waiter can be removed from the wait on uaddr via > > FUTEX_WAKE without requeueing on uaddr2. > > Userspace should see the task wake and continue executing. This would > effectively be a cancelation operation - which I didn't think was > supported. Thomas? We probably never intended to support it, but looking at the code it works (did not try it though). It returns to user space with -EWOULDBLOCK. So it basically behaves like any other spurious wakeup. Thanks, tglx
Re: Next round: revised futex(2) man page for review
Hi Darren, Some of my comments below will refer to the reply I just sent to tglx (and the list) a few minutes ago. On 08/06/2015 12:21 AM, Darren Hart wrote: > On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote: >> Hello all, >> > > Michael, thank you for your diligence in following up and collecting > reviews. I've attempted to respond to what I was specifically called out > in or where I had something specific to add in addition to other > replies. Thanks! > After this, will you send another version (numbered for reference > maybe?) with any remaining FIXMEs that haven't yet been addressed > according to your accounting? Yes, I'll be sending out another draft (probably after a short delay, while I see what further responses come back on the mails I just sent.) In any case, the latest version of the page can be found at http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex >>Priority-inheritance futexes >>Linux supports priority-inheritance (PI) futexes in order to han‐ >>dle priority-inversion problems that can be encountered with nor‐ >>mal futex locks. Priority inversion is the problem that occurs >>when a high-priority task is blocked waiting to acquire a lock >>held by a low-priority task, while tasks at an intermediate pri‐ >>ority continuously preempt the low-priority task from the CPU. >>Consequently, the low-priority task makes no progress toward >>releasing the lock, and the high-priority task remains blocked. >> >>Priority inheritance is a mechanism for dealing with the prior‐ >>ity-inversion problem. With this mechanism, when a high-priority >>task becomes blocked by a lock held by a low-priority task, the >>latter's priority is temporarily raised to that of the former, so >>that it is not preempted by any intermediate level tasks, and can >>thus make progress toward releasing the lock. 
To be effective, >>priority inheritance must be transitive, meaning that if a high- >>priority task blocks on a lock held by a lower-priority task that >>is itself blocked by lock held by another intermediate-priority >>task (and so on, for chains of arbitrary length), then both of >>those task (or more generally, all of the tasks in a lock chain) >>have their priorities raised to be the same as the high-priority >>task. >> >> .\" FIXME XXX The following is my attempt at a definition of PI futexes, >> .\" based on mail discussions with Darren Hart. Does it seem okay? >> >>From a user-space perspective, what makes a futex PI-aware is a >>policy agreement between user space and the kernel about the >>value of the futex word (described in a moment), coupled with the >>use of the PI futex operations described below (in particular, >>FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI). > > Yes. Was this intended to be a complete opcode list? No. I'll remove that list, in case its misunderstood that way. > PI operations must > use paired operations. > > (FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI > FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI And now I've made that point explicitly in the page. See my comment lower down. > And their PRIVATE counterparts of course (which is assumed as it is a > flag to the opcode). Yes. But I don't think that needs to be called out explicitly here (?). >> .\" FIXME XXX = Start of adapted Hart/Guniguntala text = >> .\" The following text is drawn from the Hart/Guniguntala paper >> .\" (listed in SEE ALSO), but I have reworded some pieces >> .\" significantly. Please check it. >> >>The PI futex operations described below differ from the other >>futex operations in that they impose policy on the use of the >>value of the futex word: >> >>* If the lock is not acquired, the futex word's value shall be >> 0. 
>> >>* If the lock is acquired, the futex word's value shall be the >> thread ID (TID; see gettid(2)) of the owning thread. >> >>* If the lock is owned and there are threads contending for the >> lock, then the FUTEX_WAITERS bit shall be set in the futex >> word's value; in other words, this value is: >> >> FUTEX_WAITERS | TID >> >> >>Note that a PI futex word never just has the value FUTEX_WAITERS, >>which is a permissible state for non-PI futexes. > > The second clause is inappropriate. I don't know if that was yours or > mine, but non-PI futexes do not have a kernel defined value policy, so > ==FUTEX_WAITERS cannot be a "permissible state" as any value is > permissible for non-PI futexes, and none have a kernel defined state. > > Perhaps include a Note under the third bullet as: > >
Re: Next round: revised futex(2) man page for review
On 07/28/2015 11:03 PM, Thomas Gleixner wrote:
> On Tue, 28 Jul 2015, Peter Zijlstra wrote:
>
>> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>>>> FUTEX_WAKE (since Linux 2.6.0)
>>>>        This operation wakes at most val of the waiters that are
>>>>        waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>>>>        address uaddr. Most commonly, val is specified as either
>>>>        1 (wake up a single waiter) or INT_MAX (wake up all
>>>>        waiters). No guarantee is provided about which waiters are
>>>>        awoken (e.g., a waiter with a higher scheduling priority
>>>>        is not guaranteed to be awoken in preference to a waiter
>>>>        with a lower priority).
>>>
>>> That's only correct up to Linux 2.6.21.
>>>
>>> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
>>> threads this takes the nice level into account. Threads with the same
>>> priority are woken in FIFO order.
>>
>> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
>> 'wrong'.
>
> Indeed.
>
>> Also, this code seems to use plist, which means it won't do the right
>> thing for SCHED_DEADLINE either.
>>
>> Do we want to go fix that?
>
> I think so.

So, no change to this piece of text then?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: Next round: revised futex(2) man page for review
Hi Thomas, Thank you for the comments below. This helps hugely: more than 30 of my FIXMEs have now gone away! I have a few open questions, which you can find by searching for the string "???". If you would have a chance to look at those, I'd appreciate it. On 07/28/2015 10:23 PM, Thomas Gleixner wrote: > On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote: >>FUTEX_CLOCK_REALTIME (since Linux 2.6.28) >> This option bit can be employed only with the >> FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations. >> >> If this option is set, the kernel treats timeout as an >> absolute time based on CLOCK_REALTIME. >> >> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay? >> If this option is not set, the kernel treats timeout as >> relative time, measured against the CLOCK_MONOTONIC clock. > > That's correct. Thanks. >>The operation specified in futex_op is one of the following: >> >>FUTEX_WAIT (since Linux 2.6.0) >> This operation tests that the value at the futex word >> pointed to by the address uaddr still contains the >> expected value val, and if so, then sleeps awaiting >> FUTEX_WAKE on the futex word. The load of the value of >> the futex word is an atomic memory access (i.e., using >> atomic machine instructions of the respective architec‐ >> ture). This load, the comparison with the expected value, >> and starting to sleep are performed atomically and totally >> ordered with respect to other futex operations on the same >> futex word. If the thread starts to sleep, it is consid‐ >> ered a waiter on this futex word. If the futex value does >> not match val, then the call fails immediately with the >> error EAGAIN. 
>> >> The purpose of the comparison with the expected value is >> to prevent lost wake-ups: If another thread changed the >> value of the futex word after the calling thread decided >> to block based on the prior value, and if the other thread >> executed a FUTEX_WAKE operation (or similar wake-up) after >> the value change and before this FUTEX_WAIT operation, >> then the latter will observe the value change and will not >> start to sleep. >> >> If the timeout argument is non-NULL, its contents specify >> a relative timeout for the wait, measured according to the >> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay? > > Yes. Thanks. > >> CLOCK_MONOTONIC clock. (This interval will be rounded up >> to the system clock granularity, and kernel scheduling >> delays mean that the blocking interval may overrun by a >> small amount.) > > The given wait time will be rounded up to the system > clock granularity and is guaranteed not to expire > early. > > There are a gazillion reasons why it can expire late, but the > guarantee is that it never expires prematurely. > >>If timeout is NULL, the call blocks indef‐ >> initely. > > Right. Thanks. Reworded as you suggest. >> The arguments uaddr2 and val3 are ignored. >> >> >>FUTEX_WAKE (since Linux 2.6.0) >> This operation wakes at most val of the waiters that are >> waiting (e.g., inside FUTEX_WAIT) on the futex word at the >> address uaddr. Most commonly, val is specified as either >> 1 (wake up a single waiter) or INT_MAX (wake up all wait‐ >> ers). No guarantee is provided about which waiters are >> awoken (e.g., a waiter with a higher scheduling priority >> is not guaranteed to be awoken in preference to a waiter >> with a lower priority). > > That's only correct up to Linux 2.6.21. > > Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER > threads this takes the nice level into account. Threads with the same > priority are woken in FIFO order. So, this got picked up in a little subthread by Peter Zijsltra. 
I'll reply there. >> The arguments timeout, uaddr2, and val3 are ignored. > >> >>FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25) >> This operation creates a file descriptor that is associ‐ >> ated with the futex at uaddr. The caller must close the >> returned file descriptor after use. When another process >> or thread performs a FUTEX_WAKE on the futex word, the >> file descriptor indicates as being readable with >>
Re: Next round: revised futex(2) man page for review
On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote: > Hello all, > Michael, thank you for your diligence in following up and collecting reviews. I've attempted to respond to what I was specifically called out in or where I had something specific to add in addition to other replies. After this, will you send another version (numbered for reference maybe?) with any remaining FIXMEs that haven't yet been addressed according to your accounting? ... >Priority-inheritance futexes >Linux supports priority-inheritance (PI) futexes in order to han‐ >dle priority-inversion problems that can be encountered with nor‐ >mal futex locks. Priority inversion is the problem that occurs >when a high-priority task is blocked waiting to acquire a lock >held by a low-priority task, while tasks at an intermediate pri‐ >ority continuously preempt the low-priority task from the CPU. >Consequently, the low-priority task makes no progress toward >releasing the lock, and the high-priority task remains blocked. > >Priority inheritance is a mechanism for dealing with the prior‐ >ity-inversion problem. With this mechanism, when a high-priority >task becomes blocked by a lock held by a low-priority task, the >latter's priority is temporarily raised to that of the former, so >that it is not preempted by any intermediate level tasks, and can >thus make progress toward releasing the lock. To be effective, >priority inheritance must be transitive, meaning that if a high- >priority task blocks on a lock held by a lower-priority task that >is itself blocked by lock held by another intermediate-priority >task (and so on, for chains of arbitrary length), then both of >those task (or more generally, all of the tasks in a lock chain) >have their priorities raised to be the same as the high-priority >task. > > .\" FIXME XXX The following is my attempt at a definition of PI futexes, > .\" based on mail discussions with Darren Hart. Does it seem okay? 
> >From a user-space perspective, what makes a futex PI-aware is a >policy agreement between user space and the kernel about the >value of the futex word (described in a moment), coupled with the >use of the PI futex operations described below (in particular, >FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI). Yes. Was this intended to be a complete opcode list? PI operations must use paired operations. (FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI And their PRIVATE counterparts of course (which is assumed as it is a flag to the opcode). > > .\" FIXME XXX = Start of adapted Hart/Guniguntala text = > .\" The following text is drawn from the Hart/Guniguntala paper > .\" (listed in SEE ALSO), but I have reworded some pieces > .\" significantly. Please check it. > >The PI futex operations described below differ from the other >futex operations in that they impose policy on the use of the >value of the futex word: > >* If the lock is not acquired, the futex word's value shall be > 0. > >* If the lock is acquired, the futex word's value shall be the > thread ID (TID; see gettid(2)) of the owning thread. > >* If the lock is owned and there are threads contending for the > lock, then the FUTEX_WAITERS bit shall be set in the futex > word's value; in other words, this value is: > > FUTEX_WAITERS | TID > > >Note that a PI futex word never just has the value FUTEX_WAITERS, >which is a permissible state for non-PI futexes. The second clause is inappropriate. I don't know if that was yours or mine, but non-PI futexes do not have a kernel defined value policy, so ==FUTEX_WAITERS cannot be a "permissible state" as any value is permissible for non-PI futexes, and none have a kernel defined state. Perhaps include a Note under the third bullet as: Note: It is invalid for a PI futex word to have no owner and FUTEX_WAITERS set. 
> >With this policy in place, a user-space application can acquire a >not-acquired lock or release a lock that no other threads try to "that no other threads try to acquire" seems out of place. I think "atomic instructions" is sufficient to express how contention is handled. >acquire using atomic instructions executed in user space (e.g., a >compare-and-swap operation such as cmpxchg on the x86 architec‐ >ture). Acquiring a lock simply consists of using compare-and- >swap to atomically set the futex word's value to the caller's TID >if its previous value was 0. Releasing a lock req
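For illustration, the user-space fast path described in the quoted text could look roughly like the sketch below. It is not taken from the thread or from the man page: the helper names self_tid(), pi_owner(), pi_has_waiters(), pi_lock() and pi_unlock() are made up for the example, the futex is assumed to be process-shared (no FUTEX_PRIVATE_FLAG), and error handling is reduced to passing back the system call result. It only shows the value policy (0, TID, FUTEX_WAITERS | TID) and the pairing of FUTEX_LOCK_PI with FUTEX_UNLOCK_PI.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* FUTEX_LOCK_PI, FUTEX_UNLOCK_PI, FUTEX_WAITERS, FUTEX_TID_MASK */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Kernel thread ID of the caller (see gettid(2)). */
static uint32_t self_tid(void)
{
    return (uint32_t)syscall(SYS_gettid);
}

/* Owner TID stored in the futex word; 0 means "not acquired". */
static uint32_t pi_owner(_Atomic uint32_t *word)
{
    return atomic_load(word) & FUTEX_TID_MASK;
}

/* Nonzero if the kernel has marked the lock as contended. */
static int pi_has_waiters(_Atomic uint32_t *word)
{
    return (atomic_load(word) & FUTEX_WAITERS) != 0;
}

/* Hypothetical lock helper: try 0 -> TID in user space, else ask the kernel. */
static int pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;

    if (atomic_compare_exchange_strong(word, &expected, self_tid()))
        return 0;                       /* uncontended: no system call */

    /* Contended: the kernel sets FUTEX_WAITERS and blocks the caller. */
    return (int)syscall(SYS_futex, (uint32_t *)word, FUTEX_LOCK_PI,
                        0, NULL, NULL, 0);
}

/* Hypothetical unlock helper: try TID -> 0, else let the kernel hand over. */
static int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = self_tid();

    /* Fails if FUTEX_WAITERS is set, because the word is then not just TID. */
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;

    return (int)syscall(SYS_futex, (uint32_t *)word, FUTEX_UNLOCK_PI,
                        0, NULL, NULL, 0);
}
```

In the uncontended case neither helper enters the kernel; the compare-and-swap alone keeps the word within the policy above. The paired system calls are only reached when the word is neither 0 (on lock) nor exactly the caller's TID (on unlock).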
Re: Next round: revised futex(2) man page for review
On 07/29/2015 06:21 AM, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
>> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>>> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>>
>> ...
>>
>>>> FUTEX_REQUEUE (since Linux 2.6.0)
>>>> .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
>>>> .\" in general, or is this comment implicitly speaking about the
>>>> .\" condvar (?) use case? If the latter we might want to weaken the
>>>> .\" advice below a little.
>>>> .\" [Anyone else have input on this?]
>>>
>>> The condvar use case exposes the flaw nicely, but that's pretty much
>>> true for everything which wants a sane requeue operation.
>>
>> In an earlier discussion I argued this point (that FUTEX_REQUEUE is
>> broken and should not be used in new code) and someone argued strongly
>> against... stating that there were legitimate uses for it. Of course
>> I'm struggling to find the thread and the reference at the moment -
>> immensely useful, I know.
>>
>> I'll continue trying to find it and see if it can be useful here. I
>> believe Torvald was on the thread as well.
>>
>
> Found it on libc-alpha, here it is for reference:
>
> From: Rich Felker
> Date: Wed, 29 Oct 2014 22:43:17 -0400
> To: Darren Hart
> Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
>     Michael Kerrisk
> Subject: Re: Add futex wrapper to glibc?
>
> On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
> > > We are IMO at the stage where futex is stable, few things are
> > > changing, and with documentation in place, I would consider adding a
> > > futex wrapper.
> >
> > Yes, at least for the defined OP codes. New OPs may be added of
> > course, but that isn't a concern for supporting what exists today, and
> > doesn't break compatibility.
> >
> > I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
> > broken. FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
> > wrapper is one way to encourage developers to do the right thing
> > (don't expose the bad op in the header).
>
> You're mistaken here. There are plenty of valid ways to use
> FUTEX_REQUEUE - for example if the calling thread is requeuing the
> target(s) to a lock that the calling thread owns. Just because it
> doesn't meet the needs of the way glibc was using it internally
> doesn't mean it's useless for other applications.
>
> In any case, I don't think there's a proposal to intercept/modify the
> commands to futex, just to pass them through (and possibly do a
> cancellable syscall for some of them).
>
> Rich
>
>>>> Avoid using this operation. It is broken for its intended
>>>> purpose. Use FUTEX_CMP_REQUEUE instead.
>>>>
>>>> This operation performs the same task as
>>>> FUTEX_CMP_REQUEUE, except that no check is made using the
>>>> value in val3. (The argument val3 is ignored.)

Thanks, Darren, that's really helpful! I've removed the statement in
the man page that FUTEX_REQUEUE is broken.

By the way, Darren. There were a couple of FIXMEs in the page where
you are explicitly mentioned by name. Could you take a look at those?
Specifically, the large block of text starting at:

>> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
>> .\" based on mail discussions with Darren Hart. Does it seem okay?

(tglx looked at this and blessed it, but I'd like you also to check.)

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: Next round: revised futex(2) man page for review
On Tue, 28 Jul 2015, Darren Hart wrote:
> Found it on libc-alpha, here it is for reference:
>
> From: Rich Felker
> Date: Wed, 29 Oct 2014 22:43:17 -0400
> To: Darren Hart
> Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
>     Michael Kerrisk
> Subject: Re: Add futex wrapper to glibc?
>
> On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
> > > We are IMO at the stage where futex is stable, few things are
> > > changing, and with documentation in place, I would consider adding a
> > > futex wrapper.
> >
> > Yes, at least for the defined OP codes. New OPs may be added of
> > course, but that isn't a concern for supporting what exists today, and
> > doesn't break compatibility.
> >
> > I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
> > broken. FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
> > wrapper is one way to encourage developers to do the right thing
> > (don't expose the bad op in the header).
>
> You're mistaken here. There are plenty of valid ways to use
> FUTEX_REQUEUE - for example if the calling thread is requeuing the
> target(s) to a lock that the calling thread owns. Just because it
> doesn't meet the needs of the way glibc was using it internally
> doesn't mean it's useless for other applications.
>
> In any case, I don't think there's a proposal to intercept/modify the
> commands to futex, just to pass them through (and possibly do a
> cancellable syscall for some of them).

Fair enough. Did not think about the requeue to futex held by the
caller case. In that case FUTEX_REQUEUE works as advertised.

Thanks,

	tglx
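To make the difference between the two operations concrete, here is a minimal sketch of thin wrappers around the raw system call. The wrapper names futex_requeue() and futex_cmp_requeue() are hypothetical (glibc ships no futex wrapper), and the sketch assumes the usual argument layout in which the requeue limit travels in the timeout slot. The only point it illustrates is that FUTEX_CMP_REQUEUE rechecks the futex word against val3 and fails with EAGAIN on a mismatch, while FUTEX_REQUEUE performs no such check.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Wake up to nr_wake waiters on 'from' and move up to nr_requeue of the
 * remaining waiters to 'to'. No check of the value at 'from' is made;
 * val3 is ignored by the kernel for this operation.
 */
static long futex_requeue(uint32_t *from, uint32_t *to,
                          int nr_wake, int nr_requeue)
{
    return syscall(SYS_futex, from, FUTEX_REQUEUE, nr_wake,
                   (unsigned long)nr_requeue, to, 0);
}

/*
 * Same as above, but the kernel first checks that *from still holds
 * 'expected' (passed as val3); if not, the call fails with EAGAIN and
 * nobody is woken or requeued.
 */
static long futex_cmp_requeue(uint32_t *from, uint32_t *to,
                              int nr_wake, int nr_requeue,
                              uint32_t expected)
{
    return syscall(SYS_futex, from, FUTEX_CMP_REQUEUE, nr_wake,
                   (unsigned long)nr_requeue, to, expected);
}
```

In the case Rich describes, where the caller already owns the lock behind the target futex, the value check adds nothing, which is presumably why the unchecked variant is adequate there. With no waiters queued, both wrappers simply return 0.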
Re: Next round: revised futex(2) man page for review
On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> > On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>
> ...
>
> > > FUTEX_REQUEUE (since Linux 2.6.0)
> > > .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > > .\" in general, or is this comment implicitly speaking about the
> > > .\" condvar (?) use case? If the latter we might want to weaken the
> > > .\" advice below a little.
> > > .\" [Anyone else have input on this?]
> >
> > The condvar use case exposes the flaw nicely, but that's pretty much
> > true for everything which wants a sane requeue operation.
>
> In an earlier discussion I argued this point (that FUTEX_REQUEUE is
> broken and should not be used in new code) and someone argued strongly
> against... stating that there were legitimate uses for it. Of course
> I'm struggling to find the thread and the reference at the moment -
> immensely useful, I know.
>
> I'll continue trying to find it and see if it can be useful here. I
> believe Torvald was on the thread as well.
>

Found it on libc-alpha, here it is for reference:

From: Rich Felker
Date: Wed, 29 Oct 2014 22:43:17 -0400
To: Darren Hart
Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
    Michael Kerrisk
Subject: Re: Add futex wrapper to glibc?

On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
> > We are IMO at the stage where futex is stable, few things are
> > changing, and with documentation in place, I would consider adding a
> > futex wrapper.
>
> Yes, at least for the defined OP codes. New OPs may be added of
> course, but that isn't a concern for supporting what exists today, and
> doesn't break compatibility.
>
> I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
> broken. FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
> wrapper is one way to encourage developers to do the right thing
> (don't expose the bad op in the header).

You're mistaken here. There are plenty of valid ways to use
FUTEX_REQUEUE - for example if the calling thread is requeuing the
target(s) to a lock that the calling thread owns. Just because it
doesn't meet the needs of the way glibc was using it internally
doesn't mean it's useless for other applications.

In any case, I don't think there's a proposal to intercept/modify the
commands to futex, just to pass them through (and possibly do a
cancellable syscall for some of them).

Rich

> > > Avoid using this operation. It is broken for its intended
> > > purpose. Use FUTEX_CMP_REQUEUE instead.
> > >
> > > This operation performs the same task as
> > > FUTEX_CMP_REQUEUE, except that no check is made using the
> > > value in val3. (The argument val3 is ignored.)

--
Darren Hart
Intel Open Source Technology Center
Re: Next round: revised futex(2) man page for review
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:

...

> > FUTEX_REQUEUE (since Linux 2.6.0)
> > .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > .\" in general, or is this comment implicitly speaking about the
> > .\" condvar (?) use case? If the latter we might want to weaken the
> > .\" advice below a little.
> > .\" [Anyone else have input on this?]
>
> The condvar use case exposes the flaw nicely, but that's pretty much
> true for everything which wants a sane requeue operation.

In an earlier discussion I argued this point (that FUTEX_REQUEUE is
broken and should not be used in new code) and someone argued strongly
against... stating that there were legitimate uses for it. Of course
I'm struggling to find the thread and the reference at the moment -
immensely useful, I know.

I'll continue trying to find it and see if it can be useful here. I
believe Torvald was on the thread as well.

> > Avoid using this operation. It is broken for its intended
> > purpose. Use FUTEX_CMP_REQUEUE instead.
> >
> > This operation performs the same task as
> > FUTEX_CMP_REQUEUE, except that no check is made using the
> > value in val3. (The argument val3 is ignored.)

--
Darren Hart
Intel Open Source Technology Center
Re: Next round: revised futex(2) man page for review
On Tue, 2015-07-28 at 22:45 +0200, Peter Zijlstra wrote:
> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.

Ick, I don't look forward to seeing nice futex plists converted into
rbtrees. As opposed to, e.g., rtmutexes, there are a few caveats:

- Dealing with the top_waiter in rtmutexes is always easy, but in
  futexes we need to deal with keys, so caching the leftmost won't work
  as nicely.

- This will bloat things like futex_wake, where O(logN) is not suited
  for FIFO iteration. And iterating linked lists is, in essence, all
  that we really do when calling futex(2).

I have to wonder about the extra overhead added by these points. I do
understand the dl concern, nonetheless.
Re: Next round: revised futex(2) man page for review
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> > > FUTEX_WAKE (since Linux 2.6.0)
> > >        This operation wakes at most val of the waiters that are
> > >        waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> > >        address uaddr. Most commonly, val is specified as either
> > >        1 (wake up a single waiter) or INT_MAX (wake up all
> > >        waiters). No guarantee is provided about which waiters are
> > >        awoken (e.g., a waiter with a higher scheduling priority
> > >        is not guaranteed to be awoken in preference to a waiter
> > >        with a lower priority).
> >
> > That's only correct up to Linux 2.6.21.
> >
> > Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> > threads this takes the nice level into account. Threads with the same
> > priority are woken in FIFO order.
>
> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
> 'wrong'.

Indeed.

> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.
>
> Do we want to go fix that?

I think so.

Thanks,

	tglx
Re: Next round: revised futex(2) man page for review
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> > FUTEX_WAKE (since Linux 2.6.0)
> >        This operation wakes at most val of the waiters that are
> >        waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> >        address uaddr. Most commonly, val is specified as either
> >        1 (wake up a single waiter) or INT_MAX (wake up all
> >        waiters). No guarantee is provided about which waiters are
> >        awoken (e.g., a waiter with a higher scheduling priority
> >        is not guaranteed to be awoken in preference to a waiter
> >        with a lower priority).
>
> That's only correct up to Linux 2.6.21.
>
> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> threads this takes the nice level into account. Threads with the same
> priority are woken in FIFO order.

Maybe don't mention the effects of SCHED_OTHER, order by nice value is
'wrong'.

Also, this code seems to use plist, which means it won't do the right
thing for SCHED_DEADLINE either.

Do we want to go fix that?
Re: Next round: revised futex(2) man page for review
On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote: >FUTEX_CLOCK_REALTIME (since Linux 2.6.28) > This option bit can be employed only with the > FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations. > > If this option is set, the kernel treats timeout as an > absolute time based on CLOCK_REALTIME. > > .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay? > If this option is not set, the kernel treats timeout as > relative time, measured against the CLOCK_MONOTONIC clock. That's correct. >The operation specified in futex_op is one of the following: > >FUTEX_WAIT (since Linux 2.6.0) > This operation tests that the value at the futex word > pointed to by the address uaddr still contains the > expected value val, and if so, then sleeps awaiting > FUTEX_WAKE on the futex word. The load of the value of > the futex word is an atomic memory access (i.e., using > atomic machine instructions of the respective architec‐ > ture). This load, the comparison with the expected value, > and starting to sleep are performed atomically and totally > ordered with respect to other futex operations on the same > futex word. If the thread starts to sleep, it is consid‐ > ered a waiter on this futex word. If the futex value does > not match val, then the call fails immediately with the > error EAGAIN. > > The purpose of the comparison with the expected value is > to prevent lost wake-ups: If another thread changed the > value of the futex word after the calling thread decided > to block based on the prior value, and if the other thread > executed a FUTEX_WAKE operation (or similar wake-up) after > the value change and before this FUTEX_WAIT operation, > then the latter will observe the value change and will not > start to sleep. > > If the timeout argument is non-NULL, its contents specify > a relative timeout for the wait, measured according to the > .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay? Yes. > CLOCK_MONOTONIC clock. 
(This interval will be rounded up > to the system clock granularity, and kernel scheduling > delays mean that the blocking interval may overrun by a > small amount.) The given wait time will be rounded up to the system clock granularity and is guaranteed not to expire early. There are a gazillion reasons why it can expire late, but the guarantee is that it never expires prematurely. > If timeout is NULL, the call blocks indef‐ > initely. Right. > The arguments uaddr2 and val3 are ignored. > > >FUTEX_WAKE (since Linux 2.6.0) > This operation wakes at most val of the waiters that are > waiting (e.g., inside FUTEX_WAIT) on the futex word at the > address uaddr. Most commonly, val is specified as either > 1 (wake up a single waiter) or INT_MAX (wake up all wait‐ > ers). No guarantee is provided about which waiters are > awoken (e.g., a waiter with a higher scheduling priority > is not guaranteed to be awoken in preference to a waiter > with a lower priority). That's only correct up to Linux 2.6.21. Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER threads this takes the nice level into account. Threads with the same priority are woken in FIFO order. > The arguments timeout, uaddr2, and val3 are ignored. > >FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25) > This operation creates a file descriptor that is associ‐ > ated with the futex at uaddr. The caller must close the > returned file descriptor after use. When another process > or thread performs a FUTEX_WAKE on the futex word, the > file descriptor indicates as being readable with > select(2), poll(2), and epoll(7) > > The file descriptor can be used to obtain asynchronous > notifications: if val is nonzero, then when another > process or thread executes a FUTEX_WAKE, the caller will > receive the signal number that was passed in val. > > The arguments timeout, uaddr2 and val3 are ignored. > > .\" FIXME(Torvald) We never define "upped". Maybe just remove the > .\" following sentence? 
> To prevent race
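The two FUTEX_WAIT properties discussed in this message (an immediate EAGAIN when the futex word no longer holds the expected value, and a relative CLOCK_MONOTONIC timeout that may expire late but not early) can be observed with a small sketch like the one below. It is an illustration only: futex_wait() and timed_wait_ns() are hypothetical helper names, and the retry on a wake-up or EINTR is a simplification chosen for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Block while *uaddr == expected; rel_timeout is relative, NULL = forever. */
static long futex_wait(uint32_t *uaddr, uint32_t expected,
                       const struct timespec *rel_timeout)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                   rel_timeout, NULL, 0);
}

/*
 * Wait on uaddr until a wait of timeout_ns times out, and return the
 * elapsed CLOCK_MONOTONIC time in nanoseconds. Returns -1 if the wait
 * fails for another reason (for instance EAGAIN on a value mismatch).
 */
static int64_t timed_wait_ns(uint32_t *uaddr, uint32_t expected,
                             int64_t timeout_ns)
{
    struct timespec rel = {
        .tv_sec = (time_t)(timeout_ns / 1000000000LL),
        .tv_nsec = (long)(timeout_ns % 1000000000LL),
    };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (;;) {
        long rc = futex_wait(uaddr, expected, &rel);

        if (rc == -1 && errno == ETIMEDOUT)
            break;                      /* the timeout expired */
        if (rc == -1 && errno != EINTR)
            return -1;                  /* e.g. EAGAIN: value mismatch */
        /* Woken (possibly spuriously) or interrupted: wait again. */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling timed_wait_ns(&word, 0, 20000000) on a word that holds 0 and that nobody wakes should report at least 20 ms, and usually slightly more, in line with the "never expires prematurely" guarantee stated above.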
Re: Revised futex(2) man page for review
On 07/28/2015 07:52 PM, Davidlohr Bueso wrote:
> On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
>> Maybe you still have some further improvements for the paragraph?
>
> Nah, this is fine enough. Looks good.

Okay. Thanks. I added a Reviewed-by: for you.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: Revised futex(2) man page for review
On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
> Maybe you still have some further improvements for the paragraph?

Nah, this is fine enough. Looks good.

Thanks,
Davidlohr
Re: Revised futex(2) man page for review
Hi David, On 07/28/2015 05:16 AM, Davidlohr Bueso wrote: > On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote: >> Hi David, >> >> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote: >>> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote: >>> The condition is represented by the futex word, which is an address in memory supplied to the futex() system call, and the value at this mem‐ ory location. (While the virtual addresses for the same memory in sep‐ arate processes may not be equal, the kernel maps them internally so that the same memory mapped in different locations will correspond for futex() calls.) When executing a futex operation that requests to block a thread, the kernel will only block if the futex word has the value that the calling >>> >>> Given the use of "word", you should probably state right away that >>> futexes are only 32bit. >> >> So, I made the opening sentence here: >> >>The condition is represented by the futex word, which is an >>address in memory supplied to the futex() system call, and the >>32-bit value at this memory location. >> >> Okay? > > I think we can still improve :) > > I've re-read the whole first paragraphs, and have a few comments that > touch upon this specific wording. Lets see. You have: > >>The futex() system call provides a method for waiting until a >> certain >>condition becomes true. It is typically used as a blocking >> construct >>in the context of shared-memory synchronization: The program >> implements >>the majority of the synchronization in user space, and uses one >> of >>operations of the system call when it is likely that it has to >> block >>for a longer time until the condition becomes true. The program >> uses >>another operation of the system call to wake anyone waiting for a >> par‐ >>ticular condition. > > I've rephrased the next paragraph. How about adding this to follow? > >A futex is in essence a 32-bit user-space address. 
> All futex operations and conditions are governed by this variable,
> from now on referred to as 'futex word'. As such, a futex is
> identified by the address in shared memory, which may or may not be
> shared between different processes. For virtual memory, the kernel
> will internally handle and resolve the latter. This futex word, along
> with the value at its memory location, are supplied to the futex()
> system call.
>
> Feel free to reword however you think is better.

I agree with you that that second paragraph is a bit broken. But, like
Heinrich, I'm confused by this term "32-bit ... address". I've rewritten
the paragraph as:

    A futex is a 32-bit value (referred to below as a futex word) whose
    address is supplied to the futex() system call. (Futexes are 32
    bits in size on all platforms, including 64-bit systems.) All futex
    operations are governed by this value. In order to share a futex
    between processes, the futex is placed in a region of shared
    memory, created using (for example) mmap(2) or shmat(2). (Thus the
    futex word may have different virtual addresses in different
    processes, but these addresses all refer to the same location in
    physical memory.)

Maybe you still have some further improvements for the paragraph?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
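A minimal sketch of the sharing described in the rewritten paragraph: one futex word placed in a shared mapping and used by two processes. It assumes Linux on a 64-bit target and a GCC or Clang compiler (for the `__atomic` builtins). The helper names `futex()` and `shared_futex_demo()` are hypothetical; glibc provides no such wrapper, so the call goes through syscall(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw system call. */
static long futex(uint32_t *uaddr, int op, uint32_t val)
{
    /* The kernel rejects a futex word that is not 4-byte aligned. */
    assert(((uintptr_t)uaddr % sizeof(uint32_t)) == 0);
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Parent and child share one futex word through an anonymous
 * MAP_SHARED mapping. Returns the final value of the word
 * (1 on success) or -1 if setup failed.
 */
static int shared_futex_demo(void)
{
    uint32_t *word = mmap(NULL, sizeof(*word), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (word == MAP_FAILED)
        return -1;
    *word = 0;

    pid_t pid = fork();
    if (pid == -1) {
        munmap(word, sizeof(*word));
        return -1;
    }
    if (pid == 0) {
        /* Child: publish the new value first, then wake one waiter. */
        __atomic_store_n(word, 1, __ATOMIC_SEQ_CST);
        futex(word, FUTEX_WAKE, 1);
        _exit(0);
    }

    /*
     * Parent: sleep only while the word still holds 0. If the child
     * already changed it, FUTEX_WAIT fails with EAGAIN and the loop
     * condition ends the wait, so no wake-up is lost.
     */
    while (__atomic_load_n(word, __ATOMIC_SEQ_CST) == 0) {
        if (futex(word, FUTEX_WAIT, 0) == -1 &&
            errno != EAGAIN && errno != EINTR)
            break;
    }

    int result = (int)__atomic_load_n(word, __ATOMIC_SEQ_CST);
    waitpid(pid, NULL, 0);
    munmap(word, sizeof(*word));
    return result;
}
```

Neither call sets FUTEX_PRIVATE_FLAG here, since the word is shared between processes. A mapping created with shmat(2) would serve the same purpose.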
Re: Aw: Re: Revised futex(2) man page for review
On Tue, 2015-07-28 at 07:44 +0200, Heinrich Schuchardt wrote: > Hello David, > > >> A futex is in essence a 32-bit user-space address. > I know what a 32 bit integer is. > I am not aware of 32 bit addresses on my 64 bit operating system. Well I am referring to in the context of a user-space address, such as a 32-bit lock ('int'), but yes, my text is misleading. In fact we obviously need to cast to the word size for doing gup_fast, among other tasks. Thanks, Davidlohr
Re: Revised futex(2) man page for review
On 07/28/2015 04:52 AM, Davidlohr Bueso wrote: > On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote: >> SEE ALSO >>get_robust_list(2), restart_syscall(2), futex(7) > > For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which > is a common entry point. Thanks. Added. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
Re: Revised futex(2) man page for review
On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote: > Hi David, > > On 03/31/2015 04:45 PM, Davidlohr Bueso wrote: > > On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote: > > > >>The condition is represented by the futex word, which is an address > >> in > >>memory supplied to the futex() system call, and the value at this > >> mem‐ > >>ory location. (While the virtual addresses for the same memory in > >> sep‐ > >>arate processes may not be equal, the kernel maps them > >> internally so > >>that the same memory mapped in different locations will correspond > >> for > >>futex() calls.) > >> > >>When executing a futex operation that requests to block a thread, > >> the > >>kernel will only block if the futex word has the value that the > >> calling > > > > Given the use of "word", you should probably state right away that > > futexes are only 32bit. > > So, I made the opening sentence here: > >The condition is represented by the futex word, which is an >address in memory supplied to the futex() system call, and the >32-bit value at this memory location. > > Okay? I think we can still improve :) I've re-read the whole first paragraphs, and have a few comments that touch upon this specific wording. Lets see. You have: >The futex() system call provides a method for waiting until a certain >condition becomes true. It is typically used as a blocking construct >in the context of shared-memory synchronization: The program implements >the majority of the synchronization in user space, and uses one of >operations of the system call when it is likely that it has to block >for a longer time until the condition becomes true. The program uses >another operation of the system call to wake anyone waiting for a par‐ >ticular condition. I've rephrased the next paragraph. How about adding this to follow? A futex is in essence a 32-bit user-space address. All futex operations and conditions are governed by this variable, from now on referred to as 'futex word'. 
As such, a futex is identified by the address in shared memory, which may or may not be shared between different processes. For virtual memory, the kernel will internally handle and resolve the latter. This futex word, along with the value at its memory location, are supplied to the futex() system call. Feel free to reword however you think is better. Thanks, Davidlohr
Re: Revised futex(2) man page for review
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote: > SEE ALSO >get_robust_list(2), restart_syscall(2), futex(7) For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which is a common entry point.
Re: Next round: revised futex(2) man page for review
On 07/27/2015 04:17 PM, Heinrich Schuchardt wrote: > instruction. A thread maybe unable > > to << missing word > > acquire a lock because it is > already acquired by another thread. It then may pass the lock's > flag as futex word and the value representing the acquired state > as the expected value to a futex() wait operation. Thanks, Heinrich. Fixed. Cheers, Michael
Next round: revised futex(2) man page for review
Hello all, From a draft sent out in March, I got a few useful comments that I've now incorporated into this draft. And I got some complaints from people who did not want to read groff source. My point was that there are a bunch of FIXMEs in the page source that I wanted people to look at... Anyway, this time, I will take a different tack, interspersing the FIXMEs in a rendered version of the page. I'd greatly appreciate help with those FIXMEs. The current page source can be found in a branch at http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex === As becomes quickly obvious upon reading it, the current futex(2) man page is in a sorry state, lacking many important details, and also the various additions that have been made to the interface over the last years. I've been working on revising it, first of all based on input I got in response to a request for help last year (http://thread.gmane.org/gmane.linux.kernel/1703405), especially taking Thomas Gleixner's input (http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) into account. I also got some further offlist input from Darren Hart, Torvald Riegel, and Davidlohr Bueso that has been incorporated into the revised draft. Other than that, I got some useful info out of Ulrich Drepper's paper (cited at the end of the page) and one or two web pages (cited in the page source). The page has now increased in size by a factor of about 5, but is far from complete. In particular, as I reworked the page, there were many details that I was not 100% certain of, and I have added FIXME markers to the page source. In addition, Torvald added some text, and a few more FIXMEs. Some of the FIXMEs are trivial, as in: I'd like confirmation that I have correctly captured a technical detail. Others are more substantial, probably requiring the addition of further text. I appreciate that there are probably other things that can be improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to resolve as many of the FIXMEs (and any other problems that people see) as possible in the existing text. I need help with that. (And I know that dealing with that help, if I get it, will in itself be quite a task, which is why I have been delaying it for many weeks now, as my time has been rather limited recently.) So, please take a look at the page below. At this point, I would most especially appreciate help with the FIXMEs. Cheers, Michael FUTEX(2) Linux Programmer's Manual FUTEX(2) NAME futex - fast user-space locking SYNOPSIS #include #include int futex(int *uaddr, int futex_op, int val, const struct timespec *timeout, /* or: uint32_t val2 */ int *uaddr2, int val3); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization: The program implements the majority of the synchronization in user space, and uses one of the operations of the system call when it is likely that it has to block for a longer time until the condition becomes true. The program uses another operation of the system call to wake anyone waiting for a particular condition. The condition is represented by the futex word, which is an address in memory supplied to the futex() system call, and the 32-bit value at this memory location. (While the virtual addresses for the same physical memory address in separate processes may be different, the same physical address may be shared by the processes using mmap(2).) When executing a futex operation that requests to block a thread, the kernel will block only if the futex word has the value that the calling thread supplied as expected value.
The load from the futex word, the comparison with the expected value, and the actual blocking will happen atomically and totally ordered with respect to concurrently executing futex operations on the same futex word. Thus, the futex word is used to connect the synchro‐ nization in user space with the implementation of blocking by the kernel; similar to an atomic compare-and-exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare-and-block operation. One example use of futexes is implementing locks. The state of the lock (i.e., acquired or not acquired) can be represented as an atomically accessed flag in shared memory. In the uncontended case, a thread can access or modify th
Re: Revised futex(2) man page for review
Hi Peter, On 03/28/2015 01:03 PM, Peter Zijlstra wrote: > On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote: >>FUTEX_WAIT (since Linux 2.6.0) >> This operation tests that the value at the futex word pointed >> to >> by the address uaddr still contains the expected value val, >> and >> if so, then sleeps awaiting FUTEX_WAKE on the futex word. >> The >> load of the value of the futex word is an atomic memory >> access >> (i.e., using atomic machine instructions of the >> respective >> architecture). This load, the comparison with the >> expected >> value, and starting to sleep are performed atomically >> and >> totally ordered with respect to other futex operations on >> the >> same futex word. If the thread starts to sleep, it is >> consid‐ >> ered a waiter on this futex word. If the futex value does >> not >> match val, then the call fails immediately with the >> error >> EAGAIN. >> >> The purpose of the comparison with the expected value is to >> pre‐ >> vent lost wake-ups: If another thread changed the value of >> the >> futex word after the calling thread decided to block based >> on >> the prior value, and if the other thread executed a >> FUTEX_WAKE >> operation (or similar wake-up) after the value change and >> before >> this FUTEX_WAIT operation, then the latter will observe >> the >> value change and will not start to sleep. >> >> If the timeout argument is non-NULL, its contents specify a >> rel‐ >> ative timeout for the wait, measured according to >> the >> CLOCK_MONOTONIC clock. (This interval will be rounded up to >> the >> system clock granularity, and kernel scheduling delays mean >> that >> the blocking interval may overrun by a small amount.) If >> time‐ >> out is NULL, the call blocks indefinitely. > > Would it not be better to only state that the wait will not return > before the timeout -- unless woken -- and not bother with clock > granularity and scheduling delays? 
Many of the pages that talk about system calls that have timeouts carry similar language, since people are often confused about what the timeout means (e.g., that it's an upper limit, not a minimum; or that it's precise to some very small granularity). Why do you think the language here is a problem? Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
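The timeout behaviour under discussion can be observed with a small sketch. It assumes Linux on a 64-bit target (so that `struct timespec` matches what `SYS_futex` expects); the helper names `futex_wait_ns()` and `timed_out_wait_ns()` are hypothetical. The first helper issues a FUTEX_WAIT with a relative timeout, the second measures on CLOCK_MONOTONIC how long a wait that nobody wakes actually lasted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * FUTEX_WAIT with a relative timeout given in nanoseconds.
 * Returns the raw syscall(2) result: 0 when woken, -1 with errno
 * set otherwise (EAGAIN if *uaddr != expected, ETIMEDOUT on expiry).
 */
static long futex_wait_ns(uint32_t *uaddr, uint32_t expected,
                          long long timeout_ns)
{
    struct timespec ts = {
        .tv_sec  = (time_t)(timeout_ns / 1000000000LL),
        .tv_nsec = (long)(timeout_ns % 1000000000LL),
    };

    assert(timeout_ns >= 0);
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &ts, NULL, 0);
}

/*
 * Block on a word that nobody wakes. Returns the elapsed time in
 * nanoseconds, or -1 if the wait ended for any reason other than
 * ETIMEDOUT.
 */
static long long timed_out_wait_ns(long long timeout_ns)
{
    uint32_t word = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long rc = futex_wait_ns(&word, 0, timeout_ns);
    int err = errno;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (rc != -1 || err != ETIMEDOUT)
        return -1;
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
           (t1.tv_nsec - t0.tv_nsec);
}
```

The elapsed time returned by `timed_out_wait_ns()` may exceed the requested interval by an arbitrary amount, but it should not fall short of it. A call whose expected value does not match the futex word does not sleep at all and fails with EAGAIN.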
Re: Revised futex(2) man page for review
On 04/15/2015 12:28 PM, Torvald Riegel wrote: > On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote: >> On Sat, 28 Mar 2015, Peter Zijlstra wrote: >>> On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote: So, please take a look at the page below. At this point, I would most especially appreciate help with the FIXMEs. >>> >>> For people who cannot read that troff gibberish (me).. >> >> Ditto :) >> >>> NOTES >>>Glibc does not provide a wrapper for this system call; call it >>> using >>>syscall(2). >> >> You might mention that pthread_mutex, pthread_condvar interfaces are >> high level wrappers for the syscall and recommended to be used for >> normal use cases. IIRC unnamed semaphores are implemented with futexes >> as well. > > If we add this, I'd rephrase it to something like that there are > high-level programming abstractions such as the pthread_condvar > interfaces or semaphores that are implemented using the syscall and that > are typically a better fit for normal use cases. I'd consider only the > condvars as something like a wrapper, or targeting a similar use case. I added this under NOTES: Various higher-level programming abstractions are implemented via futexes, including POSIX threads mutexes and condition variables, as well as POSIX semaphores. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
Hello Pavel,

On 04/27/2015 10:37 PM, Pavel Machek wrote:
> Hi!
>
>>     The FUTEX_WAIT_OP operation is equivalent to execute the
>>     following code atomically and totally ordered with respect
>>     to other futex operations on any of the two supplied futex
>>     words:
>
> "to executing"?

Yep. Fixed.

>>     The operation and comparison that are to be performed are
>>     encoded in the bits of the argument val3. Pictorially, the
>>     encoding is:
>>
>>         +---+---+-----------+-----------+
>>         |op |cmp|   oparg   |  cmparg   |
>>         +---+---+-----------+-----------+
>>           4   4       12          12    <== # of bits
>>
>
> :-)
>
>> RETURN VALUE
>>     In the event of an error, all operations return -1 and set errno
>>     to indicate the cause of the error. The return value on success
>>     depends on the operation, as described in the following list:
>
> Did you say (at the beginning) that there is no glibc wrapper?

Yes, this could be clearer. I changed it to

    RETURN VALUE
        In the event of an error (and assuming that futex() was
        invoked via syscall(2)), all operations return -1 and set
        errno to indicate the cause of the error.

>>     EINVAL The operation in futex_op is one of those that employs a
>>            timeout, but the supplied timeout argument was invalid
>>            (tv_sec was less than zero, or tv_nsec was not less than
>>            1000,000,000).
>
> 1,000...?

Fixed.

Thanks for the comments!

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
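A minimal sketch of the val3 layout pictured in the quoted text, reading the diagram from the most significant bits (op) down to the least significant (cmparg). That reading is an assumption here, though it agrees with the FUTEX_OP() macro in <linux/futex.h>, which belongs to FUTEX_WAKE_OP. The type `struct wake_op` and the functions `encode_val3()` and `decode_val3()` are hypothetical names for illustration. The fields are treated as plain unsigned values; how the kernel interprets negative arguments is outside this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* The four fields packed into val3, widest-to-narrowest as pictured. */
struct wake_op {
    unsigned op;      /*  4 bits: operation to perform     */
    unsigned cmp;     /*  4 bits: comparison to perform    */
    unsigned oparg;   /* 12 bits: argument to the operation */
    unsigned cmparg;  /* 12 bits: argument to the comparison */
};

/* Pack the fields: op in bits 31..28, cmp in 27..24,
 * oparg in 23..12, cmparg in 11..0. */
static uint32_t encode_val3(struct wake_op f)
{
    assert(f.op <= 0xf && f.cmp <= 0xf);
    assert(f.oparg <= 0xfff && f.cmparg <= 0xfff);
    return ((uint32_t)f.op << 28) | ((uint32_t)f.cmp << 24) |
           ((uint32_t)f.oparg << 12) | (uint32_t)f.cmparg;
}

/* Inverse of encode_val3(): split a val3 word back into its fields. */
static struct wake_op decode_val3(uint32_t val3)
{
    struct wake_op f = {
        .op     = (val3 >> 28) & 0xf,
        .cmp    = (val3 >> 24) & 0xf,
        .oparg  = (val3 >> 12) & 0xfff,
        .cmparg = val3 & 0xfff,
    };
    return f;
}
```

For instance, op = 1, cmp = 4, oparg = 1, cmparg = 0 packs to 0x14001000. The four widths add up to 32 bits, so the fields fill the whole word.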
Re: Revised futex(2) man page for review
Hi David, On 03/31/2015 04:45 PM, Davidlohr Bueso wrote: > On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote: > >>The condition is represented by the futex word, which is an address >> in >>memory supplied to the futex() system call, and the value at this >> mem‐ >>ory location. (While the virtual addresses for the same memory in >> sep‐ >>arate processes may not be equal, the kernel maps them internally >> so >>that the same memory mapped in different locations will correspond >> for >>futex() calls.) >> >>When executing a futex operation that requests to block a thread, >> the >>kernel will only block if the futex word has the value that the >> calling > > Given the use of "word", you should probably state right away that > futexes are only 32bit. So, I made the opening sentence here: The condition is represented by the futex word, which is an address in memory supplied to the futex() system call, and the 32-bit value at this memory location. Okay? Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
On 03/31/2015 03:48 AM, Rusty Russell wrote: > "Michael Kerrisk (man-pages)" writes: >> When executing a futex operation that requests to block a thread, >> the kernel will only block if the futex word has the value that the >> calling thread supplied as expected value. >> The load from the futex word, the comparison with >> the expected value, >> and the actual blocking will happen atomically and totally >> ordered with respect to concurrently executing futex operations >> on the same futex word, >> such as operations that wake threads blocked on this futex word. >> Thus, the futex word is used to connect the synchronization in user spac > > Missing 'e' in "space". Already fixed. >> .\" FIXME Please confirm that the following is correct: >> No guarantee is provided about which waiters are awoken >> (e.g., a waiter with a higher scheduling priority is not guaranteed >> to be awoken in preference to a waiter with a lower priority). > > This is true. Thanks! FIXME removed. Cheers, Michael > I didn't read the rest, as that stuff was all written by others. > Documenting them is pretty heroic; good job! > > Thanks, > Rusty. > -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
Hi David,

On 03/31/2015 10:36 PM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 13:03 +0100, Peter Zijlstra wrote:
>>>     If the timeout argument is non-NULL, its contents specify a
>>>     relative timeout for the wait, measured according to the
>>>     CLOCK_MONOTONIC clock. (This interval will be rounded up to
>>>     the system clock granularity, and kernel scheduling delays
>>>     mean that the blocking interval may overrun by a small
>>>     amount.) If timeout is NULL, the call blocks indefinitely.
>>
>> Would it not be better to only state that the wait will not return
>> before the timeout -- unless woken -- and not bother with clock
>> granularity and scheduling delays?
>
> Yeah, similarly we also have this:
>
>     FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
>         This option bit can be employed with all futex operations.
>         It tells the kernel that the futex is process-private and
>         not shared with another process (i.e., it is only being
>         used for synchronization between threads of the same
>         process). This allows the kernel to choose the fast path
>         for validating the user-space address and avoids expensive
>         VMA lookups, taking reference counts on file backing store,
>         and so on.
>
> This to me reads a bit too much into the kernel (fastpath, refcnt,
> vmas). Why not just mention that it avoids overhead in the kernel or
> something? I don't recall any manpage mentioning such details, but I
> could be wrong.

Thanks. Agreed. I changed this to

    This allows the kernel to make some additional performance
    optimizations.

> In any case it's a nit, the whole doc is pretty good and
> I hope you can merge it soon and then just increment ;)

I ran out of time and energy at a certain point. And also got a little
disheartened that I got more people complaining about groff markup
than actually looked at the FIXMEs in the page source :-).

I'll try to reboot the process.
Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
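As a sketch of how the option bit discussed above is used: FUTEX_PRIVATE_FLAG is OR-ed into the operation code, which is how <linux/futex.h> defines FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE. The wrapper name `futex_private()` is hypothetical, and the example assumes Linux on a 64-bit target. It is only appropriate when every user of the futex word is a thread of the same process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: issue a futex operation with the
 * process-private option bit set. `op` is a plain operation code
 * such as FUTEX_WAIT or FUTEX_WAKE.
 */
static long futex_private(uint32_t *uaddr, int op, uint32_t val)
{
    /* The kernel rejects a futex word that is not 4-byte aligned. */
    assert(((uintptr_t)uaddr % sizeof(uint32_t)) == 0);
    return syscall(SYS_futex, uaddr, op | FUTEX_PRIVATE_FLAG, val,
                   NULL, NULL, 0);
}
```

With no waiters on the word, `futex_private(&word, FUTEX_WAKE, 1)` returns 0, the number of waiters woken. A private wait and a shared wake on the same word are not expected to match each other, so both sides should use the flag consistently.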
Re: Revised futex(2) man page for review
Hi!

>     The FUTEX_WAIT_OP operation is equivalent to execute the
>     following code atomically and totally ordered with respect to
>     other futex operations on any of the two supplied futex words:

"to executing"?

>     The operation and comparison that are to be performed are
>     encoded in the bits of the argument val3. Pictorially, the
>     encoding is:
>
>         +---+---+-----------+-----------+
>         |op |cmp|   oparg   |  cmparg   |
>         +---+---+-----------+-----------+
>           4   4       12          12    <== # of bits
>

:-)

> RETURN VALUE
>     In the event of an error, all operations return -1 and set errno
>     to indicate the cause of the error. The return value on success
>     depends on the operation, as described in the following list:

Did you say (at the beginning) that there is no glibc wrapper?

>     EINVAL The operation in futex_op is one of those that employs a
>            timeout, but the supplied timeout argument was invalid
>            (tv_sec was less than zero, or tv_nsec was not less than
>            1000,000,000).

1,000...?

> NOTES
>     Glibc does not provide a wrapper for this system call; call it
>     using syscall(2).

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
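The error convention and the EINVAL case quoted above can be sketched as follows, assuming the call is made through syscall(2) on a 64-bit Linux target. The helper name `futex_wait_errno()` is hypothetical; it turns the "-1 plus errno" convention into a single return value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: perform FUTEX_WAIT via syscall(2) and return
 * 0 on success or the errno value on failure. syscall(2) itself
 * reports failure by returning -1 and setting errno.
 */
static int futex_wait_errno(uint32_t *uaddr, uint32_t expected,
                            const struct timespec *timeout)
{
    /* The kernel rejects a futex word that is not 4-byte aligned. */
    assert(((uintptr_t)uaddr % sizeof(uint32_t)) == 0);

    errno = 0;
    long rc = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                      timeout, NULL, 0);
    return rc == -1 ? errno : 0;
}
```

A timeout with tv_nsec equal to 1,000,000,000, or with a negative tv_sec, makes the helper return EINVAL. A valid short timeout on a word that nobody wakes makes it return ETIMEDOUT.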
Re: Revised futex(2) man page for review
On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote: > On Sat, 28 Mar 2015, Peter Zijlstra wrote: > > On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote: > > > So, please take a look at the page below. At this point, > > > I would most especially appreciate help with the FIXMEs. > > > > For people who cannot read that troff gibberish (me).. > > Ditto :) > > > NOTES > >Glibc does not provide a wrapper for this system call; call it > > using > >syscall(2). > > You might mention that pthread_mutex, pthread_condvar interfaces are > high level wrappers for the syscall and recommended to be used for > normal use cases. IIRC unnamed semaphores are implemented with futexes > as well. If we add this, I'd rephrase it to something like that there are high-level programming abstractions such as the pthread_condvar interfaces or semaphores that are implemented using the syscall and that are typically a better fit for normal use cases. I'd consider only the condvars as something like a wrapper, or targeting a similar use case. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
On Sat, 28 Mar 2015, Peter Zijlstra wrote: > On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote: > > So, please take a look at the page below. At this point, > > I would most especially appreciate help with the FIXMEs. > > For people who cannot read that troff gibberish (me).. Ditto :) > NOTES >Glibc does not provide a wrapper for this system call; call it using >syscall(2). You might mention that pthread_mutex, pthread_condvar interfaces are high level wrappers for the syscall and recommended to be used for normal use cases. IIRC unnamed semaphores are implemented with futexes as well. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
On Sat, 2015-03-28 at 13:03 +0100, Peter Zijlstra wrote: > > If the timeout argument is non-NULL, its contents specify a > > rel‐ > > ative timeout for the wait, measured according to > > the > > CLOCK_MONOTONIC clock. (This interval will be rounded up to > > the > > system clock granularity, and kernel scheduling delays mean > > that > > the blocking interval may overrun by a small amount.) If > > time‐ > > out is NULL, the call blocks indefinitely. > > Would it not be better to only state that the wait will not return > before the timeout -- unless woken -- and not bother with clock > granularity and scheduling delays? Yeah, similarly we also have this: FUTEX_PRIVATE_FLAG (since Linux 2.6.22) This option bit can be employed with all futex operations. It tells the kernel that the futex is process-private and not shared with another process (i.e., it is only being used for synchronization between threads of the same process). This allows the kernel to choose the fast path for validating the user-space address and avoids expensive VMA lookups, taking ref‐ erence counts on file backing store, and so on. This to me reads a bit too much into the kernel (fastpath, refcnt, vmas). Why not just mention that it avoids overhead in the kernel or something? I don't recall any manpage mentioning such details, but I could be wrong. In any case its a nit, the whole doc is pretty good and I hope you can merge it soon and then just increment ;) Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote: >The condition is represented by the futex word, which is an address in >memory supplied to the futex() system call, and the value at this mem‐ >ory location. (While the virtual addresses for the same memory in sep‐ >arate processes may not be equal, the kernel maps them internally so >that the same memory mapped in different locations will correspond for >futex() calls.) > >When executing a futex operation that requests to block a thread, the >kernel will only block if the futex word has the value that the calling Given the use of "word", you should probably state right away that futexes are only 32bit. Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
"Michael Kerrisk (man-pages)" writes: > When executing a futex operation that requests to block a thread, > the kernel will only block if the futex word has the value that the > calling thread supplied as expected value. > The load from the futex word, the comparison with > the expected value, > and the actual blocking will happen atomically and totally > ordered with respect to concurrently executing futex operations > on the same futex word, > such as operations that wake threads blocked on this futex word. > Thus, the futex word is used to connect the synchronization in user spac Missing 'e' in "space". > .\" FIXME Please confirm that the following is correct: > No guarantee is provided about which waiters are awoken > (e.g., a waiter with a higher scheduling priority is not guaranteed > to be awoken in preference to a waiter with a lower priority). This is true. I didn't read the rest, as that stuff was all written by others. Documenting them is pretty heroic; good job! Thanks, Rusty. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised futex(2) man page for review
On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote:
> FUTEX_WAIT (since Linux 2.6.0)
>     This operation tests that the value at the futex word pointed to
>     by the address uaddr still contains the expected value val, and
>     if so, then sleeps awaiting FUTEX_WAKE on the futex word. The
>     load of the value of the futex word is an atomic memory access
>     (i.e., using atomic machine instructions of the respective
>     architecture). This load, the comparison with the expected
>     value, and starting to sleep are performed atomically and
>     totally ordered with respect to other futex operations on the
>     same futex word. If the thread starts to sleep, it is considered
>     a waiter on this futex word. If the futex value does not
>     match val, then the call fails immediately with the error
>     EAGAIN.
>
>     The purpose of the comparison with the expected value is to
>     prevent lost wake-ups: If another thread changed the value of the
>     futex word after the calling thread decided to block based on
>     the prior value, and if the other thread executed a FUTEX_WAKE
>     operation (or similar wake-up) after the value change and before
>     this FUTEX_WAIT operation, then the latter will observe the
>     value change and will not start to sleep.
>
>     If the timeout argument is non-NULL, its contents specify a
>     relative timeout for the wait, measured according to the
>     CLOCK_MONOTONIC clock. (This interval will be rounded up to the
>     system clock granularity, and kernel scheduling delays mean that
>     the blocking interval may overrun by a small amount.) If
>     timeout is NULL, the call blocks indefinitely.

Would it not be better to only state that the wait will not return
before the timeout -- unless woken -- and not bother with clock
granularity and scheduling delays?
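For reference while discussing the wording, a minimal sketch of a FUTEX_WAIT call with a relative timeout follows. The helper name wait_with_timeout and its millisecond parameter are hypothetical. The sketch assumes a target where struct timespec matches what the raw futex system call expects, which is the case on 64-bit Linux.

```c
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep on *uaddr for at most timeout_ms
 * milliseconds, but only if *uaddr still equals expected.
 * Returns 0 if woken by FUTEX_WAKE; otherwise -1 with errno set to
 * EAGAIN (value mismatch), ETIMEDOUT (timeout expired), or EINTR. */
static int wait_with_timeout(uint32_t *uaddr, uint32_t expected,
                             long timeout_ms)
{
    struct timespec ts;

    assert(timeout_ms >= 0);
    ts.tv_sec = timeout_ms / 1000;               /* relative, not absolute */
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    return (int)syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                        &ts, NULL, 0);
}
```

With the suggested wording, a caller can rely on only one thing about the ETIMEDOUT case: the call did not return before the requested interval had elapsed.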
Re: Revised futex(2) man page for review
On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

For people who cannot read that troff gibberish (me)..

---

FUTEX(2)              Linux Programmer's Manual              FUTEX(2)

NAME

futex - fast user-space locking

SYNOPSIS

#include
#include

int futex(int *uaddr, int futex_op, int val,
          const struct timespec *timeout,   /* or: u32 val2 */
          int *uaddr2, int val3);

Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION

The futex() system call provides a method for waiting until a certain
condition becomes true. It is typically used as a blocking construct in
the context of shared-memory synchronization: The program implements the
majority of the synchronization in user space, and uses one of operations
of the system call when it is likely that it has to block for a longer
time until the condition becomes true. The program uses another operation
of the system call to wake anyone waiting for a particular condition.

The condition is represented by the futex word, which is an address in
memory supplied to the futex() system call, and the value at this memory
location. (While the virtual addresses for the same memory in separate
processes may not be equal, the kernel maps them internally so that the
same memory mapped in different locations will correspond for futex()
calls.)

When executing a futex operation that requests to block a thread, the
kernel will only block if the futex word has the value that the calling
thread supplied as expected value. The load from the futex word, the
comparison with the expected value, and the actual blocking will happen
atomically and totally ordered with respect to concurrently executing
futex operations on the same futex word, such as operations that wake
threads blocked on this futex word.
Thus, the futex word is used to connect the synchronization in user spac
with the implementation of blocking by the kernel; similar to an atomic
compare-and-exchange operation that potentially changes shared memory,
blocking via a futex is an atomic compare-and-block operation. See NOTES
for a detailed specification of the synchronization semantics.

One example use of futexes is implementing locks. The state of the lock
(i.e., acquired or not acquired) can be represented as an atomically
accessed flag in shared memory. In the uncontended case, a thread can
access or modify the lock state with atomic instructions, for example
atomically changing it from not acquired to acquired using an atomic
compare-and-exchange instruction. If a thread cannot acquire a lock
because it is already acquired by another thread, it can request to block
if and only the lock is still acquired by using the lock's flag as futex
word and expecting a value that represents the acquired state. When
releasing the lock, a thread has to first reset the lock state to not
acquired and then execute the futex operation that wakes one thread
blocked on the futex word that is the lock's flag (this can be be further
optimized to avoid unnecessary wake-ups). See futex(7) for more detail on
how to use futexes.

Besides the basic wait and wake-up futex functionality, there are further
futex operations aimed at supporting more complex use cases. Also note
that no explicit initialization or destruction are necessary to use
futexes; the kernel maintains a futex (i.e., the kernel-internal
implementation artifact) only while operations such as FUTEX_WAIT,
described below, are being performed on a particular futex word.

Arguments

The uaddr argument points to the futex word. On all platforms, futexes
are four-byte integers that must be aligned on a four-byte boundary. The
operation to perform on the futex is specified in the futex_op argument;
val is a value whose meaning and purpose depends on futex_op.
The remaining arguments (timeout, uaddr2, and val3) are required only for
certain of the futex operations described below. Where one of these
arguments is not required, it is ignored.

For several blocking operations, the timeout argument is a pointer to a
timespec structure that specifies a timeout for the operation. However,
notwithstanding the prototype shown above, for some operations, this
argument is instead a four-byte integer whose meaning is deter‐
Re: Revised futex(2) man page for review
On 03/28/2015 09:53 AM, Michael Kerrisk (man-pages) wrote:
> Hello all,
[...]
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

One more point I should have added. The revised page currently sits in
a Git branch, here:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Revised futex(2) man page for review
Hello all,

As becomes quickly obvious upon reading it, the current futex(2) man
page is in a sorry state, lacking many important details, and also the
various additions that have been made to the interface over the last
years.

I've been working on revising it, first of all based on input I got in
response to a request for help last year
(http://thread.gmane.org/gmane.linux.kernel/1703405), especially taking
Thomas Gleixner's input
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) into
account. I also got some further off-list input from Darren Hart,
Torvald Riegel, and Davidlohr Bueso that has been incorporated into the
revised draft. Other than that, I got some useful info out of Ulrich
Drepper's paper (cited at the end of the page) and one or two web pages
(cited in the page source).

The page has now increased in size by a factor of about 5, but is far
from complete. In particular, as I reworked the page, there were many
details that I was not 100% certain of, and I have added FIXME markers
to the page source. In addition, Torvald added some text, and a few
more FIXMEs. Some of the FIXMEs are trivial, as in: I'd like
confirmation that I have correctly captured a technical detail. Others
are more substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be improved
in the page. (Torvald and Darren have some ideas.) However, before
growing the page any further, I would like to resolve as many of the
FIXMEs (and any other problems that people see) as possible in the
existing text. I need help with that. (And I know that dealing with
that help, if I get it, will in itself be quite a task, which is why I
have been delaying it for many weeks now, as my time has been rather
limited recently.)

So, please take a look at the page below. At this point, I would most
especially appreciate help with the FIXMEs.
Cheers,

Michael

=

.\" Page by b.hubert
.\" and Copyright (C) 2015, Thomas Gleixner
.\" and Copyright (C) 2015, Michael Kerrisk
.\"
.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
.\" may be freely modified and distributed
.\" %%%LICENSE_END
.\"
.\" Niki A. Rahimi (LTC Security Development, narah...@us.ibm.com)
.\" added ERRORS section.
.\"
.\" Modified 2004-06-17 mtk
.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
.\"
.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
.\" 2015-01-23:
.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
.\"
.\" FIXME Do we need add some text regarding Torvald Riegel's 2015-01-24 mail
.\" at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
.\"
.TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual"
.SH NAME
futex \- fast user-space locking
.SH SYNOPSIS
.nf
.sp
.B "#include "
.B "#include "
.sp
.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
.BI " const struct timespec *" timeout , \
" \fR /* or: \fBu32 \fIval2\fP */
.BI " int *" uaddr2 ", int " val3 );
.fi

.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
.PP
The
.BR futex ()
system call provides a method for waiting until a certain condition
becomes true.
It is typically used as a blocking construct in the context of
shared-memory synchronization:
The program implements the majority of the synchronization in user space,
and uses one of operations of the system call when it is likely that it
has to block for a longer time until the condition becomes true.
The program uses another operation of the system call to wake
anyone waiting for a particular condition.

The condition is represented by the futex word, which is an address
in memory supplied to the
.BR futex ()
system call, and the value at this memory location.
(While the virtual addresses for the same memory in separate
processes may not be equal,
the kernel maps them internally so that the same memory mapped
in different locations will correspond for
.BR futex ()
calls.)

When executing a futex operation that requests to block a thread,
the kernel will only block if the futex word has the value that the
calling thread supplied as expected value.
The load from the futex word, the comparison with
the expected value,
and the actual blocking will happen atomically and totally
ordered with respect to concurrently executing futex operations
on the same futex word,
such as operations that wake threads blocked on this futex word.
Thus, the futex word is used to connect the synchronization in user spac
with the implementation of blocking by the kernel; similar to an atomic
compare-and-exchange operation that potentially changes shared memory,
blocking via a futex is an atomic compare-and-block operation.
See NOTES for a detailed specification of the synchronization semantics.

One example use of futexes is implementing locks.
The state of the lock (i.e., acquired or not acquired)
can be represented as an atom