Re: Next round: revised futex(2) man page for review

2015-10-08 Thread Darren Hart
On Wed, Oct 07, 2015 at 10:34:19AM +0100, Michael Kerrisk (man-pages) wrote:
> On 08/19/2015 03:40 PM, Thomas Gleixner wrote:
> > On Wed, 5 Aug 2015, Darren Hart wrote:
> >> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> >>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> >>> .\"   The following text is drawn from the Hart/Guniguntala paper
> >>> .\"   (listed in SEE ALSO), but I have reworded some pieces
> >>> .\"   significantly. Please check it.
> >>>
> >>>The PI futex operations described below  differ  from  the  other
> >>>futex  operations  in  that  they impose policy on the use of the
> >>>value of the futex word:
> >>>
> >>>*  If the lock is not acquired, the futex word's value  shall  be
> >>>   0.
> >>>
> >>>*  If  the  lock is acquired, the futex word's value shall be the
> >>>   thread ID (TID; see gettid(2)) of the owning thread.
> >>>
> >>>*  If the lock is owned and there are threads contending for  the
> >>>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >>>   word's value; in other words, this value is:
> >>>
> >>>   FUTEX_WAITERS | TID
> >>>
> >>>
> >>>Note that a PI futex word never just has the value FUTEX_WAITERS,
> >>>which is a permissible state for non-PI futexes.
> >>
> >> The second clause is inappropriate. I don't know if that was yours or
> >> mine, but non-PI futexes do not have a kernel defined value policy, so
> >> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> >> permissible for non-PI futexes, and none have a kernel defined state.
> > 
> > Depends. If the regular futex is configured as robust, then we have a
> > kernel defined value policy as well.
> 

Right.

> Okay -- so do we need a change to the text here?

Hrm. We probably need a way to indicate that the kernel-defined futex word
value policy applies only to PI and/or robust futexes.
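
To make the PI value policy quoted above concrete, here is a minimal sketch
(not part of the proposed man page text) of how user space might classify a
PI futex word. The names classify_pi_word, pi_word_state and the SKETCH_*
constants are made up for illustration; the constant values mirror
FUTEX_WAITERS and FUTEX_TID_MASK from <linux/futex.h>. The sketch ignores
FUTEX_OWNER_DIED, which only matters for robust futexes.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror FUTEX_WAITERS and FUTEX_TID_MASK in <linux/futex.h>;
 * redefined under hypothetical names so the sketch stands alone. */
#define SKETCH_FUTEX_WAITERS  0x80000000u
#define SKETCH_FUTEX_TID_MASK 0x3fffffffu

static_assert((SKETCH_FUTEX_WAITERS & SKETCH_FUTEX_TID_MASK) == 0,
              "waiters bit must not overlap the TID field");

enum pi_word_state {
    PI_UNLOCKED,        /* word == 0 */
    PI_OWNED,           /* word == TID */
    PI_OWNED_CONTENDED, /* word == FUTEX_WAITERS | TID */
    PI_INVALID          /* e.g. FUTEX_WAITERS set but no owner TID */
};

/* Classify a PI futex word according to the three rules in the list. */
static enum pi_word_state classify_pi_word(uint32_t word)
{
    uint32_t tid = word & SKETCH_FUTEX_TID_MASK;
    int waiters = (word & SKETCH_FUTEX_WAITERS) != 0;

    if (word == 0)
        return PI_UNLOCKED;
    if (tid == 0)
        return PI_INVALID;   /* a PI word never holds flag bits alone */
    return waiters ? PI_OWNED_CONTENDED : PI_OWNED;
}
```

With this, classify_pi_word(FUTEX_WAITERS) lands in PI_INVALID, which is the
"never just has the value FUTEX_WAITERS" rule from the quoted text.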


> 
> >>> .\" FIXME I'm not quite clear on the meaning of the following sentence.
> >>> .\"   Is this trying to say that while blocked in a
> >>> .\"   FUTEX_WAIT_REQUEUE_PI, it could happen that another
> >>> .\"   task does a FUTEX_WAKE on uaddr that simply causes
> >>> .\"   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
> >>> .\"   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
> >>> .\"   operation? Does it remain blocked, or does it unblock?
> >>> .\"   In which case, what does user space see?
> >>>
> >>>   The
> >>>   waiter   can  be  removed  from  the  wait  on  uaddr  via
> >>>   FUTEX_WAKE without requeueing on uaddr2.
> >>
> >> Userspace should see the task wake and continue executing. This would
> >> effectively be a cancelation operation - which I didn't think was
> >> supported. Thomas?
> > 
> > We probably never intended to support it, but looking at the code it
> > works (did not try it though). It returns to user space with
> > -EWOULDBLOCK. So it basically behaves like any other spurious wakeup.
> 
> Again, I assume no changes are required to the man page(?).

I'd rather not document this as supported or intended behavior.
FUTEX_WAIT_REQUEUE_PI is documented as being paired with and only with
FUTEX_CMP_REQUEUE_PI. Anything else is undefined behavior.

If we want to support a cancelation, it should be deliberate - and we should
probably test it ;-)


-- 
Darren Hart
Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Next round: revised futex(2) man page for review

2015-10-08 Thread Darren Hart
On Wed, Oct 07, 2015 at 09:30:46AM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Thomas,
> 
> Thanks for the follow up!
> 
> Some open questions below are marked with the string ###.

A couple of comments from me below, although I suspect you have this much
covered already.

> 
> On 08/19/2015 04:17 PM, Thomas Gleixner wrote:
> > On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
> FUTEX_CMP_REQUEUE (since Linux 2.6.7)
>    This  operation  first  checks  whether the location uaddr
>    still contains the value  val3.   If  not,  the  operation
>    fails  with  the  error  EAGAIN.  Otherwise, the operation
>    wakes up a maximum of val waiters that are waiting on  the
>    futex  at uaddr.  If there are more than val waiters, then
>    the remaining waiters are removed from the wait  queue  of
>    the  source  futex at uaddr and added to the wait queue of
>    the target futex at uaddr2.  The val2  argument  specifies
>    an  upper limit on the number of waiters that are requeued
>    to the futex at uaddr2.
> 
>  .\" FIXME(Torvald) Is the following correct?  Or is just the decision
>  .\" which threads to wake or requeue part of the atomic operation?
> 
>    The load from uaddr is  an  atomic  memory  access  (i.e.,
>    using atomic machine instructions of the respective archi‐
>    tecture).  This load, the comparison with  val3,  and  the
>    requeueing  of  any  waiters  are performed atomically and
>    totally ordered with respect to other  operations  on  the
>    same futex word.
> >>>
> >>> It's atomic as the other atomic operations on the futex word. It's
> >>> always performed with the proper lock(s) held in the kernel. That
> >>> means any concurrent operation will serialize on that lock(s). User
> >>> space has to make sure, that depending on the observed value no
> >>> concurrent operations happen, but that's something the kernel cannot
> >>> control.
> >>
> >> ???
> >> Sorry, I'm not clear here. Is the current text correct then? Or is some
> >> change needed.
> > 
> > I think we need some change here because the meaning of atomic is
> > unclear. The basic rules of futexes are:
> > 
> >  - All modifying operations on the futex value have to be done with
> >    atomic instructions, usually cmpxchg. That applies to both kernel
> >    and user space.
> > 
> >    That's the atomicity at the futex value level.
> > 
> >  - In the kernel we have to create/modify/destroy state in order to
> >    provide the blocking/requeueing etc.
> > 
> >    This state needs protection as well. So all operations related to
> >    the kernel internal state are serialized on the hash bucket
> >    locks. The hash buckets are a scalability mechanism to avoid
> >    contention on a single lock protecting all kernel internal
> >    state. For simplicity reasons you can just think of a global lock
> >    protecting all kernel internal state.
> > 
> >    If the kernel creates/modifies state then it can be necessary to
> >    either reread the futex value or modify it. That happens under the
> >    locks as well.
> > 
> >    So in the case of requeue, we take the proper locks and perform the
> >    comparison with val3 and the requeueing with the locks held.
> >
> >    So that lock protection makes these operations 'atomic'. The
> >    correct expression is 'serialized'.
> 
> ###
> So, here, I think I need some specific pointers on the precise text
> changes that are required. Let's talk about this f2f. For convenience,
> here's the relevant text once again quoted:

Not speaking for tglx, but I think the point here is to distinguish between
atomic (as in cmpxchg comparison tests performed on the futex word) and
serialized (as in the management of futex hashbuckets and task states).
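
As a rough illustration of the first (atomic) half of that distinction, the
user-space side might look like the sketch below. It assumes C11 atomics and
a lock whose word is 0 when free; try_acquire and try_release are
hypothetical helpers, not anything from the man page. The serialized half
happens inside the kernel under the hash bucket locks and has no direct
user-space counterpart.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to take a free lock by atomically changing the futex word from 0
 * to the caller's ID. A compare-and-exchange is used so that a
 * concurrent modifier can never be silently overwritten. */
static bool try_acquire(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;

    assert(tid != 0);   /* 0 is reserved for "unlocked" */
    return atomic_compare_exchange_strong(word, &expected, tid);
}

/* Release only if the word still holds exactly our ID; if other bits
 * were set in the meantime, the exchange fails and the caller would
 * have to take a slow path (not shown). */
static bool try_release(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;

    return atomic_compare_exchange_strong(word, &expected, 0);
}
```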

> 
>FUTEX_CMP_REQUEUE (since Linux 2.6.7)
>   This  operation  first  checks  whether the location uaddr
>   still contains the value  val3.   If  not,  the  operation
>   fails  with  the  error  EAGAIN.  Otherwise, the operation

Here you might explain the _CMP_ qualifier and note atomicity of the operation:

The _CMP_ refers to the verification of the userspace state as specified
through the arguments. This operation first atomically compares the value at
uaddr with the value val3 ...
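
For reference, a minimal sketch of how the call might be issued from user
space, assuming the raw syscall(2) interface (glibc provides no futex
wrapper). futex_cmp_requeue is a hypothetical helper name; note that val2
travels in the argument slot that other operations use for the timeout.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wake up to nr_wake waiters on uaddr and requeue up to nr_requeue of
 * the remaining ones to uaddr2, but only if *uaddr still equals val3.
 * Returns the syscall result: -1 with errno set on failure (EAGAIN if
 * the comparison with val3 fails). */
static long futex_cmp_requeue(uint32_t *uaddr, int nr_wake, int nr_requeue,
                              uint32_t *uaddr2, uint32_t val3)
{
    assert(nr_wake >= 0 && nr_requeue >= 0);
    return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE, nr_wake,
                   (long) nr_requeue, uaddr2, val3);
}
```

A caller that gets EAGAIN back would typically reread the futex word and
decide whether to retry with the new value.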


>   wakes up a maximum of val waiters that are waiting on  the
>   futex  at uaddr.  If there are more than val waiters, then
>   the remaining waiters are removed from the wait  queue  of
>   the  source  futex at uaddr and added to the wait queue of
>   the target futex at uaddr2.  The val2  argument  specifies
>   an  upper limit on the 

Re: Next round: revised futex(2) man page for review

2015-10-07 Thread Michael Kerrisk (man-pages)
On 08/19/2015 03:40 PM, Thomas Gleixner wrote:
> On Wed, 5 Aug 2015, Darren Hart wrote:
>> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
>>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
>>> .\"   The following text is drawn from the Hart/Guniguntala paper
>>> .\"   (listed in SEE ALSO), but I have reworded some pieces
>>> .\"   significantly. Please check it.
>>>
>>>The PI futex operations described below  differ  from  the  other
>>>futex  operations  in  that  they impose policy on the use of the
>>>value of the futex word:
>>>
>>>*  If the lock is not acquired, the futex word's value  shall  be
>>>   0.
>>>
>>>*  If  the  lock is acquired, the futex word's value shall be the
>>>   thread ID (TID; see gettid(2)) of the owning thread.
>>>
>>>*  If the lock is owned and there are threads contending for  the
>>>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>>>   word's value; in other words, this value is:
>>>
>>>   FUTEX_WAITERS | TID
>>>
>>>
>>>Note that a PI futex word never just has the value FUTEX_WAITERS,
>>>which is a permissible state for non-PI futexes.
>>
>> The second clause is inappropriate. I don't know if that was yours or
>> mine, but non-PI futexes do not have a kernel defined value policy, so
>> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
>> permissible for non-PI futexes, and none have a kernel defined state.
> 
> Depends. If the regular futex is configured as robust, then we have a
> kernel defined value policy as well.

Okay -- so do we need a change to the text here?

>>> .\" FIXME I'm not quite clear on the meaning of the following sentence.
>>> .\"   Is this trying to say that while blocked in a
>>> .\"   FUTEX_WAIT_REQUEUE_PI, it could happen that another
>>> .\"   task does a FUTEX_WAKE on uaddr that simply causes
>>> .\"   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
>>> .\"   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
>>> .\"   operation? Does it remain blocked, or does it unblock?
>>> .\"   In which case, what does user space see?
>>>
>>>   The
>>>   waiter   can  be  removed  from  the  wait  on  uaddr  via
>>>   FUTEX_WAKE without requeueing on uaddr2.
>>
>> Userspace should see the task wake and continue executing. This would
>> effectively be a cancelation operation - which I didn't think was
>> supported. Thomas?
> 
> We probably never intended to support it, but looking at the code it
> works (did not try it though). It returns to user space with
> -EWOULDBLOCK. So it basically behaves like any other spurious wakeup.

Again, I assume no changes are required to the man page(?).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Next round: revised futex(2) man page for review

2015-10-07 Thread Michael Kerrisk (man-pages)
Hello Thomas,

Thanks for the follow up!

Some open questions below are marked with the string ###.

On 08/19/2015 04:17 PM, Thomas Gleixner wrote:
> On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
FUTEX_CMP_REQUEUE (since Linux 2.6.7)
   This  operation  first  checks  whether the location uaddr
   still contains the value  val3.   If  not,  the  operation
   fails  with  the  error  EAGAIN.  Otherwise, the operation
   wakes up a maximum of val waiters that are waiting on  the
   futex  at uaddr.  If there are more than val waiters, then
   the remaining waiters are removed from the wait  queue  of
   the  source  futex at uaddr and added to the wait queue of
   the target futex at uaddr2.  The val2  argument  specifies
   an  upper limit on the number of waiters that are requeued
   to the futex at uaddr2.

 .\" FIXME(Torvald) Is the following correct?  Or is just the decision
 .\" which threads to wake or requeue part of the atomic operation?

   The load from uaddr is  an  atomic  memory  access  (i.e.,
   using atomic machine instructions of the respective archi‐
   tecture).  This load, the comparison with  val3,  and  the
   requeueing  of  any  waiters  are performed atomically and
   totally ordered with respect to other  operations  on  the
   same futex word.
>>>
>>> It's atomic as the other atomic operations on the futex word. It's
>>> always performed with the proper lock(s) held in the kernel. That
>>> means any concurrent operation will serialize on that lock(s). User
>>> space has to make sure, that depending on the observed value no
>>> concurrent operations happen, but that's something the kernel cannot
>>> control.
>>
>> ???
>> Sorry, I'm not clear here. Is the current text correct then? Or is some
>> change needed.
> 
> I think we need some change here because the meaning of atomic is
> unclear. The basic rules of futexes are:
> 
>  - All modifying operations on the futex value have to be done with
>    atomic instructions, usually cmpxchg. That applies to both kernel
>    and user space.
> 
>    That's the atomicity at the futex value level.
> 
>  - In the kernel we have to create/modify/destroy state in order to
>    provide the blocking/requeueing etc.
> 
>    This state needs protection as well. So all operations related to
>    the kernel internal state are serialized on the hash bucket
>    locks. The hash buckets are a scalability mechanism to avoid
>    contention on a single lock protecting all kernel internal
>    state. For simplicity reasons you can just think of a global lock
>    protecting all kernel internal state.
> 
>    If the kernel creates/modifies state then it can be necessary to
>    either reread the futex value or modify it. That happens under the
>    locks as well.
> 
>    So in the case of requeue, we take the proper locks and perform the
>    comparison with val3 and the requeueing with the locks held.
>
>    So that lock protection makes these operations 'atomic'. The
>    correct expression is 'serialized'.
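
One way to picture the "serialized" rule above is a user-space toy model,
following the suggestion to think of a single global lock. This is only a
sketch under that simplification, not kernel code: toy_futex, toy_lock and
toy_cmp_requeue are invented names, waiters are reduced to a counter, and
the real kernel uses per-bucket locks and an atomic load of the word.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Toy stand-in for a futex: the user-visible word plus a waiter count
 * representing the kernel-internal wait queue. */
struct toy_futex {
    uint32_t word;
    int nwaiters;
};

/* One global lock standing in for the hash bucket locks. */
static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Compare, wake and requeue as one serialized step: nothing else that
 * takes toy_lock can observe a half-finished requeue. Returns the number
 * of waiters woken plus requeued, or -EAGAIN if the word != val3. */
static int toy_cmp_requeue(struct toy_futex *src, struct toy_futex *dst,
                           int nr_wake, int nr_requeue, uint32_t val3)
{
    int woken, moved;

    assert(nr_wake >= 0 && nr_requeue >= 0);
    pthread_mutex_lock(&toy_lock);
    if (src->word != val3) {
        pthread_mutex_unlock(&toy_lock);
        return -EAGAIN;
    }
    woken = src->nwaiters < nr_wake ? src->nwaiters : nr_wake;
    src->nwaiters -= woken;
    moved = src->nwaiters < nr_requeue ? src->nwaiters : nr_requeue;
    src->nwaiters -= moved;
    dst->nwaiters += moved;
    pthread_mutex_unlock(&toy_lock);
    return woken + moved;
}
```

The point of the model is only that the comparison and the queue
manipulation sit inside the same critical section; user space can still
change the word between calls, which the lock cannot prevent.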

###
So, here, I think I need some specific pointers on the precise text
changes that are required. Let's talk about this f2f. For convenience,
here's the relevant text once again quoted:

   FUTEX_CMP_REQUEUE (since Linux 2.6.7)
  This  operation  first  checks  whether the location uaddr
  still contains the value  val3.   If  not,  the  operation
  fails  with  the  error  EAGAIN.  Otherwise, the operation
  wakes up a maximum of val waiters that are waiting on  the
  futex  at uaddr.  If there are more than val waiters, then
  the remaining waiters are removed from the wait  queue  of
  the  source  futex at uaddr and added to the wait queue of
  the target futex at uaddr2.  The val2  argument  specifies
  an  upper limit on the number of waiters that are requeued
  to the futex at uaddr2.

  The load from uaddr is  an  atomic  memory  access  (i.e.,
  using atomic machine instructions of the respective archi‐
  tecture).  This load, the comparison with  val3,  and  the
  requeueing  of  any  waiters  are performed atomically and
  totally ordered with respect to other  operations  on  the
  same futex word.


 .\" FIXME We need some explanation in the following paragraph of *why*
 .\"   it is important to note that "the kernel will update the
 .\"   futex word's value prior to returning to user space". Can someone
 .\"   explain?
It is important to note that the kernel will update the futex word's value
prior 

Re: Next round: revised futex(2) man page for review

2015-08-26 Thread Darren Hart
On Thu, Aug 20, 2015 at 01:17:03AM +0200, Thomas Gleixner wrote:

...

> > >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> > >> .\"   made the observation that "EINVAL is returned if the non-pi 
> > >> .\"   to pi or op pairing semantics are violated."
> > >> .\"   Probably there needs to be a general statement about this
> > >> .\"   requirement, probably located at about this point in the page.
> > >> .\"   Darren (or someone else), care to take a shot at this?
> > > 
> > > Well, that's hard to describe because the kernel only has a limited
> > > way of detecting such mismatches. It only can detect it when there are
> > > non PI waiters on a futex and a PI function is called or vice versa.
> > 
> > Hmmm. Okay, I filed your comments away for reference, but
> > hopefully someone can help with some actual text.
> 
> I let Darren come up with something sensible :)

Heh, right, no pressure then...

I responded to Michael on this recently, copied here for reference:


FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such
as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match
(meaning it's the same futex word - shared memory, inode, etc.). This can't
happen if the stated policy of requeueing from non-pi to pi is followed as the
same word cannot be both non-pi and pi at the same time, requiring them to be
unique futex words.

FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex
word. Also, if nr_wake != 1.

But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must requeue uaddr to
the same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call.
FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it,
and they must agree on uaddr and uaddr2.
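
The argument rules above can be sketched from the caller's side. This is
only an illustration, not kernel code: requeue_pi_check() and
cmp_requeue_pi() are hypothetical names, and the check can only compare
addresses, so it cannot detect two different mappings of the same SHARED
futex word (the kernel compares keys for that).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>   /* FUTEX_CMP_REQUEUE_PI */
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/* Hypothetical caller-side check mirroring the rules described above:
 * uaddr and uaddr2 must be distinct futex words, and nr_wake must be 1.
 * Returns 0 if the arguments look sane, -EINVAL otherwise. */
static int requeue_pi_check(const uint32_t *uaddr, const uint32_t *uaddr2,
                            int nr_wake)
{
    assert(uaddr != NULL && uaddr2 != NULL);
    if (uaddr == uaddr2)
        return -EINVAL;     /* same word cannot be both non-PI and PI */
    if (nr_wake != 1)
        return -EINVAL;     /* exactly one waiter is woken */
    return 0;
}

/* Hypothetical waker-side wrapper. uaddr and uaddr2 must be the same pair
 * the waiters passed to FUTEX_WAIT_REQUEUE_PI; val3 is the value expected
 * at uaddr. Follows the syscall convention: -1 with errno set on error. */
static long cmp_requeue_pi(uint32_t *uaddr, uint32_t *uaddr2,
                           int nr_requeue, uint32_t val3)
{
    int rc = requeue_pi_check(uaddr, uaddr2, 1);
    if (rc != 0) {
        errno = -rc;
        return -1;
    }
    return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE_PI,
                   1L, (long)nr_requeue, uaddr2, (long)val3);
}
```

Doing the check in user space is optional, since the kernel reports the
same conditions with EINVAL; it just makes the pairing rule explicit at
the call site.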


Michael, are you still looking for something more from me, or is this FIXME now
complete?



-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-08-24 Thread Darren Hart
On Sat, Aug 08, 2015 at 08:57:35AM +0200, Michael Kerrisk (man-pages) wrote:

...

> >> .\" FIXME = End of adapted Hart/Guniguntala text =
> >>
> >>
> >>
> >> .\" FIXME We need some explanation in the following paragraph of *why*
> >> .\"   it is important to note that "the kernel will update the
> >> .\"   futex word's value prior to returning to user space". Can someone
> >> .\"   explain?
> >>
> >>It is important to note that the kernel will update the futex
> >>word's value prior to returning to user space.  Unlike the other
> >>futex operations described above, the PI futex operations are
> >>designed for the implementation of very specific IPC mechanisms.
> > 
> > If the kernel didn't perform the update prior to returning to userspace,
> > we could end up in an invalid state. Such as having an owner, but the
> > value being 0. Or having waiters, but not having FUTEX_WAITERS set.
> 
> So I've now reworked this passage to read:
> 
>It  is  important  to  note that the kernel will update the futex
>word's value prior to returning to user  space.   (This  prevents
>the possibility of the futex word's value ending up in an invalid
>state, such as having an owner but the value being 0,  or  having
>waiters but not having the FUTEX_WAITERS bit set.)
> 
> Okay?

Yes.
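
For illustration, the valid and invalid states mentioned in that passage
can be written down as a small classifier. This is a sketch under
assumptions: the enum and function names are hypothetical, only
FUTEX_WAITERS and FUTEX_TID_MASK come from <linux/futex.h>, and the
FUTEX_OWNER_DIED bit used by robust futexes is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/futex.h>   /* FUTEX_WAITERS, FUTEX_TID_MASK */

/* Hypothetical names, for illustration only. */
enum pi_word_state {
    PI_UNLOCKED,          /* no owner TID, no waiters bit */
    PI_OWNED,             /* value is the owner's TID */
    PI_OWNED_CONTENDED,   /* value is FUTEX_WAITERS | TID */
    PI_INVALID            /* FUTEX_WAITERS set but no owner TID */
};

/* Classify a PI futex word according to the value policy quoted above.
 * The FUTEX_OWNER_DIED bit is deliberately not examined here. */
static enum pi_word_state pi_word_classify(uint32_t val)
{
    uint32_t tid = val & FUTEX_TID_MASK;
    int waiters = (val & FUTEX_WAITERS) != 0;

    if (tid == 0)
        return waiters ? PI_INVALID : PI_UNLOCKED;
    return waiters ? PI_OWNED_CONTENDED : PI_OWNED;
}

/* Build the word value for a lock owned by tid. */
static uint32_t pi_word_make(uint32_t tid, int contended)
{
    assert(tid != 0 && (tid & ~(uint32_t)FUTEX_TID_MASK) == 0);
    return contended ? (FUTEX_WAITERS | tid) : tid;
}
```

The PI_INVALID case is the "waiters but no owner" state that the kernel
avoids by updating the word before returning to user space.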

> 
> >> .\"
> >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> >> .\"   made the observation that "EINVAL is returned if the non-pi 
> >> .\"   to pi or op pairing semantics are violated."
> >> .\"   Probably there needs to be a general statement about this
> >> .\"   requirement, probably located at about this point in the page.
> >> .\"   Darren (or someone else), care to take a shot at this?
> > 
> > We can probably borrow from either the futex.c comments or the
> > futex-requeue-pi.txt in Documentation. Also, it is important to note
> > > that the PI requeue operations require two distinct uaddrs (although
> > > that is implied by requiring "non-pi to pi" as a futex cannot be both).
> > 
> > Or... perhaps something like:
> > 
> > Due to the kernel imposed futex word value policy, PI futex
> > operations have additional usage requirements:
> > 
> > FUTEX_WAIT_REQUEUE_PI must be paired with FUTEX_CMP_REQUEUE_PI
> > and be performed from a non-pi futex to a distinct pi futex.
> > Failing to do so will return EINVAL. 
> 
> For which operation does the EINVAL occur: FUTEX_WAIT_REQUEUE_PI or 
> FUTEX_CMP_REQUEUE_PI?

FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such
as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match
(meaning it's the same futex word - shared memory, inode, etc.). This can't
happen if the stated policy of requeueing from non-pi to pi is followed as the
same word cannot be both non-pi and pi at the same time, requiring them to be
unique futex words.

FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex
word. Also, if nr_wake != 1.

But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must requeue uaddr to
the same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call.
FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it,
and they must agree on uaddr and uaddr2.

...

> > And their PRIVATE counterparts of course (which is assumed as it is a
> > flag to the opcode).
> 
> Yes. But I don't think that needs to be called out explicitly here (?).


Agreed.

> 
> >> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> >> .\"   The following text is drawn from the Hart/Guniguntala paper
> >> .\"   (listed in SEE ALSO), but I have reworded some pieces
> >> .\"   significantly. Please check it.
> >>
> >>The PI futex operations described below  differ  from  the  other
> >>futex  operations  in  that  they impose policy on the use of the
> >>value of the futex word:
> >>
> >>*  If the lock is not acquired, the futex word's value  shall  be
> >>   0.
> >>
> >>*  If  the  lock is acquired, the futex word's value shall be the
> >>   thread ID (TID; see gettid(2)) of the owning thread.
> >>
> >>*  If the lock is owned and there are threads contending for  the
> >>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >>   word's value; in other words, this value is:
> >>
> >>   FUTEX_WAITERS | TID
> >>
> >>
> >>Note that a PI futex word never just has the value FUTEX_WAITERS,
> >>which is a permissible state for non-PI futexes.
> > 
> > The second clause is inappropriate. I don't know if that was yours or
> > mine, but non-PI futexes do not have a kernel defined value policy, so
> > ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> > permissible for non-PI futexes, and none have a kernel defined state.
> > 
> > Perhaps include a Note under the third bullet as:
> > 
> >     Note: It is invalid for a PI futex word to have no owner and
> >           FUTEX_WAITERS set.
> 
> Done.
> 
> With this policy in place, a user-space 

Re: Next round: revised futex(2) man page for review

2015-08-24 Thread Darren Hart
On Thu, Aug 20, 2015 at 12:40:46AM +0200, Thomas Gleixner wrote:
> On Wed, 5 Aug 2015, Darren Hart wrote:
> > On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> > > .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> > > .\"   The following text is drawn from the Hart/Guniguntala paper
> > > .\"   (listed in SEE ALSO), but I have reworded some pieces
> > > .\"   significantly. Please check it.
> > > 
> > >The PI futex operations described below  differ  from  the  other
> > >futex  operations  in  that  they impose policy on the use of the
> > >value of the futex word:
> > > 
> > >*  If the lock is not acquired, the futex word's value  shall  be
> > >   0.
> > > 
> > >*  If  the  lock is acquired, the futex word's value shall be the
> > >   thread ID (TID; see gettid(2)) of the owning thread.
> > > 
> > >*  If the lock is owned and there are threads contending for  the
> > >   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> > >   word's value; in other words, this value is:
> > > 
> > >   FUTEX_WAITERS | TID
> > > 
> > > 
> > >Note that a PI futex word never just has the value FUTEX_WAITERS,
> > >which is a permissible state for non-PI futexes.
> > 
> > The second clause is inappropriate. I don't know if that was yours or
> > mine, but non-PI futexes do not have a kernel defined value policy, so
> > ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> > permissible for non-PI futexes, and none have a kernel defined state.
> 
> Depends. If the regular futex is configured as robust, then we have a
> kernel defined value policy as well.

Indeed, thanks for catching that.

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-08-19 Thread Thomas Gleixner
On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
> >>FUTEX_CMP_REQUEUE (since Linux 2.6.7)
> >>   This  operation  first  checks  whether the location uaddr
> >>   still contains the value  val3.   If  not,  the  operation
> >>   fails  with  the  error  EAGAIN.  Otherwise, the operation
> >>   wakes up a maximum of val waiters that are waiting on  the
> >>   futex  at uaddr.  If there are more than val waiters, then
> >>   the remaining waiters are removed from the wait  queue  of
> >>   the  source  futex at uaddr and added to the wait queue of
> >>   the target futex at uaddr2.  The val2  argument  specifies
> >>   an  upper limit on the number of waiters that are requeued
> >>   to the futex at uaddr2.
> >>
> >> .\" FIXME(Torvald) Is the following correct?  Or is just the decision
> >> .\" which threads to wake or requeue part of the atomic operation?
> >>
> >>   The load from uaddr is  an  atomic  memory  access  (i.e.,
> >>   using atomic machine instructions of the respective archi‐
> >>   tecture).  This load, the comparison with  val3,  and  the
> >>   requeueing  of  any  waiters  are performed atomically and
> >>   totally ordered with respect to other  operations  on  the
> >>   same futex word.
> > 
> > It's atomic as the other atomic operations on the futex word. It's
> > always performed with the proper lock(s) held in the kernel. That
> > means any concurrent operation will serialize on that lock(s). User
> > space has to make sure, that depending on the observed value no
> > concurrent operations happen, but that's something the kernel cannot
> > control.
> 
> ???
> Sorry, I'm not clear here. Is the current text correct then? Or is some
> change needed.

I think we need some change here because the meaning of atomic is
unclear. The basic rules of futexes are:

 - All modifying operations on the futex value have to be done with
   atomic instructions, usually cmpxchg. That applies to both kernel
   and user space.

   That's the atomicity at the futex value level.

 - In the kernel we have to create/modify/destroy state in order to
   provide the blocking/requeueing etc.

   This state needs protection as well. So all operations related to
   the kernel internal state are serialized on the hash bucket
   locks. The hash buckets are a scalability mechanism to avoid
   contention on a single lock protecting all kernel internal
   state. For simplicity reasons you can just think of a global lock
   protecting all kernel internal state.

   If the kernel creates/modifies state then it can be necessary to
   either reread the futex value or modify it. That happens under the
   locks as well.

   So in the case of requeue, we take the proper locks and perform the
   comparison with val3 and the requeueing with the locks held.
   
   So that lock protection makes these operations 'atomic'. The
   correct expression is 'serialized'.
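
As a rough sketch of the first rule (atomicity at the futex value level),
the user-space side of an uncontended acquire and release can be written
with C11 atomics. The helper names are hypothetical and the contended
slow path, where the kernel takes over under its own locks, is left out
entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fast path: atomically change the word from 0 to tid.
 * Returns true if this thread now owns the lock. On failure the caller
 * would normally enter the kernel, which then inspects and updates the
 * word while holding its internal (hash bucket) locks. */
static bool word_try_acquire(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;
    assert(tid != 0);
    return atomic_compare_exchange_strong(word, &expected, tid);
}

/* Hypothetical fast path: atomically change the word from tid back to 0.
 * Fails if the word is anything other than the bare tid, for example
 * when a waiters bit has been set in the meantime. */
static bool word_try_release(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;
    return atomic_compare_exchange_strong(word, &expected, 0);
}
```

Every modification goes through a compare-and-exchange, so a concurrent
update by the kernel or another thread makes the operation fail instead
of being silently overwritten.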
 
> >> .\" FIXME We need some explanation in the following paragraph of *why*
> >> .\"   it is important to note that "the kernel will update the
> >> .\"   futex word's value prior to returning to user space". Can someone
> >> .\"   explain?
> >>
> >>It is important to note that the kernel will update the futex
> >>word's value prior to returning to user space.  Unlike the other
> >>futex operations described above, the PI futex operations are
> >>designed for the implementation of very specific IPC mechanisms.
> > 
> > If there are multiple waiters on a pi futex then a wake pi operation
> > will wake the first waiter and hand over the lock to this waiter. This
> > includes handing over the rtmutex which represents the futex in the
> > kernel. The strict requirement is that the futex owner and the rtmutex
> > owner must be the same, except for the update period which is
> > serialized by the futex internal locking. That means the kernel must
> > update the user space value prior to returning to user space.

And as noted above: It must update while holding the proper locks.

> >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> >> .\"   made the observation that "EINVAL is returned if the non-pi 
> >> .\"   to pi or op pairing semantics are violated."
> >> .\"   Probably there needs to be a general statement about this
> >> .\"   requirement, probably located at about this point in the page.
> >> .\"   Darren (or someone else), care to take a shot at this?
> > 
> > Well, that's hard to describe because the kernel only has a limited
> > way of detecting such mismatches. It only can detect it when there are
> > non PI waiters on a futex and a PI function is called or vice versa.
> 
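> Hmmm. Okay, I filed your comments away for reference, but
> hopefully someone can help with some actual text.

I let Darren come up with something sensible :)

> >> .\" FIXME Somewhere on this page (I guess under the discussion of PI
> >> .\"   futexes) we need 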

Re: Next round: revised futex(2) man page for review

2015-08-19 Thread Thomas Gleixner
On Wed, 5 Aug 2015, Darren Hart wrote:
> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> > .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> > .\"   The following text is drawn from the Hart/Guniguntala paper
> > .\"   (listed in SEE ALSO), but I have reworded some pieces
> > .\"   significantly. Please check it.
> > 
> >The PI futex operations described below  differ  from  the  other
> >futex  operations  in  that  they impose policy on the use of the
> >value of the futex word:
> > 
> >*  If the lock is not acquired, the futex word's value  shall  be
> >   0.
> > 
> >*  If  the  lock is acquired, the futex word's value shall be the
> >   thread ID (TID; see gettid(2)) of the owning thread.
> > 
> >*  If the lock is owned and there are threads contending for  the
> >   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >   word's value; in other words, this value is:
> > 
> >   FUTEX_WAITERS | TID
> > 
> > 
> >Note that a PI futex word never just has the value FUTEX_WAITERS,
> >which is a permissible state for non-PI futexes.
> 
> The second clause is inappropriate. I don't know if that was yours or
> mine, but non-PI futexes do not have a kernel defined value policy, so
> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> permissible for non-PI futexes, and none have a kernel defined state.

Depends. If the regular futex is configured as robust, then we have a
kernel defined value policy as well.

> > .\" FIXME I'm not quite clear on the meaning of the following sentence.
> > .\"   Is this trying to say that while blocked in a
> > .\"   FUTEX_WAIT_REQUEUE_PI, it could happen that another
> > .\"   task does a FUTEX_WAKE on uaddr that simply causes
> > .\"   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
> > .\"   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
> > .\"   operation? Does it remain blocked, or does it unblock
> > .\"   In which case, what does user space see?
> > 
> >   The
> >   waiter   can  be  removed  from  the  wait  on  uaddr  via
> >   FUTEX_WAKE without requeueing on uaddr2.
> 
> Userspace should see the task wake and continue executing. This would
> effectively be a cancelation operation - which I didn't think was
> supported. Thomas?

We probably never intended to support it, but looking at the code it
works (did not try it though). It returns to user space with
-EWOULDBLOCK. So it basically behaves like any other spurious wakeup.
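
To show what "behaves like any other spurious wakeup" means for the
caller, here is a minimal retry loop. Note the assumptions: it uses plain
FUTEX_WAIT rather than FUTEX_WAIT_REQUEUE_PI to keep the sketch short,
and wait_while_equal() is a hypothetical helper name. The point is only
that the caller re-checks its condition after every return.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>   /* FUTEX_WAIT */
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: block until *word differs from busy.
 * A return from the futex call is never taken as proof that the state
 * changed; the loop re-reads the word each time. On Linux EWOULDBLOCK
 * has the same value as EAGAIN, so both are covered by one test.
 * Returns 0 once the word differs, -1 on an unexpected error. */
static int wait_while_equal(_Atomic uint32_t *word, uint32_t busy)
{
    assert(word != NULL);
    while (atomic_load(word) == busy) {
        long rc = syscall(SYS_futex, word, FUTEX_WAIT, busy, NULL, NULL, 0);
        if (rc == -1 && errno != EAGAIN && errno != EINTR)
            return -1;   /* unexpected error; errno is left for the caller */
        /* rc == 0, EAGAIN/EWOULDBLOCK or EINTR: loop and re-check. */
    }
    return 0;
}
```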
 
Thanks,

tglx


Re: Next round: revised futex(2) man page for review

2015-08-19 Thread Thomas Gleixner
On Wed, 5 Aug 2015, Darren Hart wrote:
 On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
  .\ FIXME XXX = Start of adapted Hart/Guniguntala text =
  .\   The following text is drawn from the Hart/Guniguntala paper
  .\   (listed in SEE ALSO), but I have reworded some pieces
  .\   significantly. Please check it.
  
 The PI futex operations described below  differ  from  the  other
 futex  operations  in  that  they impose policy on the use of the
 value of the futex word:
  
 *  If the lock is not acquired, the futex word's value  shall  be
0.
  
 *  If  the  lock is acquired, the futex word's value shall be the
thread ID (TID; see gettid(2)) of the owning thread.
  
 *  If the lock is owned and there are threads contending for  the
lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
word's value; in other words, this value is:
  
FUTEX_WAITERS | TID
  
  
 Note that a PI futex word never just has the value FUTEX_WAITERS,
 which is a permissible state for non-PI futexes.
 
 The second clause is inappropriate. I don't know if that was yours or
 mine, but non-PI futexes do not have a kernel defined value policy, so
 ==FUTEX_WAITERS cannot be a permissible state as any value is
 permissible for non-PI futexes, and none have a kernel defined state.

Depends. If the regular futex is configured as robust, then we have a
kernel defined value policy as well.

  .\ FIXME I'm not quite clear on the meaning of the following sentence.
  .\   Is this trying to say that while blocked in a
  .\   FUTEX_WAIT_REQUEUE_PI, it could happen that another
  .\   task does a FUTEX_WAKE on uaddr that simply causes
  .\   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
  .\   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
  .\   opertion? Does it remain blocked, or does it unblock
  .\   In which case, what does user space see?
  
The
waiter   can  be  removed  from  the  wait  on  uaddr  via
FUTEX_WAKE without requeueing on uaddr2.
 
 Userspace should see the task wake and continue executing. This would
 effectively be a cancelation operation - which I didn't think was
 supported. Thomas?

We probably never intended to support it, but looking at the code it
works (did not try it though). It returns to user space with
-EWOULDBLOCK. So it basically behaves like any other spurious wakeup.
 
Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Next round: revised futex(2) man page for review

2015-08-19 Thread Thomas Gleixner
On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
 FUTEX_CMP_REQUEUE (since Linux 2.6.7)
This  operation  first  checks  whether the location uaddr
still contains the value  val3.   If  not,  the  operation
fails  with  the  error  EAGAIN.  Otherwise, the operation
wakes up a maximum of val waiters that are waiting on  the
futex  at uaddr.  If there are more than val waiters, then
the remaining waiters are removed from the wait  queue  of
the  source  futex at uaddr and added to the wait queue of
the target futex at uaddr2.  The val2  argument  specifies
an  upper limit on the number of waiters that are requeued
to the futex at uaddr2.
 
  .\ FIXME(Torvald) Is the following correct?  Or is just the decision
  .\ which threads to wake or requeue part of the atomic operation?
 
The load from uaddr is  an  atomic  memory  access  (i.e.,
using atomic machine instructions of the respective archi‐
tecture).  This load, the comparison with  val3,  and  the
requeueing  of  any  waiters  are performed atomically and
totally ordered with respect to other  operations  on  the
same futex word.
  
  It's atomic as the other atomic operations on the futex word. It's
  always performed with the proper lock(s) held in the kernel. That
  means any concurrent operation will serialize on that lock(s). User
  space has to make sure, that depending on the observed value no
  concurrent operations happen, but that's something the kernel cannot
  control.
 
 ???
 Sorry, I'm not clear here. Is the current text correct then? Or is some
 change needed.

I think we need some change here because the meaning of atomic is
unclear. The basic rules of futexes are:

 - All modifying operations on the futex value have to be done with
   atomic instructions, usually cmpxchg. That applies to both kernel
   and user space.

   That's the atomicity at the futex value level.

 - In the kernel we have to create/modify/destroy state in order to
   provide the blocking/requeueing etc.

   This state needs protection as well. So all operations related to
   the kernel internal state are serialized on the hash bucket
   locks. The hash buckets are a scalability mechanism to avoid
   contention on a single lock protecting all kernel internal
   state. For simplicity reasons you can just think of a global lock
   protecting all kernel internal state.

   If the kernel creates/modifies state then it can be necessary to
   either reread the futex value or modify it. That happens under the
   locks as well.

   So in the case of requeue, we take the proper locks and perform the
   comparison with val3 and the requeueing with the locks held.
   
   So that lock protection makes these operations 'atomic'. The
   correct expression is 'serialized'.
 
  .\ FIXME We need some explanation in the following paragraph of *why*
  .\   it is important to note that the kernel will update the
  .\   futex word's value prior
 It is important to note to returning to user space . Can someone
 explain?   that  the  kernel  will  update the futex word's value
 prior to returning to user space.  Unlike the other futex  opera‐
 tions  described  above, the PI futex operations are designed for
 the implementation of very specific IPC mechanisms.
  
  If there are multiple waiters on a pi futex then a wake pi operation
  will wake the first waiter and hand over the lock to this waiter. This
  includes handing over the rtmutex which represents the futex in the
  kernel. The strict requirement is that the futex owner and the rtmutex
  owner must be the same, except for the update period which is
  serialized by the futex internal locking. That means the kernel must
  update the user space value prior to returning to user space.

And as noted above: It must update while holding the proper locks.

  .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
  .\"   made the observation that EINVAL is returned if the non-pi
  .\"   to pi or op pairing semantics are violated.
  .\"   Probably there needs to be a general statement about this
  .\"   requirement, probably located at about this point in the page.
  .\"   Darren (or someone else), care to take a shot at this?
  
  Well, that's hard to describe because the kernel only has a limited
  way of detecting such mismatches. It only can detect it when there are
  non PI waiters on a futex and a PI function is called or vice versa.
 
 Hmmm. Okay, I filed your comments away for reference, but
 hopefully someone can help with some actual text.

I'll let Darren come up with something sensible :)
 
  .\ FIXME Somewhere on this page (I guess under the discussion of PI
  .\   futexes) we need 

Re: Next round: revised futex(2) man page for review

2015-08-08 Thread Michael Kerrisk (man-pages)
Hi Darren,

Some of my comments below will refer to the reply I just sent
to tglx (and the list) a few minutes ago.

On 08/06/2015 12:21 AM, Darren Hart wrote:
> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello all,
>>
> 
> Michael, thank you for your diligence in following up and collecting
> reviews. I've attempted to respond to what I was specifically called out
> in or where I had something specific to add in addition to other
> replies.

Thanks!

> After this, will you send another version (numbered for reference
> maybe?) with any remaining FIXMEs that haven't yet been addressed
> according to your accounting?

Yes, I'll be sending out another draft (probably after a short delay,
while I see what further responses come back on the mails I just sent.)
In any case, the latest version of the page can be found at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

>>Priority-inheritance futexes
>>Linux supports priority-inheritance (PI) futexes in order to han‐
>>dle priority-inversion problems that can be encountered with nor‐
>>mal  futex  locks.  Priority inversion is the problem that occurs
>>when a high-priority task is blocked waiting to  acquire  a  lock
>>held  by a low-priority task, while tasks at an intermediate pri‐
>>ority continuously preempt the low-priority task  from  the  CPU.
>>Consequently,  the  low-priority  task  makes  no progress toward
>>releasing the lock, and the high-priority task remains blocked.
>>
>>Priority inheritance is a mechanism for dealing with  the  prior‐
>>ity-inversion problem.  With this mechanism, when a high-priority
>>task becomes blocked by a lock held by a low-priority  task,  the
>>latter's priority is temporarily raised to that of the former, so
>>that it is not preempted by any intermediate level tasks, and can
>>thus  make  progress toward releasing the lock.  To be effective,
>>priority inheritance must be transitive, meaning that if a  high-
>>priority task blocks on a lock held by a lower-priority task that
>>is itself blocked by a lock held by another intermediate-priority
>>task  (and  so  on, for chains of arbitrary length), then both of
>>those tasks (or more generally, all of the tasks in a lock chain)
>>have  their priorities raised to be the same as the high-priority
>>task.
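
The transitivity requirement in the quoted paragraph can be illustrated with a toy model. This is only a sketch under stated assumptions: struct task, its fields, and inherit_priority are hypothetical names, a larger number means a higher priority, and the lock chain is assumed to be acyclic. It says nothing about how the kernel's rtmutex code is actually written.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: `prio` uses "larger means higher priority" (an assumption of
 * this sketch); `blocked_on` is the owner of the lock the task waits for,
 * or NULL if the task is not blocked. */
struct task {
    int prio;
    struct task *blocked_on;
};

/* Walk the chain of lock owners starting at the waiter and raise every
 * owner whose priority is below the waiter's. Assumes no cycle in the
 * chain (a cycle would be a deadlock). */
static void inherit_priority(struct task *waiter)
{
    assert(waiter != NULL);
    for (struct task *owner = waiter->blocked_on; owner != NULL;
         owner = owner->blocked_on) {
        if (owner->prio < waiter->prio)
            owner->prio = waiter->prio;
    }
}
```

With a chain high -> mid -> low, calling inherit_priority on the high-priority task boosts both of the others, which is the "all of the tasks in a lock chain" case from the text.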
>>
>> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
>> .\"   based on mail discussions with Darren Hart. Does it seem okay?
>>
>>From a user-space perspective, what makes a futex PI-aware  is  a
>>policy  agreement  between  user  space  and the kernel about the
>>value of the futex word (described in a moment), coupled with the
>>use  of  the  PI futex operations described below (in particular,
>>FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI).
> 
> Yes. Was this intended to be a complete opcode list? 

No. I'll remove that list, in case it's misunderstood that way.

> PI operations must
> use paired operations.
> 
> (FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI
> FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI

And now I've made that point explicitly in the page. See my comment 
lower down.

> And their PRIVATE counterparts of course (which is assumed as it is a
> flag to the opcode).

Yes. But I don't think that needs to be called out explicitly here (?).

>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
>> .\"   The following text is drawn from the Hart/Guniguntala paper
>> .\"   (listed in SEE ALSO), but I have reworded some pieces
>> .\"   significantly. Please check it.
>>
>>The PI futex operations described below  differ  from  the  other
>>futex  operations  in  that  they impose policy on the use of the
>>value of the futex word:
>>
>>*  If the lock is not acquired, the futex word's value  shall  be
>>   0.
>>
>>*  If  the  lock is acquired, the futex word's value shall be the
>>   thread ID (TID; see gettid(2)) of the owning thread.
>>
>>*  If the lock is owned and there are threads contending for  the
>>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>>   word's value; in other words, this value is:
>>
>>   FUTEX_WAITERS | TID
>>
>>
>>Note that a PI futex word never just has the value FUTEX_WAITERS,
>>which is a permissible state for non-PI futexes.
> 
> The second clause is inappropriate. I don't know if that was yours or
> mine, but non-PI futexes do not have a kernel defined value policy, so
> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> permissible for non-PI futexes, and none have a kernel defined state.
> 
> Perhaps include a Note under the third bullet as:
> 
>   Note: It is invalid for a PI futex word to have no owner and
>   FUTEX_WAITERS set.

Done.

Re: Next round: revised futex(2) man page for review

2015-08-08 Thread Michael Kerrisk (man-pages)
On 07/28/2015 11:03 PM, Thomas Gleixner wrote:
> On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> 
>> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>>
>>>>FUTEX_WAKE (since Linux 2.6.0)
>>>>   This  operation  wakes at most val of the waiters that are
>>>>   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>>>>   address  uaddr.  Most commonly, val is specified as either
>>>>   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
>>>>   ers).   No  guarantee  is provided about which waiters are
>>>>   awoken (e.g., a waiter with a higher  scheduling  priority
>>>>   is  not  guaranteed to be awoken in preference to a waiter
>>>>   with a lower priority).
>>>
>>> That's only correct up to Linux 2.6.21.
>>>
>>> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
>>> threads this takes the nice level into account. Threads with the same
>>> priority are woken in FIFO order.
>>
>> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
>> 'wrong'.
> 
> Indeed.
>  
>> Also, this code seems to use plist, which means it won't do the right
>> thing for SCHED_DEADLINE either.
>>
>> Do we want to go fix that?
> 
> I think so.

So, no change to this piece of text then?

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Next round: revised futex(2) man page for review

2015-08-08 Thread Michael Kerrisk (man-pages)
Hi Thomas,

Thank you for the comments below. This helps hugely:
more than 30 of my FIXMEs have now gone away!

I have a few open questions, which you can find
by searching for the string "???". If you would have
a chance to look at those, I'd appreciate it.

On 07/28/2015 10:23 PM, Thomas Gleixner wrote:
> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>>FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
>>   This   option   bit   can   be   employed  only  with  the
>>   FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations.
>>
>>   If this option is set, the kernel  treats  timeout  as  an
>>   absolute time based on CLOCK_REALTIME.
>>
>> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
>>   If  this  option  is not set, the kernel treats timeout as
>>   relative time, measured against the CLOCK_MONOTONIC clock.
> 
> That's correct.

Thanks.
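
As a rough illustration of the FUTEX_CLOCK_REALTIME text quoted above: with the option set, the timeout passed to FUTEX_WAIT_BITSET is an absolute CLOCK_REALTIME time, so the caller computes a deadline rather than an interval. The helper names deadline_after_ms and futex_wait_until are assumptions of this sketch; the constants come from <linux/futex.h>, and the raw syscall(2) interface is used because glibc provides no futex wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: return `now` advanced by `ms` milliseconds, keeping
 * tv_nsec normalized to the range [0, 1e9). */
static struct timespec deadline_after_ms(struct timespec now, long ms)
{
    assert(ms >= 0);
    now.tv_sec += ms / 1000;
    now.tv_nsec += (ms % 1000) * 1000000L;
    if (now.tv_nsec >= 1000000000L) {
        now.tv_sec += 1;
        now.tv_nsec -= 1000000000L;
    }
    return now;
}

/* Hypothetical wrapper: sleep while *uaddr == expected, giving up once the
 * absolute CLOCK_REALTIME deadline has passed. Returns 0 when woken, or -1
 * with errno set. */
static long futex_wait_until(uint32_t *uaddr, uint32_t expected,
                             const struct timespec *abs_deadline)
{
    return syscall(SYS_futex, uaddr,
                   FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                   expected, abs_deadline, NULL, FUTEX_BITSET_MATCH_ANY);
}
```

A caller would typically read the current time with clock_gettime(CLOCK_REALTIME, ...), pass it through deadline_after_ms, and hand the result to futex_wait_until.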

>>The operation specified in futex_op is one of the following:
>>
>>FUTEX_WAIT (since Linux 2.6.0)
>>   This operation tests that the  value  at  the  futex  word
>>   pointed  to  by  the  address  uaddr  still  contains  the
>>   expected value  val,  and  if  so,  then  sleeps  awaiting
>>   FUTEX_WAKE  on  the  futex word.  The load of the value of
>>   the futex word is an atomic  memory  access  (i.e.,  using
>>   atomic  machine  instructions  of the respective architec‐
>>   ture).  This load, the comparison with the expected value,
>>   and starting to sleep are performed atomically and totally
>>   ordered with respect to other futex operations on the same
>>   futex  word.  If the thread starts to sleep, it is consid‐
>>   ered a waiter on this futex word.  If the futex value does
>>   not  match  val,  then the call fails immediately with the
>>   error EAGAIN.
>>
>>   The purpose of the comparison with the expected  value  is
>>   to  prevent  lost  wake-ups: If another thread changed the
>>   value of the futex word after the calling  thread  decided
>>   to block based on the prior value, and if the other thread
>>   executed a FUTEX_WAKE operation (or similar wake-up) after
>>   the  value  change  and  before this FUTEX_WAIT operation,
>>   then the latter will observe the value change and will not
>>   start to sleep.
>>
>>   If  the timeout argument is non-NULL, its contents specify
>>   a relative timeout for the wait, measured according to the
>> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
> 
> Yes.

Thanks.

> 
>>   CLOCK_MONOTONIC  clock.  (This interval will be rounded up
>>   to the system clock  granularity,  and  kernel  scheduling
>>   delays  mean  that  the blocking interval may overrun by a
>>   small amount.)
> 
>   The given wait time will be rounded up to the system
>   clock granularity and is guaranteed not to expire
>   early.
> 
> There are a gazillion reasons why it can expire late, but the
> guarantee is that it never expires prematurely.
> 
>>If timeout is NULL, the call blocks indef‐
>>   initely.
> 
> Right.

Thanks. Reworded as you suggest. 
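
The FUTEX_WAIT behaviour discussed above (compare, then sleep, with EAGAIN when the value has already changed) can be sketched with a thin wrapper around the raw system call. The name futex_wait is an assumption of this sketch, since glibc exports no such function; the relative timeout is passed straight through and is measured against CLOCK_MONOTONIC as the quoted text says.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: block while *uaddr == expected.
 * Returns 0 when woken, or -1 with errno set, for example:
 *   EAGAIN    - *uaddr did not contain `expected` when the kernel checked,
 *   ETIMEDOUT - rel_timeout (relative, may be NULL for "forever") elapsed. */
static int futex_wait(uint32_t *uaddr, uint32_t expected,
                      const struct timespec *rel_timeout)
{
    assert(uaddr != NULL);
    return (int)syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                        rel_timeout, NULL, 0);
}
```

If another thread changes the word and issues a wake-up before this call, the kernel sees the mismatch and the call fails with EAGAIN instead of sleeping, which is how the lost wake-up is avoided.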

>>   The arguments uaddr2 and val3 are ignored.
>>
>>
>>FUTEX_WAKE (since Linux 2.6.0)
>>   This  operation  wakes at most val of the waiters that are
>>   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>>   address  uaddr.  Most commonly, val is specified as either
>>   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
>>   ers).   No  guarantee  is provided about which waiters are
>>   awoken (e.g., a waiter with a higher  scheduling  priority
>>   is  not  guaranteed to be awoken in preference to a waiter
>>   with a lower priority).
> 
> That's only correct up to Linux 2.6.21.
> 
> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> threads this takes the nice level into account. Threads with the same
> priority are woken in FIFO order.

So, this got picked up in a little subthread by Peter Zijlstra. I'll
reply there.
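
For reference, the two common FUTEX_WAKE usages named in the quoted text (val of 1 or INT_MAX) look roughly like this. The wrapper names are assumptions of this sketch, and the code deliberately relies on nothing about which waiter is woken first, since that ordering is exactly what the subthread is discussing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: wake at most nr_wake waiters blocked on *uaddr.
 * Returns the number of waiters woken, or -1 with errno set. */
static int futex_wake(uint32_t *uaddr, int nr_wake)
{
    assert(uaddr != NULL && nr_wake >= 0);
    return (int)syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake, NULL, NULL, 0);
}

/* Wake every waiter on *uaddr by passing INT_MAX as the count. */
static int futex_wake_all(uint32_t *uaddr)
{
    return futex_wake(uaddr, INT_MAX);
}
```

When nobody is waiting on the word, both calls simply report that zero waiters were woken.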

>>   The arguments timeout, uaddr2, and val3 are ignored.
>  
>>
>>FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
>>   This operation creates a file descriptor that  is  associ‐
>>   ated  with  the futex at uaddr.  The caller must close the
>>   returned file descriptor after use.  When another  process
>>   or  thread  performs  a  FUTEX_WAKE on the futex word, the
>>   file  descriptor  indicates   as   being   readable   with
>>   select(2), poll(2), and epoll(7)
>>
>>   The  file  descriptor  can  be used to obtain asynchronous
>>   notifications:  if  val  is

Re: Next round: revised futex(2) man page for review

2015-08-05 Thread Darren Hart
On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello all,
> 

Michael, thank you for your diligence in following up and collecting
reviews. I've attempted to respond to what I was specifically called out
in or where I had something specific to add in addition to other
replies.

After this, will you send another version (numbered for reference
maybe?) with any remaining FIXMEs that haven't yet been addressed
according to your accounting?

...

>Priority-inheritance futexes
>Linux supports priority-inheritance (PI) futexes in order to han‐
>dle priority-inversion problems that can be encountered with nor‐
>mal  futex  locks.  Priority inversion is the problem that occurs
>when a high-priority task is blocked waiting to  acquire  a  lock
>held  by a low-priority task, while tasks at an intermediate pri‐
>ority continuously preempt the low-priority task  from  the  CPU.
>Consequently,  the  low-priority  task  makes  no progress toward
>releasing the lock, and the high-priority task remains blocked.
> 
>Priority inheritance is a mechanism for dealing with  the  prior‐
>ity-inversion problem.  With this mechanism, when a high-priority
>task becomes blocked by a lock held by a low-priority  task,  the
>latter's priority is temporarily raised to that of the former, so
>that it is not preempted by any intermediate level tasks, and can
>thus  make  progress toward releasing the lock.  To be effective,
>priority inheritance must be transitive, meaning that if a  high-
>priority task blocks on a lock held by a lower-priority task that
>is itself blocked by a lock held by another intermediate-priority
>task  (and  so  on, for chains of arbitrary length), then both of
>those tasks (or more generally, all of the tasks in a lock chain)
>have  their priorities raised to be the same as the high-priority
>task.
> 
> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
> .\"   based on mail discussions with Darren Hart. Does it seem okay?
> 
>From a user-space perspective, what makes a futex PI-aware  is  a
>policy  agreement  between  user  space  and the kernel about the
>value of the futex word (described in a moment), coupled with the
>use  of  the  PI futex operations described below (in particular,
>FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI).

Yes. Was this intended to be a complete opcode list? PI operations must
use paired operations.

(FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI
FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI

And their PRIVATE counterparts of course (which is assumed as it is a
flag to the opcode).

> 
> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> .\"   The following text is drawn from the Hart/Guniguntala paper
> .\"   (listed in SEE ALSO), but I have reworded some pieces
> .\"   significantly. Please check it.
> 
>The PI futex operations described below  differ  from  the  other
>futex  operations  in  that  they impose policy on the use of the
>value of the futex word:
> 
>*  If the lock is not acquired, the futex word's value  shall  be
>   0.
> 
>*  If  the  lock is acquired, the futex word's value shall be the
>   thread ID (TID; see gettid(2)) of the owning thread.
> 
>*  If the lock is owned and there are threads contending for  the
>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>   word's value; in other words, this value is:
> 
>   FUTEX_WAITERS | TID
> 
> 
>Note that a PI futex word never just has the value FUTEX_WAITERS,
>which is a permissible state for non-PI futexes.

The second clause is inappropriate. I don't know if that was yours or
mine, but non-PI futexes do not have a kernel defined value policy, so
==FUTEX_WAITERS cannot be a "permissible state" as any value is
permissible for non-PI futexes, and none have a kernel defined state.

Perhaps include a Note under the third bullet as:

  Note: It is invalid for a PI futex word to have no owner and
FUTEX_WAITERS set.
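
The value policy in the quoted bullets, together with the Note suggested above, can be written down as a small set of predicates. This is a sketch only: the function names are hypothetical, FUTEX_WAITERS and FUTEX_TID_MASK are taken from <linux/futex.h>, and other bits that can appear in the word (such as the owner-died bit used with robust futexes) are ignored here.

```c
#include <assert.h>
#include <linux/futex.h>
#include <stdbool.h>
#include <stdint.h>

/* The waiters flag must not overlap the bits that hold the owner's TID. */
static_assert((FUTEX_WAITERS & FUTEX_TID_MASK) == 0,
              "FUTEX_WAITERS overlaps the TID bits");

/* TID of the owning thread, or 0 if the lock is not acquired. */
static uint32_t pi_owner_tid(uint32_t word)
{
    return word & FUTEX_TID_MASK;
}

/* True if threads are contending for the lock. */
static bool pi_has_waiters(uint32_t word)
{
    return (word & FUTEX_WAITERS) != 0;
}

/* The state ruled out by the Note: FUTEX_WAITERS set but no owner. */
static bool pi_word_is_valid(uint32_t word)
{
    return !(pi_has_waiters(word) && pi_owner_tid(word) == 0);
}
```

Under this model 0, TID, and FUTEX_WAITERS | TID are all acceptable, while a bare FUTEX_WAITERS is not.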

> 
>With this policy in place, a user-space application can acquire a
>not-acquired lock or release a lock that no other threads try  to

"that no other threads try to acquire" seems out of place. I think
"atomic instructions" is sufficient to express how contention is
handled.

>acquire using atomic instructions executed in user space (e.g., a
>compare-and-swap operation such as cmpxchg on the  x86  architec‐
>ture).   Acquiring  a  lock simply consists of using compare-and-
>swap to atomically set the futex word's value to the caller's TID
>if  its  previous  value  was 0.  Releasing a lock requires using
>compare-and-swap to set the futex word's value to 0 if the

Re: Next round: revised futex(2) man page for review

2015-07-30 Thread Michael Kerrisk (man-pages)
On 07/29/2015 06:21 AM, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
>> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>>> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>>
>> ...
>>
FUTEX_REQUEUE (since Linux 2.6.0)
 .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
 .\" in general, or is this comment implicitly speaking about the
 .\" condvar (?) use case? If the latter we might want to weaken the
 .\" advice below a little.
 .\" [Anyone else have input on this?]
>>>
>>> The condvar use case exposes the flaw nicely, but that's pretty much
>>> true for everything which wants a sane requeue operation.
>>
>> In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken
>> and should not be used in new code) and someone argued strongly against...
>> stating that there were legitimate uses for it. Of course I'm struggling to
>> find the thread and the reference at the moment - immensely useful, I know.
>>
>> I'll continue trying to find it and see if it can be useful here. I believe
>> Torvald was on the thread as well.
>>
> 
> Found it on libc-alpha, here it is for reference:
> 
>   From: Rich Felker 
>   Date: Wed, 29 Oct 2014 22:43:17 -0400
>   To: Darren Hart 
>   Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
>   Michael Kerrisk
>   Subject: Re: Add futex wrapper to glibc?
> 
>   On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
>   > > We are IMO at the stage where futex is stable, few things are
>   > > changing, and with documentation in place, I would consider adding a
>   > > futex wrapper.
>   > 
>   > Yes, at least for the defined OP codes. New OPs may be added of
>   > course, but that isn't a concern for supporting what exists today, and
>   > doesn't break compatibility.
>   > 
>   > I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
>   > broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
>   > wrapper is one way to encourage developers to do the right thing
>   > (don't expose the bad op in the header).
> 
>   You're mistaken here. There are plenty of valid ways to use
>   FUTEX_REQUEUE - for example if the calling thread is requeuing the
>   target(s) to a lock that the calling thread owns. Just because it
>   doesn't meet the needs of the way glibc was using it internally
>   doesn't mean it's useless for other applications.
> 
>   In any case, I don't think there's a proposal to intercept/modify the
>   commands to futex, just to pass them through (and possibly do a
>   cancellable syscall for some of them).
> 
>   Rich
> 
> 
>>>
   Avoid using this operation.  It is broken for its intended
   purpose.  Use FUTEX_CMP_REQUEUE instead.

   This operation performs the same task as
   FUTEX_CMP_REQUEUE, except that no check is made using  the
   value in val3.  (The argument val3 is ignored.)

Thanks, Darren, that's really helpful! I've removed the statement in the man
page that FUTEX_REQUEUE is broken.
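
For readers comparing the two operations, the difference discussed in this
thread can be sketched as follows. The helper names are hypothetical (glibc
provides no futex() wrapper, so the raw syscall is used), and the sketch
assumes a 64-bit Linux system. The only difference between the two calls is
whether the kernel checks the futex word against val3 before requeuing.

```c
#define _GNU_SOURCE
#include <errno.h>         /* callers inspect errno when -1 is returned */
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Wake up to nr_wake waiters on 'from', then move up to nr_requeue of the
   remaining waiters to 'to'.  The value of *from is not checked. */
static long futex_requeue(uint32_t *from, uint32_t *to,
                          uint32_t nr_wake, uint32_t nr_requeue)
{
    /* nr_requeue travels in the argument slot otherwise used for timeout. */
    return syscall(SYS_futex, from, FUTEX_REQUEUE, nr_wake,
                   (unsigned long) nr_requeue, to, 0);
}

/* Same task, but the kernel first checks that *from == expected (passed
   in val3) and fails with EAGAIN if it does not. */
static long futex_cmp_requeue(uint32_t *from, uint32_t *to,
                              uint32_t nr_wake, uint32_t nr_requeue,
                              uint32_t expected)
{
    return syscall(SYS_futex, from, FUTEX_CMP_REQUEUE, nr_wake,
                   (unsigned long) nr_requeue, to, expected);
}
```

In the case Rich describes, where the caller owns the lock that the waiters
are requeued to, the unchecked variant may be all that is needed; where the
value of the source futex word can change concurrently, the checked variant
lets the caller detect that and retry.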

By the way, Darren. There were a couple of FIXMEs in the page where you are
explicitly mentioned by name. Could you take a look at those? Specifically,
the large block of text starting at:

>> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
>> .\"   based on mail discussions with Darren Hart. Does it seem okay?

   (tglx looked at this and blessed it, but I'd like you also to check.)

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Next round: revised futex(2) man page for review

2015-07-29 Thread Thomas Gleixner
On Tue, 28 Jul 2015, Darren Hart wrote:
> Found it on libc-alpha, here it is for reference:
> 
>   From: Rich Felker 
>   Date: Wed, 29 Oct 2014 22:43:17 -0400
>   To: Darren Hart 
>   Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
>   Michael Kerrisk
>   Subject: Re: Add futex wrapper to glibc?
> 
>   On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
>   > > We are IMO at the stage where futex is stable, few things are
>   > > changing, and with documentation in place, I would consider adding a
>   > > futex wrapper.
>   > 
>   > Yes, at least for the defined OP codes. New OPs may be added of
>   > course, but that isn't a concern for supporting what exists today, and
>   > doesn't break compatibility.
>   > 
>   > I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
>   > broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
>   > wrapper is one way to encourage developers to do the right thing
>   > (don't expose the bad op in the header).
> 
>   You're mistaken here. There are plenty of valid ways to use
>   FUTEX_REQUEUE - for example if the calling thread is requeuing the
>   target(s) to a lock that the calling thread owns. Just because it
>   doesn't meet the needs of the way glibc was using it internally
>   doesn't mean it's useless for other applications.
> 
>   In any case, I don't think there's a proposal to intercept/modify the
>   commands to futex, just to pass them through (and possibly do a
>   cancellable syscall for some of them).

Fair enough. Did not think about the requeue to futex held by the
caller case. In that case FUTEX_REQUEUE works as advertised.

Thanks,

tglx


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Darren Hart
On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> > On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
> 
> ...
> 
> > >FUTEX_REQUEUE (since Linux 2.6.0)
> > > .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > > .\" in general, or is this comment implicitly speaking about the
> > > .\" condvar (?) use case? If the latter we might want to weaken the
> > > .\" advice below a little.
> > > .\" [Anyone else have input on this?]
> > 
> > The condvar use case exposes the flaw nicely, but that's pretty much
> > true for everything which wants a sane requeue operation.
> 
> In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken
> and should not be used in new code) and someone argued strongly against...
> stating that there were legitimate uses for it. Of course I'm struggling to
> find the thread and the reference at the moment - immensely useful, I know.
> 
> I'll continue trying to find it and see if it can be useful here. I believe
> Torvald was on the thread as well.
> 

Found it on libc-alpha, here it is for reference:

From: Rich Felker 
Date: Wed, 29 Oct 2014 22:43:17 -0400
To: Darren Hart 
Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
Michael Kerrisk
Subject: Re: Add futex wrapper to glibc?

On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
> > We are IMO at the stage where futex is stable, few things are
> > changing, and with documentation in place, I would consider adding a
> > futex wrapper.
> 
> Yes, at least for the defined OP codes. New OPs may be added of
> course, but that isn't a concern for supporting what exists today, and
> doesn't break compatibility.
> 
> I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
> broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
> wrapper is one way to encourage developers to do the right thing
> (don't expose the bad op in the header).

You're mistaken here. There are plenty of valid ways to use
FUTEX_REQUEUE - for example if the calling thread is requeuing the
target(s) to a lock that the calling thread owns. Just because it
doesn't meet the needs of the way glibc was using it internally
doesn't mean it's useless for other applications.

In any case, I don't think there's a proposal to intercept/modify the
commands to futex, just to pass them through (and possibly do a
cancellable syscall for some of them).

Rich


> > 
> > >   Avoid using this operation.  It is broken for its intended
> > >   purpose.  Use FUTEX_CMP_REQUEUE instead.
> > > 
> > >   This operation performs the same task as
> > >   FUTEX_CMP_REQUEUE, except that no check is made using  the
> > >   value in val3.  (The argument val3 is ignored.)
> > > 
> 
> -- 
> Darren Hart
> Intel Open Source Technology Center

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Darren Hart
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:

...

> >FUTEX_REQUEUE (since Linux 2.6.0)
> > .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > .\" in general, or is this comment implicitly speaking about the
> > .\" condvar (?) use case? If the latter we might want to weaken the
> > .\" advice below a little.
> > .\" [Anyone else have input on this?]
> 
> The condvar use case exposes the flaw nicely, but that's pretty much
> true for everything which wants a sane requeue operation.

In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken and
should not be used in new code) and someone argued strongly against... stating
that there were legitimate uses for it. Of course I'm struggling to find the
thread and the reference at the moment - immensely useful, I know.

I'll continue trying to find it and see if it can be useful here. I believe
Torvald was on the thread as well.

> 
> >   Avoid using this operation.  It is broken for its intended
> >   purpose.  Use FUTEX_CMP_REQUEUE instead.
> > 
> >   This operation performs the same task as
> >   FUTEX_CMP_REQUEUE, except that no check is made using  the
> >   value in val3.  (The argument val3 is ignored.)
> > 

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 22:45 +0200, Peter Zijlstra wrote:
> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.

Ick, I don't look forward to seeing nice futex plists converted into
rbtrees. As opposed to, eg. rtmutexes, there are a few caveats:

- Dealing with the top_waiter in rtmutexes is always easy, but in
futexes we need to deal with keys, so caching the leftmost won't work as
nicely.

- This will bloat things like futex_wake, where O(logN) is not suited
for FIFO iteration. And iterating linked lists is, in essence, all that
we really do when calling futex(2).

I have to wonder about the extra overhead added by these points.  I do
understand the dl concern, nonetheless.



Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Thomas Gleixner
On Tue, 28 Jul 2015, Peter Zijlstra wrote:

> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> 
> > >FUTEX_WAKE (since Linux 2.6.0)
> > >   This  operation  wakes at most val of the waiters that are
> > >   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> > >   address  uaddr.  Most commonly, val is specified as either
> > >   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
> > >   ers).   No  guarantee  is provided about which waiters are
> > >   awoken (e.g., a waiter with a higher  scheduling  priority
> > >   is  not  guaranteed to be awoken in preference to a waiter
> > >   with a lower priority).
> > 
> > That's only correct up to Linux 2.6.21.
> > 
> > Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> > threads this takes the nice level into account. Threads with the same
> > priority are woken in FIFO order.
> 
> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
> 'wrong'.

Indeed.
 
> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.
> 
> Do we want to go fix that?

I think so.

Thanks,

tglx


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Peter Zijlstra
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:

> >FUTEX_WAKE (since Linux 2.6.0)
> >   This  operation  wakes at most val of the waiters that are
> >   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> >   address  uaddr.  Most commonly, val is specified as either
> >   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
> >   ers).   No  guarantee  is provided about which waiters are
> >   awoken (e.g., a waiter with a higher  scheduling  priority
> >   is  not  guaranteed to be awoken in preference to a waiter
> >   with a lower priority).
> 
> That's only correct up to Linux 2.6.21.
> 
> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> threads this takes the nice level into account. Threads with the same
> priority are woken in FIFO order.

Maybe don't mention the effects of SCHED_OTHER, order by nice value is
'wrong'.

Also, this code seems to use plist, which means it won't do the right
thing for SCHED_DEADLINE either.

Do we want to go fix that?


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Thomas Gleixner
On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
>   This   option   bit   can   be   employed  only  with  the
>   FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations.
> 
>   If this option is set, the kernel  treats  timeout  as  an
>   absolute time based on CLOCK_REALTIME.
> 
> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
>   If  this  option  is not set, the kernel treats timeout as
>   relative time, measured against the CLOCK_MONOTONIC clock.

That's correct.
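
A minimal sketch of how the option bit might be used with
FUTEX_WAIT_BITSET, assuming a 64-bit Linux system. The two helper names
are hypothetical; the constants come from <linux/futex.h>, and the raw
syscall is used because glibc has no futex() wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>         /* callers inspect errno when -1 is returned */
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Block while *uaddr == expected, at most until the absolute
   CLOCK_REALTIME deadline.  Returns 0 when woken, -1 with errno set
   otherwise (ETIMEDOUT once the deadline has passed). */
static long futex_wait_until_realtime(uint32_t *uaddr, uint32_t expected,
                                      const struct timespec *deadline)
{
    return syscall(SYS_futex, uaddr,
                   FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                   expected, deadline, NULL, FUTEX_BITSET_MATCH_ANY);
}

/* Build an absolute deadline: CLOCK_REALTIME "now" plus ms milliseconds. */
static struct timespec realtime_deadline_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {   /* normalize nanoseconds */
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}
```

Because the deadline is absolute, a caller that is interrupted and retries
can pass the same timespec again without recomputing the remaining time.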

>The operation specified in futex_op is one of the following:
> 
>FUTEX_WAIT (since Linux 2.6.0)
>   This operation tests that the  value  at  the  futex  word
>   pointed  to  by  the  address  uaddr  still  contains  the
>   expected value  val,  and  if  so,  then  sleeps  awaiting
>   FUTEX_WAKE  on  the  futex word.  The load of the value of
>   the futex word is an atomic  memory  access  (i.e.,  using
>   atomic  machine  instructions  of the respective architec‐
>   ture).  This load, the comparison with the expected value,
>   and starting to sleep are performed atomically and totally
>   ordered with respect to other futex operations on the same
>   futex  word.  If the thread starts to sleep, it is consid‐
>   ered a waiter on this futex word.  If the futex value does
>   not  match  val,  then the call fails immediately with the
>   error EAGAIN.
> 
>   The purpose of the comparison with the expected  value  is
>   to  prevent  lost  wake-ups: If another thread changed the
>   value of the futex word after the calling  thread  decided
>   to block based on the prior value, and if the other thread
>   executed a FUTEX_WAKE operation (or similar wake-up) after
>   the  value  change  and  before this FUTEX_WAIT operation,
>   then the latter will observe the value change and will not
>   start to sleep.
> 
>   If  the timeout argument is non-NULL, its contents specify
>   a relative timeout for the wait, measured according to the
> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?

Yes.

>   CLOCK_MONOTONIC  clock.  (This interval will be rounded up
>   to the system clock  granularity,  and  kernel  scheduling
>   delays  mean  that  the blocking interval may overrun by a
>   small amount.)

The given wait time will be rounded up to the system
clock granularity and is guaranteed not to expire
early.

There are a gazillion reasons why it can expire late, but the
guarantee is that it never expires prematurely.
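
The two behaviours discussed so far (EAGAIN on a value mismatch, and a
relative timeout that may overrun but does not expire early) can be
observed with a sketch like the one below. The helper names are
hypothetical and a 64-bit Linux system is assumed.

```c
#define _GNU_SOURCE
#include <errno.h>         /* callers inspect errno when -1 is returned */
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Sleep while *uaddr == expected; rel_timeout is a relative timeout
   (NULL blocks indefinitely).  Returns 0 when woken, -1 with errno set:
   EAGAIN if *uaddr != expected, ETIMEDOUT if the timeout expired. */
static long futex_wait(uint32_t *uaddr, uint32_t expected,
                       const struct timespec *rel_timeout)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                   rel_timeout, NULL, 0);
}

/* Wait with a relative timeout of timeout_ms milliseconds and return the
   elapsed time in nanoseconds as seen on CLOCK_MONOTONIC. */
static int64_t timed_wait_elapsed_ns(uint32_t *uaddr, uint32_t expected,
                                     long timeout_ms)
{
    struct timespec rel = { timeout_ms / 1000,
                            (timeout_ms % 1000) * 1000000L };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    futex_wait(uaddr, expected, &rel);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (int64_t) (t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

With no other thread waking the futex, the elapsed time reported by the
second helper should be at least the requested timeout, and possibly more.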

> If timeout is NULL, the call blocks indef‐
>   initely.

Right.
 
>   The arguments uaddr2 and val3 are ignored.
> 
> 
>FUTEX_WAKE (since Linux 2.6.0)
>   This  operation  wakes at most val of the waiters that are
>   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>   address  uaddr.  Most commonly, val is specified as either
>   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
>   ers).   No  guarantee  is provided about which waiters are
>   awoken (e.g., a waiter with a higher  scheduling  priority
>   is  not  guaranteed to be awoken in preference to a waiter
>   with a lower priority).

That's only correct up to Linux 2.6.21.

Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
threads this takes the nice level into account. Threads with the same
priority are woken in FIFO order.
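
For reference, the two common ways of calling FUTEX_WAKE mentioned in the
quoted text might be wrapped as below. The helper names are hypothetical
and the raw syscall is used. Note that the caller only chooses how many
waiters to wake; which waiters are picked is decided by the kernel, in the
order described above.

```c
#define _GNU_SOURCE
#include <limits.h>        /* INT_MAX */
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Wake at most nr_wake waiters on uaddr.  Returns the number of waiters
   woken, or -1 with errno set. */
static long futex_wake(uint32_t *uaddr, int nr_wake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake, NULL, NULL, 0);
}

/* Wake a single waiter (val == 1). */
static long futex_wake_one(uint32_t *uaddr)
{
    return futex_wake(uaddr, 1);
}

/* Wake all waiters (val == INT_MAX). */
static long futex_wake_all(uint32_t *uaddr)
{
    return futex_wake(uaddr, INT_MAX);
}
```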
 
>   The arguments timeout, uaddr2, and val3 are ignored.
 
> 
>FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
>   This operation creates a file descriptor that  is  associ‐
>   ated  with  the futex at uaddr.  The caller must close the
>   returned file descriptor after use.  When another  process
>   or  thread  performs  a  FUTEX_WAKE on the futex word, the
>   file  descriptor  indicates   as   being   readable   with
>   select(2), poll(2), and epoll(7)
> 
>   The  file  descriptor  can  be used to obtain asynchronous
>   notifications:  if  val  is  nonzero,  then  when  another
>   process  or  thread executes a FUTEX_WAKE, the caller will
>   receive the signal number that was passed in val.
> 
>   The arguments timeout, uaddr2 and val3 are ignored.
> 
> .\" FIXME(Torvald) We never define "upped".  Maybe just remove the
> .\"  following sentence?
>   To prevent 

Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
On 07/28/2015 07:52 PM, Davidlohr Bueso wrote:
> On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
>> Maybe you still have some further improvements for the paragraph?
> 
> Nah, this is fine enough. Looks good.

Okay. Thanks. I added a Reviewed-by: for you.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
> Maybe you still have some further improvements for the paragraph?

Nah, this is fine enough. Looks good.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
Hi David,

On 07/28/2015 05:16 AM, Davidlohr Bueso wrote:
> On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote:
>> Hi David,
>>
>> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
>>> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
>>>
>>>> The condition is represented by the futex word, which is an address in
>>>> memory supplied to the futex() system call, and the value at this
>>>> memory location.  (While the virtual addresses for the same memory in
>>>> separate processes may not be equal, the kernel maps them internally so
>>>> that the same memory mapped in different locations will correspond for
>>>> futex() calls.)
>>>>
>>>> When executing a futex operation that requests to block a thread, the
>>>> kernel will only block if the futex word has the value that the calling
>>>
>>> Given the use of "word", you should probably state right away that
>>> futexes are only 32bit.
>>
>> So, I made the opening sentence here:
>>
>>The  condition  is  represented  by  the  futex word, which is an
>>address in memory supplied to the futex() system  call,  and  the
>>32-bit  value  at  this  memory  location. 
>>
>> Okay?
> 
> I think we can still improve :)
> 
> I've re-read the whole first paragraphs, and have a few comments that
> touch upon this specific wording. Lets see. You have:
> 
>> The futex() system call provides a method for waiting until a certain
>> condition becomes true.  It is typically used as a blocking construct
>> in the context of shared-memory synchronization: The program implements
>> the majority of the synchronization in user space, and uses one of
>> operations of the system call when it is likely that it has to block
>> for a longer time until the condition becomes true.  The program uses
>> another operation of the system call to wake anyone waiting for a
>> particular condition.
> 
> I've rephrased the next paragraph. How about adding this to follow?
> 
>   A futex is in essence a 32-bit user-space address. All futex operations and
>   conditions are governed by this variable, from now on referred to as 'futex
>   word'. As such, a futex is identified by the address in shared memory, which
>   may or may not be shared between different processes. For virtual memory, the
>   kernel will internally handle and resolve the latter. This futex word, along
>   with the value at its memory location, are supplied to the futex() system
>   call.
> 
> Feel free to reword however you think is better.


I agree with you that that second paragraph is a bit broken. But, like Heinrich,
I'm confused by this term "32-bit ... address".

I've rewritten the paragraph as:

   A futex is a 32-bit value—referred to below as a futex word—whose
   address is supplied to the futex()  system  call.   (Futexes  are
   32-bits in size on all platforms, including 64-bit systems.)  All
   futex operations are governed by this value.  In order to share a
   futex  between  processes,  the  futex  is  placed in a region of
   shared memory, created using (for example) mmap(2)  or  shmat(2).
   (Thus the futex word may have different virtual addresses in dif‐
   ferent processes, but these addresses all refer to the same loca‐
   tion in physical memory.)
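
As a sketch of the rewritten paragraph: a futex word that can be shared
between processes is just a 32-bit integer placed in a shared mapping. The
helper name below is hypothetical, and an anonymous shared mapping is only
one option (it is inherited across fork(2)); unrelated processes would map
a shared file or use shmat(2) instead.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* A futex word is 32 bits on all platforms, including 64-bit systems. */
_Static_assert(sizeof(uint32_t) == 4, "futex word must be 32 bits");

/* Allocate one zero-initialized futex word in shared memory.
   Returns NULL on failure. */
static uint32_t *map_shared_futex_word(void)
{
    void *p = mmap(NULL, sizeof(uint32_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : (uint32_t *) p;
}
```

The pointer returned here is what would be passed as uaddr; each process
sharing the mapping may see it at a different virtual address.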

Maybe you still have some further improvements for the paragraph?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Aw: Re: Revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 07:44 +0200, Heinrich Schuchardt wrote:
> Hello David,
> 
> >> A futex is in essence a 32-bit user-space address.
> I know what a 32 bit integer is.
> I am not aware of 32 bit addresses on my 64 bit operating system.

Well I am referring to in the context of a user-space address, such as a
32-bit lock ('int'), but yes, my text is misleading. In fact we
obviously need to cast to the word size for doing gup_fast, among other
tasks.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
On 07/28/2015 04:52 AM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
>> SEE ALSO
>>get_robust_list(2), restart_syscall(2), futex(7)
> 
> For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which
> is a common entry point.

Thanks. Added.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
Hi David,

On 07/28/2015 05:16 AM, Davidlohr Bueso wrote:
> On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote:
>> Hi David,
>>
>> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
>>> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
>>>
>>>>    The condition is represented by the futex word, which is an
>>>>    address in memory supplied to the futex() system call, and the
>>>>    value at this memory location.  (While the virtual addresses
>>>>    for the same memory in separate processes may not be equal,
>>>>    the kernel maps them internally so that the same memory mapped
>>>>    in different locations will correspond for futex() calls.)
>>>>
>>>>    When executing a futex operation that requests to block a
>>>>    thread, the kernel will only block if the futex word has the
>>>>    value that the calling
>>>
>>> Given the use of "word", you should probably state right away that
>>> futexes are only 32bit.
>>
>> So, I made the opening sentence here:
>>
>>    The condition is represented by the futex word, which is an
>>    address in memory supplied to the futex() system call, and the
>>    32-bit value at this memory location.
>>
>> Okay?
>
> I think we can still improve :)
>
> I've re-read the whole first paragraphs, and have a few comments that
> touch upon this specific wording. Let's see. You have:
>
>>    The futex() system call provides a method for waiting until a
>>    certain condition becomes true.  It is typically used as a
>>    blocking construct in the context of shared-memory
>>    synchronization: The program implements the majority of the
>>    synchronization in user space, and uses one of the operations of
>>    the system call when it is likely that it has to block for a
>>    longer time until the condition becomes true.  The program uses
>>    another operation of the system call to wake anyone waiting for
>>    a particular condition.
>
> I've rephrased the next paragraph. How about adding this to follow?
>
>    A futex is in essence a 32-bit user-space address. All futex
>    operations and conditions are governed by this variable, from now
>    on referred to as 'futex word'. As such, a futex is identified by
>    the address in shared memory, which may or may not be shared
>    between different processes. For virtual memory, the kernel will
>    internally handle and resolve the latter. This futex word, along
>    with the value at its memory location, are supplied to the
>    futex() system call.
>
> Feel free to reword however you think is better.


I agree with you that that second paragraph is a bit broken. But, like Heinrich,
I'm confused by this term "32-bit ... address".

I've rewritten the paragraph as:

   A futex is a 32-bit value—referred to below as a futex word—whose
   address is supplied to the futex()  system  call.   (Futexes  are
   32 bits in size on all platforms, including 64-bit systems.)  All
   futex operations are governed by this value.  In order to share a
   futex  between  processes,  the  futex  is  placed in a region of
   shared memory, created using (for example) mmap(2)  or  shmat(2).
   (Thus the futex word may have different virtual addresses in dif‐
   ferent processes, but these addresses all refer to the same loca‐
   tion in physical memory.)
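
To make the shared-memory point concrete, here is a minimal sketch, not taken from the page: a 32-bit futex word placed in a MAP_SHARED mapping and used by a parent and a forked child. The helper names futex_call() and shared_futex_demo() are made up for illustration, and the raw syscall(2) invocation is used because there is no glibc wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: raw futex(2) call via syscall(2). */
static long futex_call(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Returns the final value of the shared futex word (1 on success), or -1. */
static int shared_futex_demo(void)
{
    /* The futex word is 32 bits, also on 64-bit systems. */
    assert(sizeof(uint32_t) == 4);

    /* MAP_SHARED: parent and child refer to the same physical memory. */
    uint32_t *word = mmap(NULL, sizeof(*word), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (word == MAP_FAILED)
        return -1;
    *word = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(word, sizeof(*word));
        return -1;
    }
    if (pid == 0) {
        /* Child: change the value first, then wake one waiter. */
        __atomic_store_n(word, 1, __ATOMIC_SEQ_CST);
        futex_call(word, FUTEX_WAKE, 1);
        _exit(0);
    }

    /* Parent: sleep only while the word still holds the old value 0. */
    while (__atomic_load_n(word, __ATOMIC_SEQ_CST) == 0) {
        if (futex_call(word, FUTEX_WAIT, 0) == -1 &&
            errno != EAGAIN && errno != EINTR)
            break;
    }

    waitpid(pid, NULL, 0);
    int result = (int)__atomic_load_n(word, __ATOMIC_SEQ_CST);
    munmap(word, sizeof(*word));
    return result;
}
```

The process-shared form of the operations is used here (no FUTEX_PRIVATE_FLAG), since the word is shared across processes. The __atomic builtins are a GCC/Clang assumption; C11 atomics would do as well.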

Maybe you still have some further improvements for the paragraph?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 22:45 +0200, Peter Zijlstra wrote:
> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.

Ick, I don't look forward to seeing nice futex plists converted into
rbtrees. As opposed to, eg. rtmutexes, there are a few caveats:

- Dealing with the top_waiter in rtmutexes is always easy, but in
futexes we need to deal with keys, so caching the leftmost won't work as
nicely.

- This will bloat things like futex_wake, where O(logN) is not suited
for FIFO iteration. And iterating linked lists is, in essence, all that
we really do when calling futex(2).

I have to wonder about the extra overhead added by these points.  I do
understand the dl concern, nonetheless.



Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Darren Hart
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:

...

> > FUTEX_REQUEUE (since Linux 2.6.0)
> >  .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> >  .\" in general, or is this comment implicitly speaking about the
> >  .\" condvar (?) use case? If the latter we might want to weaken the
> >  .\" advice below a little.
> >  .\" [Anyone else have input on this?]
>
> The condvar use case exposes the flaw nicely, but that's pretty much
> true for everything which wants a sane requeue operation.

In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken and
should not be used in new code) and someone argued strongly against... stating
that there were legitimate uses for it. Of course I'm struggling to find the
thread and the reference at the moment - immensely useful, I know.

I'll continue trying to find it and see if it can be useful here. I believe
Torvald was on the thread as well.

> > Avoid using this operation.  It is broken for its intended
> > purpose.  Use FUTEX_CMP_REQUEUE instead.
> >
> > This operation performs the same task as
> > FUTEX_CMP_REQUEUE, except that no check is made using the
> > value in val3.  (The argument val3 is ignored.)

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Darren Hart
On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> > On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>
> ...
>
> > > FUTEX_REQUEUE (since Linux 2.6.0)
> > >  .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > >  .\" in general, or is this comment implicitly speaking about the
> > >  .\" condvar (?) use case? If the latter we might want to weaken the
> > >  .\" advice below a little.
> > >  .\" [Anyone else have input on this?]
> >
> > The condvar use case exposes the flaw nicely, but that's pretty much
> > true for everything which wants a sane requeue operation.
>
> In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken
> and should not be used in new code) and someone argued strongly against...
> stating that there were legitimate uses for it. Of course I'm struggling to
> find the thread and the reference at the moment - immensely useful, I know.
>
> I'll continue trying to find it and see if it can be useful here. I believe
> Torvald was on the thread as well.
 

Found it on libc-alpha, here it is for reference:

From: Rich Felker <dal...@libc.org>
Date: Wed, 29 Oct 2014 22:43:17 -0400
To: Darren Hart <dvh...@infradead.org>
Cc: Carlos O'Donell <car...@redhat.com>,
    Roland McGrath <rol...@hack.frob.com>,
    Torvald Riegel <trie...@redhat.com>,
    GLIBC Devel <libc-al...@sourceware.org>,
    Michael Kerrisk <mtk.manpa...@gmail.com>
Subject: Re: Add futex wrapper to glibc?

On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
> > We are IMO at the stage where futex is stable, few things are
> > changing, and with documentation in place, I would consider adding a
> > futex wrapper.
>
> Yes, at least for the defined OP codes. New OPs may be added of
> course, but that isn't a concern for supporting what exists today, and
> doesn't break compatibility.
>
> I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
> broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
> wrapper is one way to encourage developers to do the right thing
> (don't expose the bad op in the header).

You're mistaken here. There are plenty of valid ways to use
FUTEX_REQUEUE - for example if the calling thread is requeuing the
target(s) to a lock that the calling thread owns. Just because it
doesn't meet the needs of the way glibc was using it internally
doesn't mean it's useless for other applications.

In any case, I don't think there's a proposal to intercept/modify the
commands to futex, just to pass them through (and possibly do a
cancellable syscall for some of them).

Rich
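
For readers following the FUTEX_REQUEUE vs. FUTEX_CMP_REQUEUE argument, a rough sketch of the difference, under my own assumptions: the CMP variant hands the kernel an expected value in val3, and the call fails with EAGAIN if the futex word no longer holds it. The wrapper name futex_cmp_requeue() and its parameter names are hypothetical; the requeue limit is passed in the argument slot that other operations use for the timeout.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: wake up to nr_wake waiters on *from and move up
 * to nr_requeue of the remaining waiters to *to, but only if *from
 * still contains `expected`. Returns the kernel's result (-1 with errno
 * set on failure).
 */
static long futex_cmp_requeue(uint32_t *from, uint32_t expected,
                              int nr_wake, int nr_requeue, uint32_t *to)
{
    /* Negative counts are invalid input for this operation. */
    assert(nr_wake >= 0 && nr_requeue >= 0);

    /* nr_requeue travels in the fourth argument (val2). */
    return syscall(SYS_futex, from, FUTEX_CMP_REQUEUE, nr_wake,
                   (long)nr_requeue, to, expected);
}
```

Plain FUTEX_REQUEUE would be the same call without the check against `expected`. Per Rich's example, that may be acceptable when the caller already owns the lock the waiters are being requeued to.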


  
> > > Avoid using this operation.  It is broken for its intended
> > > purpose.  Use FUTEX_CMP_REQUEUE instead.
> > >
> > > This operation performs the same task as
> > > FUTEX_CMP_REQUEUE, except that no check is made using the
> > > value in val3.  (The argument val3 is ignored.)
>
> --
> Darren Hart
> Intel Open Source Technology Center

-- 
Darren Hart
Intel Open Source Technology Center


Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
On 07/28/2015 07:52 PM, Davidlohr Bueso wrote:
> On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
>> Maybe you still have some further improvements for the paragraph?
>
> Nah, this is fine enough. Looks good.

Okay. Thanks. I added a Reviewed-by: for you.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
> Maybe you still have some further improvements for the paragraph?

Nah, this is fine enough. Looks good.

Thanks,
Davidlohr



Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Thomas Gleixner
On Tue, 28 Jul 2015, Peter Zijlstra wrote:

> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>
> > > FUTEX_WAKE (since Linux 2.6.0)
> > >    This operation wakes at most val of the waiters that are
> > >    waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> > >    address uaddr.  Most commonly, val is specified as either
> > >    1 (wake up a single waiter) or INT_MAX (wake up all
> > >    waiters).  No guarantee is provided about which waiters are
> > >    awoken (e.g., a waiter with a higher scheduling priority
> > >    is not guaranteed to be awoken in preference to a waiter
> > >    with a lower priority).
> >
> > That's only correct up to Linux 2.6.21.
> >
> > Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> > threads this takes the nice level into account. Threads with the same
> > priority are woken in FIFO order.
>
> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
> 'wrong'.

Indeed.

> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.
>
> Do we want to go fix that?

I think so.

Thanks,

tglx


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Peter Zijlstra
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:

> > FUTEX_WAKE (since Linux 2.6.0)
> >    This operation wakes at most val of the waiters that are
> >    waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> >    address uaddr.  Most commonly, val is specified as either
> >    1 (wake up a single waiter) or INT_MAX (wake up all
> >    waiters).  No guarantee is provided about which waiters are
> >    awoken (e.g., a waiter with a higher scheduling priority
> >    is not guaranteed to be awoken in preference to a waiter
> >    with a lower priority).
>
> That's only correct up to Linux 2.6.21.
>
> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> threads this takes the nice level into account. Threads with the same
> priority are woken in FIFO order.

Maybe don't mention the effects of SCHED_OTHER, order by nice value is
'wrong'.

Also, this code seems to use plist, which means it won't do the right
thing for SCHED_DEADLINE either.

Do we want to go fix that?


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Thomas Gleixner
On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
> FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
>    This option bit can be employed only with the
>    FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations.
>
>    If this option is set, the kernel treats timeout as an
>    absolute time based on CLOCK_REALTIME.
>
> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
>    If this option is not set, the kernel treats timeout as
>    relative time, measured against the CLOCK_MONOTONIC clock.

That's correct.

> The operation specified in futex_op is one of the following:
>
> FUTEX_WAIT (since Linux 2.6.0)
>    This operation tests that the value at the futex word
>    pointed to by the address uaddr still contains the
>    expected value val, and if so, then sleeps awaiting
>    FUTEX_WAKE on the futex word.  The load of the value of
>    the futex word is an atomic memory access (i.e., using
>    atomic machine instructions of the respective
>    architecture).  This load, the comparison with the expected
>    value, and starting to sleep are performed atomically and
>    totally ordered with respect to other futex operations on
>    the same futex word.  If the thread starts to sleep, it is
>    considered a waiter on this futex word.  If the futex value
>    does not match val, then the call fails immediately with
>    the error EAGAIN.
>
>    The purpose of the comparison with the expected value is
>    to prevent lost wake-ups: If another thread changed the
>    value of the futex word after the calling thread decided
>    to block based on the prior value, and if the other thread
>    executed a FUTEX_WAKE operation (or similar wake-up) after
>    the value change and before this FUTEX_WAIT operation,
>    then the latter will observe the value change and will not
>    start to sleep.
>
>    If the timeout argument is non-NULL, its contents specify
>    a relative timeout for the wait, measured according to the
> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?

Yes.

>    CLOCK_MONOTONIC clock.  (This interval will be rounded up
>    to the system clock granularity, and kernel scheduling
>    delays mean that the blocking interval may overrun by a
>    small amount.)

The given wait time will be rounded up to the system
clock granularity and is guaranteed not to expire
early.

There are a gazillion reasons why it can expire late, but the
guarantee is that it never expires prematurely.
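
A small sketch of how one might observe that guarantee from user space (my own illustration, not part of the page): call FUTEX_WAIT with a relative timeout on a word nobody wakes, and measure the elapsed time on CLOCK_MONOTONIC. The helper names monotonic_ns() and timed_wait() are invented here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds on CLOCK_MONOTONIC, the clock the relative timeout uses. */
static int64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: FUTEX_WAIT on *word (expected to hold `expected`)
 * for at most timeout_ms milliseconds. Stores the measured duration in
 * *elapsed_ns and returns the raw result (-1 with errno set on failure).
 */
static long timed_wait(uint32_t *word, uint32_t expected, long timeout_ms,
                       int64_t *elapsed_ns)
{
    assert(timeout_ms >= 0);
    struct timespec rel = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    int64_t start = monotonic_ns();
    long rc = syscall(SYS_futex, word, FUTEX_WAIT, expected, &rel, NULL, 0);
    int saved_errno = errno;            /* clock_gettime may clobber errno */
    *elapsed_ns = monotonic_ns() - start;
    errno = saved_errno;
    return rc;
}
```

With no waker, the call should fail with ETIMEDOUT, and the measured time should be at least the requested interval; it may well be longer.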

>    If timeout is NULL, the call blocks indefinitely.

Right.

>    The arguments uaddr2 and val3 are ignored.
>
> FUTEX_WAKE (since Linux 2.6.0)
>    This operation wakes at most val of the waiters that are
>    waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>    address uaddr.  Most commonly, val is specified as either
>    1 (wake up a single waiter) or INT_MAX (wake up all
>    waiters).  No guarantee is provided about which waiters are
>    awoken (e.g., a waiter with a higher scheduling priority
>    is not guaranteed to be awoken in preference to a waiter
>    with a lower priority).

That's only correct up to Linux 2.6.21.

Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
threads this takes the nice level into account. Threads with the same
priority are woken in FIFO order.
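
Whatever the wake-up order turns out to be, the caller only controls the count. A minimal sketch of the two common cases mentioned in the quoted text (val of 1 and val of INT_MAX), with wrapper names that are purely illustrative:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: wake at most nr_wake waiters on *uaddr.
 * Returns the number of waiters woken, or -1 with errno set. */
static long futex_wake(uint32_t *uaddr, int nr_wake)
{
    assert(nr_wake >= 1);
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake, NULL, NULL, 0);
}

/* Wake a single waiter; which one is up to the kernel. */
static long wake_one(uint32_t *uaddr)
{
    return futex_wake(uaddr, 1);
}

/* Wake every waiter currently queued on the futex word. */
static long wake_all(uint32_t *uaddr)
{
    return futex_wake(uaddr, INT_MAX);
}
```

When nobody is waiting on the word, both calls simply report zero woken waiters.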
 
>    The arguments timeout, uaddr2, and val3 are ignored.
>
> FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
>    This operation creates a file descriptor that is
>    associated with the futex at uaddr.  The caller must close the
>    returned file descriptor after use.  When another process
>    or thread performs a FUTEX_WAKE on the futex word, the
>    file descriptor indicates as being readable with
>    select(2), poll(2), and epoll(7).
>
>    The file descriptor can be used to obtain asynchronous
>    notifications: if val is nonzero, then when another
>    process or thread executes a FUTEX_WAKE, the caller will
>    receive the signal number that was passed in val.
>
>    The arguments timeout, uaddr2 and val3 are ignored.
>
> .\" FIXME(Torvald) We never define "upped".  Maybe just remove the
> .\"  following sentence?
>    To prevent race conditions, the caller should test if the
>    futex has been upped

Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote:
> Hi David,
> 
> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
> > On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> > 
> >>    The condition is represented by the futex word, which is an
> >>    address in memory supplied to the futex() system call, and the
> >>    value at this memory location.  (While the virtual addresses
> >>    for the same memory in separate processes may not be equal,
> >>    the kernel maps them internally so that the same memory mapped
> >>    in different locations will correspond for futex() calls.)
> >>
> >>    When executing a futex operation that requests to block a
> >>    thread, the kernel will only block if the futex word has the
> >>    value that the calling
> > 
> > Given the use of "word", you should probably state right away that
> > futexes are only 32bit.
> 
> So, I made the opening sentence here:
> 
>The  condition  is  represented  by  the  futex word, which is an
>address in memory supplied to the futex() system  call,  and  the
>32-bit  value  at  this  memory  location. 
> 
> Okay?

I think we can still improve :)

I've re-read the whole first paragraphs, and have a few comments that
touch upon this specific wording. Let's see. You have:

>    The futex() system call provides a method for waiting until a certain
>    condition becomes true.  It is typically used as a blocking construct
>    in the context of shared-memory synchronization: The program implements
>    the majority of the synchronization in user space, and uses one of the
>    operations of the system call when it is likely that it has to block
>    for a longer time until the condition becomes true.  The program uses
>    another operation of the system call to wake anyone waiting for a
>    particular condition.

I've rephrased the next paragraph. How about adding this to follow?

   A futex is in essence a 32-bit user-space address. All futex
   operations and conditions are governed by this variable, from now
   on referred to as 'futex word'. As such, a futex is identified by
   the address in shared memory, which may or may not be shared
   between different processes. For virtual memory, the kernel will
   internally handle and resolve the latter. This futex word, along
   with the value at its memory location, are supplied to the
   futex() system call.

Feel free to reword however you think is better.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> SEE ALSO
>get_robust_list(2), restart_syscall(2), futex(7)

For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which
is a common entry point.



Re: Next round: revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 07/27/2015 04:17 PM, Heinrich Schuchardt wrote:
> instruction. A thread maybe unable
> 
> to << missing word
> 
> acquire a lock because it is
> already acquired by another thread. It then may pass the lock's
> flag as futex word and the value representing the acquired state
> as the expected value to a futex() wait operation.

Thanks, Heinrich. Fixed.

Cheers,

Michael






Next round: revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hello all,

From a draft sent out in March, I got a few useful comments that
I've now incorporated into this draft. And I got some complaints
from people who did not want to read groff source. My point
was that there are a bunch of FIXMEs in the page source that I
wanted people to look at... Anyway, this time, I will take
a different tack, interspersing the FIXMEs in a rendered 
version of the page. I'd greatly appreciate help with those FIXMEs.

The current page source can be found in a branch at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

===

As becomes quickly obvious upon reading it, the current futex(2) 
man page is in a sorry state, lacking many important details, and
also the various additions that have been made to the interface
over the last years. I've been working on revising it, first
of all based on input I got in response to a request for help
last year (http://thread.gmane.org/gmane.linux.kernel/1703405), 
especially taking Thomas Gleixner's input 
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) 
into account. I also got some further offlist input from Darren
Hart, Torvald Riegel, and Davidlohr Bueso that has been
incorporated into the revised draft. Other than that, I got
some useful info out of Ulrich Drepper's paper (cited at the
end of the page) and one or two web pages (cited in the page
source).

The page has now increased in size by a factor of about 5, but
is far from complete. In particular, as I reworked the page, 
there were many details that I was not 100% certain of, and I
have added FIXME markers to the page source. In addition,
Torvald added some text, and a few more FIXMEs. Some of
the FIXMEs are trivial, as in: I'd like confirmation that
I have correctly captured a technical detail. Others are more 
substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be
improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to
resolve as many of the FIXMEs (and any other problems that people
see) as possible in the existing text. I need help with that. 
(And I know that dealing with that help, if I get it, will in
itself be quite a task, which is why I have
been delaying it for many weeks now, as my time has been
rather limited recently.)

So, please take a look at the page below. At this point,
I would most especially appreciate help with the FIXMEs.

Cheers,

Michael



FUTEX(2)                  Linux Programmer's Manual                 FUTEX(2)

NAME
   futex - fast user-space locking

SYNOPSIS
   #include <linux/futex.h>
   #include <sys/time.h>

   int futex(int *uaddr, int futex_op, int val,
             const struct timespec *timeout,   /* or: uint32_t val2 */
             int *uaddr2, int val3);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The  futex()  system  call  provides a method for waiting until a
   certain condition becomes true.  It is typically used as a block‐
   ing  construct  in  the context of shared-memory synchronization:
   The program implements the majority  of  the  synchronization  in
   user  space,  and  uses  one of the operations of the system call
   when it is likely that it has to block for a  longer  time  until
   the  condition  becomes true.  The program uses another operation
   of the system call to wake anyone waiting for a particular condi‐
   tion.

   The  condition  is  represented  by  the  futex word, which is an
   address in memory supplied to the futex() system  call,  and  the
   32-bit  value  at  this  memory  location.   (While  the  virtual
   addresses for the same physical memory address in  separate  pro‐
   cesses  may be different, the same physical address may be shared
   by the processes using mmap(2).)

   When executing a futex operation that requests to block a thread,
   the  kernel  will block only if the futex word has the value that
   the calling thread supplied as expected value.  The load from the
   futex  word,  the  comparison  with  the  expected value, and the
   actual blocking will happen atomically and totally  ordered  with
   respect  to  concurrently  executing futex operations on the same
   futex word.  Thus, the futex word is used to connect the synchro‐
   nization in user space with the implementation of blocking by the
   kernel; similar to an atomic compare-and-exchange operation  that
   potentially  changes  shared  memory,  blocking via a futex is an
   atomic compare-and-block operation.
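
   As an illustration of the compare-and-block idea (a sketch only, not
   text from the page), here is a small three-state lock in the spirit of
   the mutex in Ulrich Drepper's paper. The state names, the helper
   futex_call(), and the sketch_lock()/sketch_unlock() functions are my
   own; the __atomic builtins assume GCC or Clang.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock-word states: a user-space convention, not a kernel policy. */
enum { UNLOCKED = 0, LOCKED = 1, CONTENDED = 2 };

/* Hypothetical helper: raw futex(2) call via syscall(2). */
static long futex_call(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void sketch_lock(uint32_t *word)
{
    uint32_t seen = UNLOCKED;

    /* Fast path: uncontended acquisition stays entirely in user space. */
    if (__atomic_compare_exchange_n(word, &seen, LOCKED, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;

    /* Slow path: mark the lock contended, then compare-and-block. */
    if (seen != CONTENDED)
        seen = __atomic_exchange_n(word, CONTENDED, __ATOMIC_ACQUIRE);
    while (seen != UNLOCKED) {
        /* Sleeps only if *word is still CONTENDED; else fails at once. */
        futex_call(word, FUTEX_WAIT, CONTENDED);
        seen = __atomic_exchange_n(word, CONTENDED, __ATOMIC_ACQUIRE);
    }
}

static void sketch_unlock(uint32_t *word)
{
    uint32_t old = __atomic_exchange_n(word, UNLOCKED, __ATOMIC_RELEASE);

    /* Unlocking a lock that is not held is a caller bug. */
    assert(old != UNLOCKED);

    /* Only enter the kernel if someone may be asleep on the word. */
    if (old == CONTENDED)
        futex_call(word, FUTEX_WAKE, 1);
}
```

   In the uncontended case neither function makes a system call; the
   kernel gets involved only once a second thread has marked the word
   as contended.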

   One example use of futexes is implementing locks.  The  state  of
   the  lock  (i.e., acquired or not acquired) can be represented as
   an atomically accessed flag in shared memory.  In the uncontended
   case,  a  thread  can access or modify 

Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hi Peter,

On 03/28/2015 01:03 PM, Peter Zijlstra wrote:
> On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote:
>>FUTEX_WAIT (since Linux 2.6.0)
>>   This operation tests that the value at the futex word pointed to
>>   by the address uaddr still contains the expected value val, and
>>   if so, then sleeps awaiting FUTEX_WAKE on the futex word.  The
>>   load of the value of the futex word is an atomic memory access
>>   (i.e., using atomic machine instructions of the respective
>>   architecture).  This load, the comparison with the expected
>>   value, and starting to sleep are performed atomically and
>>   totally ordered with respect to other futex operations on the
>>   same futex word.  If the thread starts to sleep, it is
>>   considered a waiter on this futex word.  If the futex value does
>>   not match val, then the call fails immediately with the error
>>   EAGAIN.
>>
>>   The purpose of the comparison with the expected value is to
>>   prevent lost wake-ups: If another thread changed the value of
>>   the futex word after the calling thread decided to block based
>>   on the prior value, and if the other thread executed a
>>   FUTEX_WAKE operation (or similar wake-up) after the value change
>>   and before this FUTEX_WAIT operation, then the latter will
>>   observe the value change and will not start to sleep.
>>
>>   If the timeout argument is non-NULL, its contents specify a
>>   relative timeout for the wait, measured according to the
>>   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
>>   system clock granularity, and kernel scheduling delays mean that
>>   the blocking interval may overrun by a small amount.)  If
>>   timeout is NULL, the call blocks indefinitely.
> 
> Would it not be better to only state that the wait will not return
> before the timeout -- unless woken -- and not bother with clock
> granularity and scheduling delays?

Many of the pages that talk about system calls that have timeouts
carry similar language, since people are often confused about what the
timeout means (e.g., believing that it's an upper limit, not a minimum; or
that it's precise to some very small granularity). Why do you think the
language here is a problem?

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 04/15/2015 12:28 PM, Torvald Riegel wrote:
> On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote:
>> On Sat, 28 Mar 2015, Peter Zijlstra wrote:
>>> On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
>>>> So, please take a look at the page below. At this point,
>>>> I would most especially appreciate help with the FIXMEs.
>>>
>>> For people who cannot read that troff gibberish (me)..
>>
>> Ditto :)
>>  
>>> NOTES
>>>    Glibc does not provide a wrapper for this system call; call it
>>>    using syscall(2).
>>
>> You might mention that pthread_mutex, pthread_condvar interfaces are
>> high level wrappers for the syscall and recommended to be used for
>> normal use cases. IIRC unnamed semaphores are implemented with futexes
>> as well.
> 
> If we add this, I'd rephrase it to something like that there are
> high-level programming abstractions such as the pthread_condvar
> interfaces or semaphores that are implemented using the syscall and that
> are typically a better fit for normal use cases.  I'd consider only the
> condvars as something like a wrapper, or targeting a similar use case.

I added this under NOTES:

   Various higher-level programming abstractions are implemented via
   futexes, including POSIX threads mutexes and condition variables,
   as well as POSIX semaphores.

Cheers,

Michael




Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hello Pavel,

On 04/27/2015 10:37 PM, Pavel Machek wrote:
> Hi!
> 
>>   The FUTEX_WAKE_OP operation is equivalent to execute the follow‐
>>   ing code atomically and totally ordered with respect to other
>>   futex operations on any of the two supplied futex words:
> 
> "to executing"?

Yep. Fixed.

>>   The operation and comparison that are to be performed are
>>   encoded in the bits of the argument val3.  Pictorially, the
>>   encoding is:
>>
>>   +---+---+-----------+-----------+
>>   |op |cmp|   oparg   |  cmparg   |
>>   +---+---+-----------+-----------+
>>     4   4       12          12    <== # of bits
>>
> 
> :-)
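
For readers who prefer code to pictures, the layout quoted above can be sketched as a small packing function. This is only an illustration under the assumption that op occupies the most significant four bits, as the left-to-right order of the diagram suggests; the name encode_val3() is hypothetical. The kernel header linux/futex.h provides a FUTEX_OP() macro for the same purpose, which is what real code would normally use.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the four fields into a 32-bit val3 following the diagram:
   op in bits 31..28, cmp in bits 27..24, oparg in bits 23..12,
   cmparg in bits 11..0.  (Hypothetical helper, for illustration.) */
static uint32_t encode_val3(uint32_t op, uint32_t cmp,
                            uint32_t oparg, uint32_t cmparg)
{
    assert(op <= 0xf && cmp <= 0xf);            /* 4-bit fields  */
    assert(oparg <= 0xfff && cmparg <= 0xfff);  /* 12-bit fields */

    return (op << 28) | (cmp << 24) | (oparg << 12) | cmparg;
}
```

For instance, op = 1, cmp = 4, oparg = 1, cmparg = 0 packs to 0x14001000.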
> 
>> RETURN VALUE
>>    In the event of an error, all operations return -1 and set errno to
>>    indicate the cause of the error.  The return value on success depends
>>    on the operation, as described in the following list:
> 
> Did you say (at the beginning) that there is no glibc wrapper?

Yes, this could be clearer. I changed it to

RETURN VALUE
   In the event of an error (and assuming that futex()  was  invoked
   via  syscall(2)), all operations return -1 and set errno to indi‐
   cate the cause of the error.
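
As an illustration of the "-1 and errno" convention when going through syscall(2), here is a minimal sketch assuming Linux. The wrapper name futex_call() is hypothetical (there is no glibc wrapper), and the argument types are chosen for this example rather than taken from the SYNOPSIS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical thin wrapper: forwards its arguments to the futex
   system call.  On error, syscall(2) returns -1 and sets errno. */
static long futex_call(uint32_t *uaddr, int futex_op, uint32_t val,
                       const struct timespec *timeout,
                       uint32_t *uaddr2, uint32_t val3)
{
    assert(uaddr != NULL);
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}
```

For example, a FUTEX_WAIT whose expected value does not match the futex word fails immediately: futex_call(&word, FUTEX_WAIT, 0, NULL, NULL, 0) with word == 1 returns -1 with errno set to EAGAIN.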

>>    EINVAL The operation in futex_op is one of those that employs a time‐
>>           out, but the supplied timeout argument was invalid (tv_sec was
>>           less than zero, or tv_nsec was not less than 1000,000,000).
> 
> 1,000...?

Fixed.
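
The validity rule in that EINVAL text can be written down as a predicate. This is a sketch, not the kernel's code; the name timeout_is_valid() is hypothetical, and the extra check that tv_nsec is not negative is an assumption added here, since the quoted text mentions only the upper bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Returns true if *ts would pass the check described in the EINVAL
   entry: tv_sec not negative and tv_nsec below 1,000,000,000.
   Rejecting a negative tv_nsec is an assumption of this sketch. */
static bool timeout_is_valid(const struct timespec *ts)
{
    assert(ts != NULL);
    return ts->tv_sec >= 0
        && ts->tv_nsec >= 0
        && ts->tv_nsec < 1000000000L;
}
```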

Thanks for the comments!

Cheers,

Michael




Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hi David,

On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> 
>>    The condition is represented by the futex word, which is an address in
>>    memory supplied to the futex() system call, and the value at this mem‐
>>    ory location.  (While the virtual addresses for the same memory in sep‐
>>    arate processes may not be equal, the kernel maps them internally so
>>    that the same memory mapped in different locations will correspond for
>>    futex() calls.)
>>
>>    When executing a futex operation that requests to block a thread, the
>>    kernel will only block if the futex word has the value that the calling
> 
> Given the use of "word", you should probably state right away that
> futexes are only 32bit.

So, I made the opening sentence here:

   The  condition  is  represented  by  the  futex word, which is an
   address in memory supplied to the futex() system  call,  and  the
   32-bit  value  at  this  memory  location. 

Okay?

Cheers,

Michael



Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 03/31/2015 03:48 AM, Rusty Russell wrote:
> "Michael Kerrisk (man-pages)"  writes:
>> When executing a futex operation that requests to block a thread,
>> the kernel will only block if the futex word has the value that the
>> calling thread supplied as expected value.
>> The load from the futex word, the comparison with
>> the expected value,
>> and the actual blocking will happen atomically and totally
>> ordered with respect to concurrently executing futex operations
>> on the same futex word,
>> such as operations that wake threads blocked on this futex word.
>> Thus, the futex word is used to connect the synchronization in user spac
> 
> Missing 'e' in "space".

Already fixed.

>> .\" FIXME Please confirm that the following is correct:
>> No guarantee is provided about which waiters are awoken
>> (e.g., a waiter with a higher scheduling priority is not guaranteed
>> to be awoken in preference to a waiter with a lower priority).
> 
> This is true.

Thanks! FIXME removed.

Cheers,

Michael



> I didn't read the rest, as that stuff was all written by others.
> Documenting them is pretty heroic; good job!
> 
> Thanks,
> Rusty.
> 




Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hi David,

On 03/31/2015 10:36 PM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 13:03 +0100, Peter Zijlstra wrote:
>>>   If the timeout argument is non-NULL, its contents specify a rel‐
>>>   ative timeout for the wait, measured according to the
>>>   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
>>>   system clock granularity, and kernel scheduling delays mean that
>>>   the blocking interval may overrun by a small amount.)  If time‐
>>>   out is NULL, the call blocks indefinitely.
>>
>> Would it not be better to only state that the wait will not return
>> before the timeout -- unless woken -- and not bother with clock
>> granularity and scheduling delays?
> 
> Yeah, similarly we also have this:
> 
>  FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
>   This option bit can be employed with all futex  operations.   It
>   tells  the  kernel  that  the  futex  is process-private and not
>   shared with another process (i.e., it is  only  being  used  for
>   synchronization  between  threads  of  the  same process).  This
>   allows the kernel to choose the fast  path  for  validating  the
>   user-space address and avoids expensive VMA lookups, taking ref‐
>   erence counts on file backing store, and so on.
> 
> This to me reads a bit too much into the kernel (fastpath, refcnt,
> vmas). Why not just mention that it avoids overhead in the kernel or
> something? I don't recall any manpage mentioning such details, but I
> could be wrong. 

Thanks. Agreed. I changed this to

This allows the kernel to make some additional performance optimizations.
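
From the caller's side, using the flag is just a matter of ORing it into the operation. A minimal sketch, assuming Linux and the constants from linux/futex.h; the helper name wake_private() is hypothetical:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: wake up to nwaiters threads of the calling
   process that are blocked on *word.  FUTEX_PRIVATE_FLAG tells the
   kernel that the futex is not shared with any other process. */
static long wake_private(uint32_t *word, int nwaiters)
{
    assert(word != NULL && nwaiters >= 0);
    return syscall(SYS_futex, word, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                   nwaiters, NULL, NULL, 0);
}
```

The header also defines combined constants such as FUTEX_WAKE_PRIVATE, which equals FUTEX_WAKE | FUTEX_PRIVATE_FLAG. The flag must not be used for a futex word that lives in memory shared between processes.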

> In any case its a nit, the whole doc is pretty good and
> I hope you can merge it soon and then just increment ;)

I ran out of time and energy at a certain point. And also got a little
disheartened that I got more people complaining about groff markup
than actually looked at the FIXMEs in the page source :-).
I'll try to reboot the process.

Cheers,

Michael





Next round: revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hello all,

From a draft sent out in March, I got a few useful comments that
I've now incorporated into this draft. And I got some complaints
from people who did not want to read groff source. My point
was that there are a bunch of FIXMEs in the page source that I
wanted people to look at... Anyway, this time, I will take
a different tack, interspersing the FIXMEs in a rendered 
version of the page. I'd greatly appreciate help with those FIXMEs.

The current page source can be found in a branch at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

===

As becomes quickly obvious upon reading it, the current futex(2) 
man page is in a sorry state, lacking many important details, and
also the various additions that have been made to the interface
over the last years. I've been working on revising it, first
of all based on input I got in response to a request for help
last year (http://thread.gmane.org/gmane.linux.kernel/1703405), 
especially taking Thomas Gleixner's input 
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) 
into account. I also got some further offlist input from Darren
Hart, Torvald Riegel, and Davidlohr Bueso that has been
incorporated into the revised draft. Other than that, I got
some useful info out of Ulrich Drepper's paper (cited at the
end of the page) and one or two web pages (cited in the page
source).

The page has now increased in size by a factor of about 5, but
is far from complete. In particular, as I reworked the page, 
there were many details that I was not 100% certain of, and I
have added FIXME markers to the page source. In addition,
Torvald added some text, and a few more FIXMEs. Some of
the FIXMEs are trivial, as in: I'd like confirmation that
I have correctly captured a technical detail. Others are more 
substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be
improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to
resolve as many of the FIXMEs (and any other problems that people
see) as possible in the existing text. I need help with that.
(And I know that dealing with that help, if I get it, will in
itself be quite a task, which is why I have been delaying it
for many weeks now, as my time has been rather limited recently.)

So, please take a look at the page below. At this point,
I would most especially appreciate help with the FIXMEs.

Cheers,

Michael



FUTEX(2)                 Linux Programmer's Manual                FUTEX(2)

NAME
   futex - fast user-space locking

SYNOPSIS
   #include <linux/futex.h>
   #include <sys/time.h>

   int futex(int *uaddr, int futex_op, int val,
             const struct timespec *timeout,   /* or: uint32_t val2 */
             int *uaddr2, int val3);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The  futex()  system  call  provides a method for waiting until a
   certain condition becomes true.  It is typically used as a block‐
   ing  construct  in  the context of shared-memory synchronization:
   The program implements the majority  of  the  synchronization  in
   user  space,  and  uses  one of the operations of the system call
   when it is likely that it has to block for a  longer  time  until
   the  condition  becomes true.  The program uses another operation
   of the system call to wake anyone waiting for a particular condi‐
   tion.

   The  condition  is  represented  by  the  futex word, which is an
   address in memory supplied to the futex() system  call,  and  the
   32-bit  value  at  this  memory  location.   (While  the  virtual
   addresses for the same physical memory address in  separate  pro‐
   cesses  may be different, the same physical address may be shared
   by the processes using mmap(2).)
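
   The following sketch (not part of the page text) illustrates a futex word shared between two processes. It assumes Linux, an anonymous shared mapping inherited across fork(2), and the GCC/Clang __atomic builtins; the function name shared_handoff() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child stores 1 into a shared futex word and wakes the parent;
   the parent blocks until the word is nonzero.  Returns the value the
   parent observed (expected: 1), or -1 if setup failed. */
static int shared_handoff(void)
{
    uint32_t *word = mmap(NULL, sizeof(*word), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (word == MAP_FAILED)
        return -1;
    assert(*word == 0);     /* anonymous mappings start zero-filled */

    pid_t pid = fork();
    if (pid == -1) {
        munmap(word, sizeof(*word));
        return -1;
    }
    if (pid == 0) {         /* child: change the condition, then wake */
        __atomic_store_n(word, 1, __ATOMIC_SEQ_CST);
        syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
        _exit(0);
    }

    /* parent: sleep only while the word still holds 0.  No private
       flag is used, because the word is shared between processes. */
    while (__atomic_load_n(word, __ATOMIC_SEQ_CST) == 0)
        syscall(SYS_futex, word, FUTEX_WAIT, 0, NULL, NULL, 0);

    int seen = (int)__atomic_load_n(word, __ATOMIC_SEQ_CST);
    waitpid(pid, NULL, 0);
    munmap(word, sizeof(*word));
    return seen;
}
```

   If the child runs first, the parent either skips the loop or its FUTEX_WAIT fails with EAGAIN because the word no longer holds 0, so the wake-up cannot be lost.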

   When executing a futex operation that requests to block a thread,
   the  kernel  will block only if the futex word has the value that
   the calling thread supplied as expected value.  The load from the
   futex  word,  the  comparison  with  the  expected value, and the
   actual blocking will happen atomically and totally  ordered  with
   respect  to  concurrently  executing futex operations on the same
   futex word.  Thus, the futex word is used to connect the synchro‐
   nization in user space with the implementation of blocking by the
   kernel; similar to an atomic compare-and-exchange operation  that
   potentially  changes  shared  memory,  blocking via a futex is an
   atomic compare-and-block operation.
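
   As a sketch of how a caller might rely on this compare-and-block behavior (again not part of the page text), the loop below blocks until the futex word stops holding a given value. It assumes Linux and the GCC/Clang __atomic builtins; the name wait_while_equal() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until *word no longer holds `expected`.  Returns 0 once the
   value has changed, or -1 (with errno set) on an unexpected error. */
static int wait_while_equal(uint32_t *word, uint32_t expected)
{
    assert(word != NULL);

    while (__atomic_load_n(word, __ATOMIC_ACQUIRE) == expected) {
        /* The kernel rechecks *word == expected before sleeping.  If
           the value changed in the meantime, the call fails with
           EAGAIN instead of blocking, and the loop re-reads the word. */
        long rc = syscall(SYS_futex, word, FUTEX_WAIT, expected,
                          NULL, NULL, 0);
        if (rc == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }
    return 0;
}
```

   The user-space load is only an optimization and a guard against spurious returns; the decision to sleep is made by the kernel's own comparison.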

   One example use of futexes is implementing locks.  The  state  of
   the  lock  (i.e., acquired or not acquired) can be represented as
   an atomically accessed flag in shared memory.  In the uncontended
   case,  a  thread  


Re: Next round: revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 07/27/2015 04:17 PM, Heinrich Schuchardt wrote:
>> instruction. A thread maybe unable
>
> to  missing word
>
>> acquire a lock because it is
>> already acquired by another thread. It then may pass the lock's
>> flag as futex word and the value representing the acquired state
>> as the expected value to a futex() wait operation.

Thanks, Heinrich. Fixed.

Cheers,

Michael






Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> SEE ALSO
>    get_robust_list(2), restart_syscall(2), futex(7)

For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which
is a common entry point.



Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote:
> Hi David,
> 
> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
> > On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> > 
> >>    The condition is represented by the futex word, which is an address in
> >>    memory supplied to the futex() system call, and the value at this mem‐
> >>    ory location.  (While the virtual addresses for the same memory in sep‐
> >>    arate processes may not be equal, the kernel maps them internally so
> >>    that the same memory mapped in different locations will correspond for
> >>    futex() calls.)
> >>
> >>    When executing a futex operation that requests to block a thread, the
> >>    kernel will only block if the futex word has the value that the calling
> > 
> > Given the use of "word", you should probably state right away that
> > futexes are only 32bit.
> 
> So, I made the opening sentence here:
> 
>    The  condition  is  represented  by  the  futex word, which is an
>    address in memory supplied to the futex() system  call,  and  the
>    32-bit  value  at  this  memory  location. 
> 
> Okay?

I think we can still improve :)

I've re-read the first few paragraphs, and have a few comments that
touch upon this specific wording. Let's see. You have:

The  futex()  system call provides a method for waiting until a certain
condition becomes true.  It is typically used as a  blocking  construct
in the context of shared-memory synchronization: The program implements
the majority of the synchronization in user  space,  and  uses  one  of
operations  of  the  system call when it is likely that it has to block
for a longer time until the condition becomes true.  The  program  uses
another  operation of the system call to wake anyone waiting for a par‐
ticular condition.

I've rephrased the next paragraph. How about adding this to follow?

   A futex is in essence a 32-bit user-space address. All futex operations and
   conditions are governed by this variable, from now on referred to as 'futex
   word'. As such, a futex is identified by the address in shared memory, which
   may or may not be shared between different processes. For virtual memory, the
   kernel will internally handle and resolve the latter. This futex word, along
   with the value at its memory location, is supplied to the futex() system
   call.

Feel free to reword however you think is better.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-04-27 Thread Pavel Machek
Hi!

>   The FUTEX_WAKE_OP operation is equivalent to execute the follow‐
>   ing  code  atomically  and totally ordered with respect to other
>   futex operations on any of the two supplied futex words:

"to executing"?

>   The  operation  and  comparison  that  are  to  be performed are
>   encoded in the bits of  the  argument  val3.   Pictorially,  the
>   encoding is:
> 
>   +---+---+-----------+-----------+
>   |op |cmp|   oparg   |  cmparg   |
>   +---+---+-----------+-----------+
>     4   4       12          12    <== # of bits
> 

:-)

> RETURN VALUE
>In the event of an error, all operations return -1  and  set  errno  to
>indicate  the  cause of the error.  The return value on success depends
>on the operation, as described in the following list:

Did you say (at the beginning) that there is no glibc wrapper?

>EINVAL The operation in futex_op is one of those that employs a time‐
>   out,  but  the supplied timeout argument was invalid (tv_sec was
>   less than zero, or tv_nsec was not less than 1000,000,000).

1,000...?

> NOTES
>Glibc does not provide a wrapper for this system call;  call  it  using
>syscall(2).

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: Revised futex(2) man page for review

2015-04-15 Thread Torvald Riegel
On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote:
> On Sat, 28 Mar 2015, Peter Zijlstra wrote:
> > On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> > > So, please take a look at the page below. At this point,
> > > I would most especially appreciate help with the FIXMEs.
> > 
> > For people who cannot read that troff gibberish (me)..
> 
> Ditto :)
>  
> > NOTES
> >Glibc does not provide a wrapper for this system call;  call  it  using
> >syscall(2).
> 
> You might mention that pthread_mutex, pthread_condvar interfaces are
> high level wrappers for the syscall and recommended to be used for
> normal use cases. IIRC unnamed semaphores are implemented with futexes
> as well.

If we add this, I'd rephrase it to something like that there are
high-level programming abstractions such as the pthread_condvar
interfaces or semaphores that are implemented using the syscall and that
are typically a better fit for normal use cases.  I'd consider only the
condvars as something like a wrapper, or targeting a similar use case.
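
For readers who do need the raw system call, a minimal sketch of going through syscall(2) might look like the following. The helper names futex() and futex_wake() are assumptions made for illustration; they are not provided by glibc or defined by the man page.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE */
#include <stddef.h>        /* NULL */
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <time.h>          /* struct timespec */
#include <unistd.h>        /* syscall() */

/* Hypothetical thin wrapper: glibc exports no futex() function, so the
 * call goes through syscall(2) with the same six arguments as the
 * prototype in the SYNOPSIS. */
static long futex(uint32_t *uaddr, int futex_op, uint32_t val,
                  const struct timespec *timeout,
                  uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Hypothetical convenience helper: wake at most nr_wake waiters.
 * Returns the number of waiters woken, or -1 with errno set. */
static long futex_wake(uint32_t *uaddr, int nr_wake)
{
    return futex(uaddr, FUTEX_WAKE, (uint32_t)nr_wake, NULL, NULL, 0);
}
```

With no thread blocked on the word, futex_wake() simply reports that zero waiters were woken. Ordinary code is usually better served by the pthread or semaphore interfaces mentioned above.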



Re: Revised futex(2) man page for review

2015-04-14 Thread Thomas Gleixner
On Sat, 28 Mar 2015, Peter Zijlstra wrote:
> On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> > So, please take a look at the page below. At this point,
> > I would most especially appreciate help with the FIXMEs.
> 
> For people who cannot read that troff gibberish (me)..

Ditto :)
 
> NOTES
>Glibc does not provide a wrapper for this system call;  call  it  using
>syscall(2).

You might mention that pthread_mutex, pthread_condvar interfaces are
high level wrappers for the syscall and recommended to be used for
normal use cases. IIRC unnamed semaphores are implemented with futexes
as well.

Thanks,

tglx


Re: Revised futex(2) man page for review

2015-03-31 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 13:03 +0100, Peter Zijlstra wrote:
> >   If the timeout argument is non-NULL, its contents specify a rel‐
> >   ative   timeout   for   the  wait,  measured  according  to  the
> >   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
> >   system clock granularity, and kernel scheduling delays mean that
> >   the blocking interval may overrun by a small amount.)  If  time‐
> >   out is NULL, the call blocks indefinitely.
> 
> Would it not be better to only state that the wait will not return
> before the timeout -- unless woken -- and not bother with clock
> granularity and scheduling delays?

Yeah, similarly we also have this:

 FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
  This option bit can be employed with all futex  operations.   It
  tells  the  kernel  that  the  futex  is process-private and not
  shared with another process (i.e., it is  only  being  used  for
  synchronization  between  threads  of  the  same process).  This
  allows the kernel to choose the fast  path  for  validating  the
  user-space address and avoids expensive VMA lookups, taking ref‐
  erence counts on file backing store, and so on.

This to me reads a bit too much into the kernel (fastpath, refcnt,
vmas). Why not just mention that it avoids overhead in the kernel or
something? I don't recall any manpage mentioning such details, but I
could be wrong. In any case it's a nit, the whole doc is pretty good and
I hope you can merge it soon and then just increment ;)
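
From the caller's side the flag is just a bit OR-ed into the operation. A small sketch, assuming a futex word that is only ever used by threads of one process; the helper name futex_wake_private() is made up for illustration:

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* FUTEX_WAKE, FUTEX_PRIVATE_FLAG */
#include <stddef.h>        /* NULL */
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: wake up to nr_wake waiters on a futex word that
 * is not shared with any other process.  The only difference from a
 * plain FUTEX_WAKE is the FUTEX_PRIVATE_FLAG bit in the operation. */
static long futex_wake_private(uint32_t *uaddr, int nr_wake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                   nr_wake, NULL, NULL, 0);
}
```

The kernel header also provides combined constants such as FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE, so the OR does not have to be spelled out at each call site.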

Thanks,
Davidlohr




Re: Revised futex(2) man page for review

2015-03-31 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:

>The condition is represented by the futex word, which is an address  in
>memory  supplied to the futex() system call, and the value at this mem‐
>ory location.  (While the virtual addresses for the same memory in sep‐
>arate  processes  may  not be equal, the kernel maps them internally so
>that the same memory mapped in different locations will correspond  for
>futex() calls.)
> 
>When  executing  a futex operation that requests to block a thread, the
>kernel will only block if the futex word has the value that the calling

Given the use of "word", you should probably state right away that
futexes are only 32bit.
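
To make the size point concrete, here is a small sketch of how a program might pin it down at compile time. The struct layout and the helper futex_word_is_aligned() are assumptions for illustration only; the four-byte size and alignment requirement is what the draft's Arguments section states.

```c
#include <stdalign.h>
#include <stdint.h>

/* A futex word is always 32 bits wide, even on 64-bit platforms. */
_Static_assert(sizeof(uint32_t) == 4, "a futex word is four bytes");

/* Hypothetical shared state: the futex word sits next to wider data,
 * with its four-byte alignment spelled out explicitly. */
struct shared_state {
    uint64_t generation;
    alignas(4) uint32_t futex_word;
};

/* Hypothetical check that an address is usable as a futex word,
 * that is, aligned on a four-byte boundary. */
static int futex_word_is_aligned(const void *addr)
{
    return ((uintptr_t)addr % 4) == 0;
}
```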

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-03-30 Thread Rusty Russell
"Michael Kerrisk (man-pages)"  writes:
> When executing a futex operation that requests to block a thread,
> the kernel will only block if the futex word has the value that the
> calling thread supplied as expected value.
> The load from the futex word, the comparison with
> the expected value,
> and the actual blocking will happen atomically and totally
> ordered with respect to concurrently executing futex operations
> on the same futex word,
> such as operations that wake threads blocked on this futex word.
> Thus, the futex word is used to connect the synchronization in user spac

Missing 'e' in "space".

> .\" FIXME Please confirm that the following is correct:
> No guarantee is provided about which waiters are awoken
> (e.g., a waiter with a higher scheduling priority is not guaranteed
> to be awoken in preference to a waiter with a lower priority).

This is true.

I didn't read the rest, as that stuff was all written by others.
Documenting them is pretty heroic; good job!

Thanks,
Rusty.


Re: Revised futex(2) man page for review

2015-03-28 Thread Peter Zijlstra
On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote:
>FUTEX_WAIT (since Linux 2.6.0)
>   This operation tests that the value at the futex word pointed to
>   by the address uaddr still contains the expected value val,  and
>   if  so,  then sleeps awaiting FUTEX_WAKE on the futex word.  The
>   load of the value of the futex word is an atomic  memory  access
>   (i.e.,  using  atomic  machine  instructions  of  the respective
>   architecture).  This load,  the  comparison  with  the  expected
>   value,  and  starting  to  sleep  are  performed  atomically and
>   totally ordered with respect to other futex  operations  on  the
>   same  futex  word.  If the thread starts to sleep, it is consid‐
>   ered a waiter on this futex word.  If the futex value  does  not
>   match  val,  then  the  call  fails  immediately  with the error
>   EAGAIN.
> 
>   The purpose of the comparison with the expected value is to pre‐
>   vent  lost  wake-ups: If another thread changed the value of the
>   futex word after the calling thread decided to  block  based  on
>   the  prior  value, and if the other thread executed a FUTEX_WAKE
>   operation (or similar wake-up) after the value change and before
>   this  FUTEX_WAIT  operation,  then  the  latter will observe the
>   value change and will not start to sleep.
> 
>   If the timeout argument is non-NULL, its contents specify a rel‐
>   ative   timeout   for   the  wait,  measured  according  to  the
>   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
>   system clock granularity, and kernel scheduling delays mean that
>   the blocking interval may overrun by a small amount.)  If  time‐
>   out is NULL, the call blocks indefinitely.

Would it not be better to only state that the wait will not return
before the timeout -- unless woken -- and not bother with clock
granularity and scheduling delays?
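
A rough sketch of the "does not return before the timeout unless woken" behaviour, assuming nobody wakes the word and no signal arrives. The helper names futex_wait_ms() and timed_wait_elapsed_ns() are hypothetical; the elapsed time is measured with the same CLOCK_MONOTONIC clock that the text says the timeout uses.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>   /* FUTEX_WAIT */
#include <stddef.h>        /* NULL */
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <time.h>          /* clock_gettime(), struct timespec */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: block while *uaddr == expected, for at most
 * rel_ms milliseconds (a relative timeout).  Returns 0 if woken,
 * otherwise the errno value of the failed call. */
static int futex_wait_ms(uint32_t *uaddr, uint32_t expected, long rel_ms)
{
    struct timespec ts = {
        .tv_sec  = rel_ms / 1000,
        .tv_nsec = (rel_ms % 1000) * 1000000L,
    };
    if (syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &ts, NULL, 0) == -1)
        return errno;
    return 0;
}

/* Hypothetical measurement: wait on a word that nobody wakes and
 * return the elapsed nanoseconds, or -1 if the wait ended for any
 * reason other than the timeout. */
static long long timed_wait_elapsed_ns(long rel_ms)
{
    uint32_t word = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int err = futex_wait_ms(&word, 0, rel_ms);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (err != ETIMEDOUT)
        return -1;
    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

The measured interval should come out at or above the requested one; by how much it overruns depends on the system and is not something the caller can rely on.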


Re: Revised futex(2) man page for review

2015-03-28 Thread Peter Zijlstra
On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

For people who cannot read that troff gibberish (me)..

---
FUTEX(2)   Linux Programmer's Manual  FUTEX(2)




NAME
   futex - fast user-space locking

SYNOPSIS
   #include <linux/futex.h>
   #include <sys/time.h>

   int futex(int *uaddr, int futex_op, int val,
 const struct timespec *timeout,   /* or: u32 val2 */
 int *uaddr2, int val3);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The  futex()  system call provides a method for waiting until a certain
   condition becomes true.  It is typically used as a  blocking  construct
   in the context of shared-memory synchronization: The program implements
   the majority of the synchronization in user  space,  and  uses  one  of
   the operations of the system call when it is likely that it has to block
   for a longer time until the condition becomes true.  The  program  uses
   another  operation of the system call to wake anyone waiting for a par‐
   ticular condition.

   The condition is represented by the futex word, which is an address  in
   memory  supplied to the futex() system call, and the value at this mem‐
   ory location.  (While the virtual addresses for the same memory in sep‐
   arate  processes  may  not be equal, the kernel maps them internally so
   that the same memory mapped in different locations will correspond  for
   futex() calls.)

   When  executing  a futex operation that requests to block a thread, the
   kernel will only block if the futex word has the value that the calling
   thread  supplied  as expected value.  The load from the futex word, the
   comparison with the expected value, and the actual blocking will happen
   atomically  and  totally ordered with respect to concurrently executing
   futex operations on the same futex word, such as operations  that  wake
   threads  blocked  on  this futex word.  Thus, the futex word is used to
   connect the synchronization in user spac  with  the  implementation  of
   blocking by the kernel; similar to an atomic compare-and-exchange oper‐
   ation that potentially changes shared memory, blocking via a  futex  is
   an atomic compare-and-block operation.  See NOTES for a detailed speci‐
   fication of the synchronization semantics.
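
   The compare-and-block behaviour described above can be observed with a short sketch. The helper name futex_wait_if_equal() is an assumption for illustration; it passes a NULL timeout, so it would block indefinitely if the values matched.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>   /* FUTEX_WAIT */
#include <stddef.h>        /* NULL */
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: ask the kernel to block the caller only if
 * *uaddr still holds 'expected'.  The load, the comparison and the
 * blocking are done atomically by the kernel.  Returns 0 if the
 * thread slept and was woken, otherwise the errno value. */
static int futex_wait_if_equal(uint32_t *uaddr, uint32_t expected)
{
    if (syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0) == -1)
        return errno;
    return 0;
}
```

   When the word no longer holds the expected value, the call does not block at all and fails with EAGAIN, which is how a waiter learns that the condition changed before it went to sleep.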

   One example use of futexes is implementing locks.   The  state  of  the
   lock  (i.e.,  acquired or not acquired) can be represented as an atomi‐
   cally accessed flag in shared  memory.   In  the  uncontended  case,  a
   thread  can  access  or modify the lock state with atomic instructions,
   for example atomically changing it from not acquired to acquired  using
   an atomic compare-and-exchange instruction.  If a thread cannot acquire
   a lock because it is already acquired by another thread, it can request
   to block if and only if the lock is still acquired by using the  lock's
   flag as futex word and expecting a value that represents  the  acquired
   state.   When  releasing the lock, a thread has to first reset the lock
   state to not acquired and then execute the futex operation  that  wakes
   one  thread blocked on the futex word that is the lock's flag (this can
   be further optimized to avoid unnecessary  wake-ups).   See  futex(7)
   for more detail on how to use futexes.
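
   The lock described in the paragraph above could be sketched roughly as follows. This is the simple, unoptimized variant (every release issues a wake-up), not what glibc does. The names lock_try_acquire(), lock_acquire() and lock_release(), the 0/1 encoding of the flag, and the use of the GCC/Clang __atomic builtins are all assumptions for illustration.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE */
#include <stddef.h>        /* NULL */
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/* Assumed encoding of the lock flag: 0 = not acquired, 1 = acquired. */

/* Uncontended path: one atomic compare-and-exchange, no system call.
 * Returns nonzero if the lock was taken. */
static int lock_try_acquire(uint32_t *flag)
{
    uint32_t expected = 0;
    return __atomic_compare_exchange_n(flag, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static void lock_acquire(uint32_t *flag)
{
    while (!lock_try_acquire(flag)) {
        /* Block only if the flag still says "acquired".  If the owner
         * released the lock in the meantime, the call fails with
         * EAGAIN and the loop simply retries. */
        syscall(SYS_futex, flag, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}

static void lock_release(uint32_t *flag)
{
    /* First reset the state, then wake one thread blocked on the word. */
    __atomic_store_n(flag, 0, __ATOMIC_RELEASE);
    syscall(SYS_futex, flag, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

   The wake-up in lock_release() is issued even when nobody is waiting; avoiding that is the optimization the text alludes to, and it needs a third state in the flag to record contention.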

   Besides  the basic wait and wake-up futex functionality, there are fur‐
   ther futex operations aimed at supporting more complex use cases.  Also
   note  that  no  explicit initialization or destruction are necessary to
   use futexes; the kernel maintains a futex  (i.e.,  the  kernel-internal
   implementation  artifact)  only  while  operations  such as FUTEX_WAIT,
   described below, are being performed on a particular futex word.

   Arguments
   The uaddr argument points to the futex word.  On all platforms, futexes
   are  four-byte  integers  that must be aligned on a four-byte boundary.
   The operation to perform on the futex  is  specified  in  the  futex_op
   argument; val is a value whose meaning and purpose depends on futex_op.

   The  remaining  arguments (timeout, uaddr2, and val3) are required only
   for certain of the futex operations  described  below.   Where  one  of
   these arguments is not required, it is ignored.

   For several blocking operations, the timeout argument is a pointer to a
   timespec structure that specifies a timeout for  the  operation.   How‐
   ever,   notwithstanding the prototype shown above, for some operations,
   this argument is instead a four-byte integer whose  meaning  is  deter‐
 

Re: Revised futex(2) man page for review

2015-03-28 Thread Michael Kerrisk (man-pages)
On 03/28/2015 09:53 AM, Michael Kerrisk (man-pages) wrote:
> Hello all,
[...]
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

One more point I should have added. The revised page
currently sits in a Git branch, here:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Revised futex(2) man page for review

2015-03-28 Thread Michael Kerrisk (man-pages)
Hello all,

As becomes quickly obvious upon reading it, the current futex(2) 
man page is in a sorry state, lacking many important details, and
also the various additions that have been made to the interface
over the last years. I've been working on revising it, first
of all based on input I got in response to a request for help
last year (http://thread.gmane.org/gmane.linux.kernel/1703405), 
especially taking Thomas Gleixner's input 
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) 
into account. I also got some further off-list input from Darren
Hart, Torvald Riegel, and Davidlohr Bueso that has been
incorporated into the revised draft. Other than that, I got
some useful info out of Ulrich Drepper's paper (cited at the
end of the page) and one or two web pages (cited in the page
source).

The page has now increased in size by a factor of about 5, but
is far from complete. In particular, as I reworked the page, 
there were many details that I was not 100% certain of, and I
have added FIXME markers to the page source. In addition,
Torvald added some text, and a few more FIXMEs. Some of
the FIXMEs are trivial, as in: I'd like confirmation that
I have correctly captured a technical detail. Others are more 
substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be
improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to
resolve as many of the FIXMEs (and any other problems that people
see) as possible in the existing text. I need help with that.
(And I know that dealing with that help, if I get it, will in
itself be quite a task, which is why I have been delaying it
for many weeks now, as my time has been rather limited recently.)

So, please take a look at the page below. At this point,
I would most especially appreciate help with the FIXMEs.

Cheers,

Michael

=
.\" Page by b.hubert
.\" and Copyright (C) 2015, Thomas Gleixner 
.\" and Copyright (C) 2015, Michael Kerrisk 
.\"
.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
.\" may be freely modified and distributed
.\" %%%LICENSE_END
.\"
.\" Niki A. Rahimi (LTC Security Development, narah...@us.ibm.com)
.\" added ERRORS section.
.\"
.\" Modified 2004-06-17 mtk
.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
.\"
.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
.\"   2015-01-23:
.\"   http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
.\"
.\" FIXME Do we need add some text regarding Torvald Riegel's 2015-01-24 mail
.\"   at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
.\"
.TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual"
.SH NAME
futex \- fast user-space locking
.SH SYNOPSIS
.nf
.sp
.B "#include <linux/futex.h>"
.B "#include <sys/time.h>"
.sp
.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
.BI "  const struct timespec *" timeout , \
" \fR  /* or: \fBu32 \fIval2\fP */ 
.BI "  int *" uaddr2 ", int " val3 );
.fi

.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
.PP
The
.BR futex ()
system call provides a method for waiting until a certain condition becomes
true.
It is typically used as a blocking construct in the context of
shared-memory synchronization: The program implements the majority of
the synchronization in user space, and uses one of the operations of
the system call when it is likely that it has to block for
a longer time until the condition becomes true.
The program uses another operation of the system call to wake
anyone waiting for a particular condition.

The condition is represented by the futex word, which is an address
in memory supplied to the
.BR futex ()
system call, and the value at this memory location.
(While the virtual addresses for the same memory in separate
processes may not be equal,
the kernel maps them internally so that the same memory mapped
in different locations will correspond for
.BR futex ()
calls.)

When executing a futex operation that requests to block a thread,
the kernel will only block if the futex word has the value that the
calling thread supplied as expected value.
The load from the futex word, the comparison with
the expected value,
and the actual blocking will happen atomically and totally
ordered with respect to concurrently executing futex operations
on the same futex word,
such as operations that wake threads blocked on this futex word.
Thus, the futex word is used to connect the synchronization in user spac
with the implementation of blocking by the kernel; similar to an atomic
compare-and-exchange operation that potentially changes shared memory,
blocking via a futex is an atomic compare-and-block operation.
See NOTES for
a detailed specification of the synchronization semantics.

One example use of futexes is implementing locks.
The state of the lock (i.e.,
acquired or not acquired) can be represented as an 

Re: Revised futex(2) man page for review

2015-03-28 Thread Peter Zijlstra
On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
 So, please take a look at the page below. At this point,
 I would most especially appreciate help with the FIXMEs.

For people who cannot read that troff gibberish (me)..

---
FUTEX(2)   Linux Programmer's Manual  FUTEX(2)




NAME
   futex - fast user-space locking

SYNOPSIS
   #include linux/futex.h
   #include sys/time.h

   int futex(int *uaddr, int futex_op, int val,
 const struct timespec *timeout,   /* or: u32 val2 */
 int *uaddr2, int val3);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The  futex()  system call provides a method for waiting until a certain
   condition becomes true.  It is typically used as a  blocking  construct
   in the context of shared-memory synchronization: The program implements
   the majority of the synchronization in user  space,  and  uses  one  of
   operations  of  the  system call when it is likely that it has to block
   for a longer time until the condition becomes true.  The  program  uses
   another  operation of the system call to wake anyone waiting for a par‐
   ticular condition.

   The condition is represented by the futex word, which is an address  in
   memory  supplied to the futex() system call, and the value at this mem‐
   ory location.  (While the virtual addresses for the same memory in sep‐
   arate  processes  may  not be equal, the kernel maps them internally so
   that the same memory mapped in different locations will correspond  for
   futex() calls.)

   When  executing  a futex operation that requests to block a thread, the
   kernel will only block if the futex word has the value that the calling
   thread  supplied  as expected value.  The load from the futex word, the
   comparison with the expected value, and the actual blocking will happen
   atomically  and  totally ordered with respect to concurrently executing
   futex operations on the same futex word, such as operations  that  wake
   threads  blocked  on  this futex word.  Thus, the futex word is used to
   connect the synchronization in user spac  with  the  implementation  of
   blocking by the kernel; similar to an atomic compare-and-exchange oper‐
   ation that potentially changes shared memory, blocking via a  futex  is
   an atomic compare-and-block operation.  See NOTES for a detailed speci‐
   fication of the synchronization semantics.

   One example use of futexes is implementing locks.   The  state  of  the
   lock  (i.e.,  acquired or not acquired) can be represented as an atomi‐
   cally accessed flag in shared  memory.   In  the  uncontended  case,  a
   thread  can  access  or modify the lock state with atomic instructions,
   for example atomically changing it from not acquired to acquired  using
   an atomic compare-and-exchange instruction.  If a thread cannot acquire
   a lock because it is already acquired by another thread, it can request
   to  block  if  and  only the lock is still acquired by using the lock's
   flag as futex word and expecting a value that represents  the  acquired
   state.   When  releasing the lock, a thread has to first reset the lock
   state to not acquired and then execute the futex operation  that  wakes
   one  thread blocked on the futex word that is the lock's flag (this can
   be be further optimized to avoid unnecessary wake-ups).   See  futex(7)
   for more detail on how to use futexes.

   Besides  the basic wait and wake-up futex functionality, there are fur‐
   ther futex operations aimed at supporting more complex use cases.  Also
   note  that  no  explicit initialization or destruction are necessary to
   use futexes; the kernel maintains a futex  (i.e.,  the  kernel-internal
   implementation  artifact)  only  while  operations  such as FUTEX_WAIT,
   described below, are being performed on a particular futex word.

   Arguments
   The uaddr argument points to the futex word.  On all platforms, futexes
   are  four-byte  integers  that must be aligned on a four-byte boundary.
   The operation to perform on the futex  is  specified  in  the  futex_op
   argument; val is a value whose meaning and purpose depends on futex_op.

   The  remaining  arguments (timeout, uaddr2, and val3) are required only
   for certain of the futex operations  described  below.   Where  one  of
   these arguments is not required, it is ignored.

   For several blocking operations, the timeout argument is a pointer to a
   timespec structure that specifies a timeout for  the  operation.   How‐
   ever,   notwithstanding the prototype shown above, for some operations,
   this argument is instead a four-byte integer whose  

Re: Revised futex(2) man page for review

2015-03-28 Thread Peter Zijlstra
On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote:
FUTEX_WAIT (since Linux 2.6.0)
   This operation tests that the value at the futex word pointed to
   by the address uaddr still contains the expected value val,  and
   if  so,  then sleeps awaiting FUTEX_WAKE on the futex word.  The
   load of the value of the futex word is an atomic  memory  access
   (i.e.,  using  atomic  machine  instructions  of  the respective
   architecture).  This load,  the  comparison  with  the  expected
   value,  and  starting  to  sleep  are  performed  atomically and
   totally ordered with respect to other futex  operations  on  the
   same  futex  word.  If the thread starts to sleep, it is consid‐
   ered a waiter on this futex word.  If the futex value  does  not
   match  val,  then  the  call  fails  immediately  with the error
   EAGAIN.
 
   The purpose of the comparison with the expected value is to pre‐
   vent  lost  wake-ups: If another thread changed the value of the
   futex word after the calling thread decided to  block  based  on
   the  prior  value, and if the other thread executed a FUTEX_WAKE
   operation (or similar wake-up) after the value change and before
   this  FUTEX_WAIT  operation,  then  the  latter will observe the
   value change and will not start to sleep.
 
   If the timeout argument is non-NULL, its contents specify a rel‐
   ative   timeout   for   the  wait,  measured  according  to  the
   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
   system clock granularity, and kernel scheduling delays mean that
   the blocking interval may overrun by a small amount.)  If  time‐
   out is NULL, the call blocks indefinitely.

Would it not be better to only state that the wait will not return
before the timeout -- unless woken -- and not bother with clock
granularity and scheduling delays?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Revised futex(2) man page for review

2015-03-28 Thread Michael Kerrisk (man-pages)
On 03/28/2015 09:53 AM, Michael Kerrisk (man-pages) wrote:
> Hello all,
[...]
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

One more point I should have added. The revised page
currently sits in a Git branch, here:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Revised futex(2) man page for review

2015-03-28 Thread Michael Kerrisk (man-pages)
Hello all,

As becomes quickly obvious upon reading it, the current futex(2) 
man page is in a sorry state, lacking many important details, and
also the various additions that have been made to the interface
over the last years. I've been working on revising it, first
of all based on input I got in response to a request for help
last year (http://thread.gmane.org/gmane.linux.kernel/1703405), 
especially taking Thomas Gleixner's input 
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) 
into account. I also got some further offlist input from Darren
Hart, Torvald Riegel, and Davidlohr Bueso that has been
incorporated into the revised draft. Other than that, I got
some useful info out of Ulrich Drepper's paper (cited at the
end of the page) and one or two web pages (cited in the page
source).

The page has now increased in size by a factor of about 5, but
is far from complete. In particular, as I reworked the page, 
there were many details that I was not 100% certain of, and I
have added FIXME markers to the page source. In addition,
Torvald added some text, and a few more FIXMEs. Some of
the FIXMEs are trivial, as in: I'd like confirmation that
I have correctly captured a technical detail. Others are more 
substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be
improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to
resolve as many of the FIXMEs (and any other problems that people
see) as possible in the existing text. I need help with that. 
(And I know that dealing with that help, if I get it, will in 
itself be quite a task, which is why I have 
been delaying it for many weeks now, as my time has been 
rather limited recently.)

So, please take a look at the page below. At this point,
I would most especially appreciate help with the FIXMEs.

Cheers,

Michael

=
.\" Page by b.hubert
.\" and Copyright (C) 2015, Thomas Gleixner <t...@linutronix.de>
.\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpa...@gmail.com>
.\"
.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
.\" may be freely modified and distributed
.\" %%%LICENSE_END
.\"
.\" Niki A. Rahimi (LTC Security Development, narah...@us.ibm.com)
.\" added ERRORS section.
.\"
.\" Modified 2004-06-17 mtk
.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
.\"
.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
.\"   2015-01-23:
.\"   http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
.\"
.\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail
.\"   at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
.\"
.TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual"
.SH NAME
futex \- fast user-space locking
.SH SYNOPSIS
.nf
.sp
.B "#include <linux/futex.h>"
.B "#include <sys/time.h>"
.sp
.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
.BI "          const struct timespec *" timeout , \
" \fR  /* or: \fBu32 \fIval2\fP */ "
.BI "          int *" uaddr2 ", int " val3 );
.fi

.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
.PP
The
.BR futex ()
system call provides a method for waiting until a certain condition becomes
true.
It is typically used as a blocking construct in the context of
shared-memory synchronization: The program implements the majority of
the synchronization in user space, and uses one of the operations of
the system call when it is likely that it has to block for
a longer time until the condition becomes true.
The program uses another operation of the system call to wake
anyone waiting for a particular condition.

The condition is represented by the futex word, which is an address
in memory supplied to the
.BR futex ()
system call, and the value at this memory location.
(While the virtual addresses for the same memory in separate
processes may not be equal,
the kernel maps them internally so that the same memory mapped
in different locations will correspond for
.BR futex ()
calls.)
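
As a rough sketch of a futex word shared between processes, the
following C fragment places the word in a MAP_SHARED mapping, lets a
child process change it and issue FUTEX_WAKE, and has the parent wait
with FUTEX_WAIT while the word is still 0.
The names futex_simple() and cross_process_handoff() are hypothetical
and exist only for this example.
After fork(2) both processes happen to see the mapping at the same
virtual address; the example assumes only that the kernel matches
waiters and wakers by the underlying memory, as described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: glibc provides no futex() wrapper, so the
   call goes through syscall(2).  Unused arguments are passed as 0. */
static long futex_simple(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Returns the futex word's value as seen by the parent once the
   child has released it. */
static int cross_process_handoff(void)
{
    /* The futex word lives in memory shared by parent and child. */
    uint32_t *word = mmap(NULL, sizeof(*word), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(word != MAP_FAILED);
    __atomic_store_n(word, 0, __ATOMIC_SEQ_CST);

    pid_t pid = fork();
    assert(pid >= 0);

    if (pid == 0) {
        /* Child: make the condition true, then wake one waiter. */
        __atomic_store_n(word, 1, __ATOMIC_SEQ_CST);
        futex_simple(word, FUTEX_WAKE, 1);
        _exit(0);
    }

    /* Parent: block only while the word still holds 0.  If the child
       already changed it, FUTEX_WAIT fails with EAGAIN and the loop
       condition ends the wait. */
    while (__atomic_load_n(word, __ATOMIC_SEQ_CST) == 0) {
        long r = futex_simple(word, FUTEX_WAIT, 0);
        if (r == -1 && errno != EAGAIN && errno != EINTR)
            break;      /* unexpected error: stop waiting */
    }

    int seen = (int) __atomic_load_n(word, __ATOMIC_SEQ_CST);
    waitpid(pid, NULL, 0);
    munmap(word, sizeof(*word));
    return seen;
}
```

The loop rechecks the word after every return from FUTEX_WAIT, so a
wake-up that arrives before the parent blocks is not lost.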

When executing a futex operation that requests to block a thread,
the kernel will only block if the futex word has the value that the
calling thread supplied as expected value.
The load from the futex word, the comparison with
the expected value,
and the actual blocking will happen atomically and totally
ordered with respect to concurrently executing futex operations
on the same futex word,
such as operations that wake threads blocked on this futex word.
Thus, the futex word is used to connect the synchronization in user space
with the implementation of blocking by the kernel; similar to an atomic
compare-and-exchange operation that potentially changes shared memory,
blocking via a futex is an atomic compare-and-block operation.
See NOTES for
a detailed specification of the synchronization semantics.
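
The compare-and-block behavior can be observed with a small C sketch
such as the following, which calls FUTEX_WAIT twice on a word that no
other thread touches.
In the first call the expected value no longer matches the futex word,
so the call should fail at once with EAGAIN; in the second the values
match and nobody issues a wake-up, so the call should block until the
short relative timeout expires and then fail with ETIMEDOUT.
The wrapper futex_call() and the two function names are hypothetical,
and uint32_t is assumed here as the type of the four-byte futex word.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: glibc provides none, so use syscall(2). */
static long futex_call(uint32_t *uaddr, int futex_op, uint32_t val,
                       const struct timespec *timeout,
                       uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* The word holds 1 but the caller expects 0: the kernel refuses to
   block.  Returns the resulting errno. */
static int wait_with_stale_value(void)
{
    uint32_t word = 1;
    long r = futex_call(&word, FUTEX_WAIT, 0, NULL, NULL, 0);
    assert(r == -1);
    return errno;
}

/* The word holds the expected value 0 and nobody wakes the caller:
   the thread sleeps until the 10 ms relative timeout expires.
   Returns the resulting errno. */
static int wait_with_matching_value(void)
{
    uint32_t word = 0;
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    long r = futex_call(&word, FUTEX_WAIT, 0, &ts, NULL, 0);
    assert(r == -1);
    return errno;
}
```

A real caller would treat EAGAIN as a cue to reread the futex word and
re-evaluate the condition in user space before deciding whether to
block again.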

One example use of futexes is implementing locks.
The state of the lock (i.e.,
acquired or not acquired) can be