Re: Next round: revised futex(2) man page for review

2015-10-08 Thread Darren Hart
On Wed, Oct 07, 2015 at 10:34:19AM +0100, Michael Kerrisk (man-pages) wrote:
> On 08/19/2015 03:40 PM, Thomas Gleixner wrote:
> > On Wed, 5 Aug 2015, Darren Hart wrote:
> >> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> >>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> >>> .\"   The following text is drawn from the Hart/Guniguntala paper
> >>> .\"   (listed in SEE ALSO), but I have reworded some pieces
> >>> .\"   significantly. Please check it.
> >>>
> >>>The PI futex operations described below  differ  from  the  other
> >>>futex  operations  in  that  they impose policy on the use of the
> >>>value of the futex word:
> >>>
> >>>*  If the lock is not acquired, the futex word's value  shall  be
> >>>   0.
> >>>
> >>>*  If  the  lock is acquired, the futex word's value shall be the
> >>>   thread ID (TID; see gettid(2)) of the owning thread.
> >>>
> >>>*  If the lock is owned and there are threads contending for  the
> >>>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >>>   word's value; in other words, this value is:
> >>>
> >>>   FUTEX_WAITERS | TID
> >>>
> >>>
> >>>Note that a PI futex word never just has the value FUTEX_WAITERS,
> >>>which is a permissible state for non-PI futexes.
> >>
> >> The second clause is inappropriate. I don't know if that was yours or
> >> mine, but non-PI futexes do not have a kernel defined value policy, so
> >> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> >> permissible for non-PI futexes, and none have a kernel defined state.
> > 
> > Depends. If the regular futex is configured as robust, then we have a
> > kernel defined value policy as well.
> 

Right.

> Okay -- so do we need a change to the text here?

Hrm. We probably need a way to indicate that a kernel-defined futex word
value policy applies only to PI and/or robust futexes.
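
To make the value policy quoted above concrete, here is a minimal sketch of how
user space might encode and decode a PI futex word. The constants mirror the
values in <linux/futex.h>; the struct and function names (pi_word_view,
pi_word_encode, pi_word_decode, pi_word_follows_policy) are made up for
illustration, and the policy check deliberately ignores the FUTEX_OWNER_DIED
bit that robust handling can set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror <linux/futex.h>; redefined so the sketch stands alone. */
#ifndef FUTEX_WAITERS
#define FUTEX_WAITERS  0x80000000u
#endif
#ifndef FUTEX_TID_MASK
#define FUTEX_TID_MASK 0x3fffffffu
#endif

/* Hypothetical helper type: a decoded view of a PI futex word. */
struct pi_word_view {
    uint32_t owner_tid;   /* 0 means the lock is not acquired */
    bool     has_waiters; /* state of the FUTEX_WAITERS bit */
};

/* Build the word for "owned by tid", optionally with contending waiters. */
static uint32_t pi_word_encode(uint32_t tid, bool has_waiters)
{
    assert((tid & ~FUTEX_TID_MASK) == 0); /* a TID must fit in the TID bits */
    return tid | (has_waiters ? FUTEX_WAITERS : 0u);
}

static struct pi_word_view pi_word_decode(uint32_t word)
{
    struct pi_word_view v;
    v.owner_tid   = word & FUTEX_TID_MASK;
    v.has_waiters = (word & FUTEX_WAITERS) != 0;
    return v;
}

/*
 * Checks only the rule from the quoted text: FUTEX_WAITERS must never
 * appear without an owner TID. Assumption: FUTEX_OWNER_DIED is ignored.
 */
static bool pi_word_follows_policy(uint32_t word)
{
    struct pi_word_view v = pi_word_decode(word);
    return !(v.has_waiters && v.owner_tid == 0);
}
```

Under these assumptions, 0 and FUTEX_WAITERS | TID both pass the check, while a
bare FUTEX_WAITERS does not.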


> 
> >>> .\" FIXME I'm not quite clear on the meaning of the following sentence.
> >>> .\"   Is this trying to say that while blocked in a
> >>> .\"   FUTEX_WAIT_REQUEUE_PI, it could happen that another
> >>> .\"   task does a FUTEX_WAKE on uaddr that simply causes
> >>> .\"   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
> >>> .\"   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
> >>> .\"   operation? Does it remain blocked, or does it unblock?
> >>> .\"   In which case, what does user space see?
> >>>
> >>>   The
> >>>   waiter   can  be  removed  from  the  wait  on  uaddr  via
> >>>   FUTEX_WAKE without requeueing on uaddr2.
> >>
> >> Userspace should see the task wake and continue executing. This would
> >> effectively be a cancelation operation - which I didn't think was
> >> supported. Thomas?
> > 
> > We probably never intended to support it, but looking at the code it
> > works (did not try it though). It returns to user space with
> > -EWOULDBLOCK. So it basically behaves like any other spurious wakeup.
> 
> Again, I assume no changes are required to the man page(?).

I'd rather not document this as supported or intended behavior.
FUTEX_WAIT_REQUEUE_PI is documented as being paired with and only with
FUTEX_CMP_REQUEUE_PI. Anything else is undefined behavior.

If we want to support a cancelation, it should be deliberate - and we should
probably test it ;-)


-- 
Darren Hart
Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Next round: revised futex(2) man page for review

2015-10-08 Thread Darren Hart
On Wed, Oct 07, 2015 at 09:30:46AM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Thomas,
> 
> Thanks for the follow up!
> 
> Some open questions below are marked with the string ###.

A couple of comments from me below, although I suspect you have this much
covered already.

> 
> On 08/19/2015 04:17 PM, Thomas Gleixner wrote:
> > On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
> FUTEX_CMP_REQUEUE (since Linux 2.6.7)
>    This  operation  first  checks  whether the location uaddr
>    still contains the value  val3.   If  not,  the  operation
>    fails  with  the  error  EAGAIN.  Otherwise, the operation
>    wakes up a maximum of val waiters that are waiting on  the
>    futex  at uaddr.  If there are more than val waiters, then
>    the remaining waiters are removed from the wait  queue  of
>    the  source  futex at uaddr and added to the wait queue of
>    the target futex at uaddr2.  The val2  argument  specifies
>    an  upper limit on the number of waiters that are requeued
>    to the futex at uaddr2.
> 
>  .\" FIXME(Torvald) Is the following correct?  Or is just the decision
>  .\" which threads to wake or requeue part of the atomic operation?
> 
>    The load from uaddr is  an  atomic  memory  access  (i.e.,
>    using atomic machine instructions of the respective archi‐
>    tecture).  This load, the comparison with  val3,  and  the
>    requeueing  of  any  waiters  are performed atomically and
>    totally ordered with respect to other  operations  on  the
>    same futex word.
> >>>
> >>> It's atomic as the other atomic operations on the futex word. It's
> >>> always performed with the proper lock(s) held in the kernel. That
> >>> means any concurrent operation will serialize on that lock(s). User
> >>> space has to make sure, that depending on the observed value no
> >>> concurrent operations happen, but that's something the kernel cannot
> >>> control.
> >>
> >> ???
> >> Sorry, I'm not clear here. Is the current text correct then? Or is some
> >> change needed.
> > 
> > I think we need some change here because the meaning of atomic is
> > unclear. The basic rules of futexes are:
> > 
> >  - All modifying operations on the futex value have to be done with
> >    atomic instructions, usually cmpxchg. That applies to both kernel
> >    and user space.
> > 
> >    That's the atomicity at the futex value level.
> > 
> >  - In the kernel we have to create/modify/destroy state in order to
> >    provide the blocking/requeueing etc.
> > 
> >    This state needs protection as well. So all operations related to
> >    the kernel internal state are serialized on the hash bucket
> >    locks. The hash buckets are a scalability mechanism to avoid
> >    contention on a single lock protecting all kernel internal
> >    state. For simplicity reasons you can just think of a global lock
> >    protecting all kernel internal state.
> > 
> >    If the kernel creates/modifies state then it can be necessary to
> >    either reread the futex value or modify it. That happens under the
> >    locks as well.
> > 
> >    So in the case of requeue, we take the proper locks and perform the
> >    comparison with val3 and the requeueing with the locks held.
> > 
> >    So that lock protection makes these operations 'atomic'. The
> >    correct expression is 'serialized'.
> 
> ###
> So, here, I think I need some specific pointers on the precise text
> changes that are required. Let's talk about this f2f. For convenience,
> here's the relevant text once again quoted:

Not speaking for tglx, but I think the point here is to distinguish between
atomic (as in cmpxchg comparison tests performed on the futex word) and
serialized (as in the management of futex hashbuckets and task states).

> 
>FUTEX_CMP_REQUEUE (since Linux 2.6.7)
>   This  operation  first  checks  whether the location uaddr
>   still contains the value  val3.   If  not,  the  operation
>   fails  with  the  error  EAGAIN.  Otherwise, the operation

Here you might explain the _CMP_ qualifier and note atomicity of the operation:

The _CMP_ refers to the verification of the userspace state as specified
through the arguments. This operation first atomically compares the value at
uaddr with the value val3 ...
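
For reference, a minimal sketch of how FUTEX_CMP_REQUEUE is invoked from user
space, assuming Linux and the raw syscall(2) interface. The wrapper name
futex_cmp_requeue and its parameter names are hypothetical. Note that val2
travels in the argument slot that other futex operations use for the timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper around the raw futex system call.
 *
 * If *uaddr still equals 'expected' (val3), wake up to nr_wake (val)
 * waiters on uaddr and requeue up to nr_requeue (val2) of the remaining
 * waiters onto uaddr2. Returns the raw syscall result; on a value
 * mismatch that is -1 with errno set to EAGAIN.
 */
static long futex_cmp_requeue(uint32_t *uaddr, uint32_t *uaddr2,
                              int nr_wake, int nr_requeue,
                              uint32_t expected)
{
    assert(uaddr != NULL && uaddr2 != NULL);
    /* The 4th syscall argument carries val2 here, not a timeout pointer. */
    return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE, nr_wake,
                   (unsigned long)nr_requeue, uaddr2, expected);
}
```

With no waiters queued, a call whose expected value does not match *uaddr fails
with EAGAIN, and a call whose expected value matches returns 0.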


>   wakes up a maximum of val waiters that are waiting on  the
>   futex  at uaddr.  If there are more than val waiters, then
>   the remaining waiters are removed from the wait  queue  of
>   the  source  futex at uaddr and added to the wait queue of
>   the target futex at uaddr2.  The val2  argument  specifies
>   an  upper limit on the number

Re: Next round: revised futex(2) man page for review

2015-10-07 Thread Michael Kerrisk (man-pages)
On 08/19/2015 03:40 PM, Thomas Gleixner wrote:
> On Wed, 5 Aug 2015, Darren Hart wrote:
>> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
>>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
>>> .\"   The following text is drawn from the Hart/Guniguntala paper
>>> .\"   (listed in SEE ALSO), but I have reworded some pieces
>>> .\"   significantly. Please check it.
>>>
>>>The PI futex operations described below  differ  from  the  other
>>>futex  operations  in  that  they impose policy on the use of the
>>>value of the futex word:
>>>
>>>*  If the lock is not acquired, the futex word's value  shall  be
>>>   0.
>>>
>>>*  If  the  lock is acquired, the futex word's value shall be the
>>>   thread ID (TID; see gettid(2)) of the owning thread.
>>>
>>>*  If the lock is owned and there are threads contending for  the
>>>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>>>   word's value; in other words, this value is:
>>>
>>>   FUTEX_WAITERS | TID
>>>
>>>
>>>Note that a PI futex word never just has the value FUTEX_WAITERS,
>>>which is a permissible state for non-PI futexes.
>>
>> The second clause is inappropriate. I don't know if that was yours or
>> mine, but non-PI futexes do not have a kernel defined value policy, so
>> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
>> permissible for non-PI futexes, and none have a kernel defined state.
> 
> Depends. If the regular futex is configured as robust, then we have a
> kernel defined value policy as well.

Okay -- so do we need a change to the text here?

>>> .\" FIXME I'm not quite clear on the meaning of the following sentence.
>>> .\"   Is this trying to say that while blocked in a
>>> .\"   FUTEX_WAIT_REQUEUE_PI, it could happen that another
>>> .\"   task does a FUTEX_WAKE on uaddr that simply causes
>>> .\"   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
>>> .\"   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
>>> .\"   operation? Does it remain blocked, or does it unblock?
>>> .\"   In which case, what does user space see?
>>>
>>>   The
>>>   waiter   can  be  removed  from  the  wait  on  uaddr  via
>>>   FUTEX_WAKE without requeueing on uaddr2.
>>
>> Userspace should see the task wake and continue executing. This would
>> effectively be a cancelation operation - which I didn't think was
>> supported. Thomas?
> 
> We probably never intended to support it, but looking at the code it
> works (did not try it though). It returns to user space with
> -EWOULDBLOCK. So it basically behaves like any other spurious wakeup.

Again, I assume no changes are required to the man page(?).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Next round: revised futex(2) man page for review

2015-10-07 Thread Michael Kerrisk (man-pages)
Hello Thomas,

Thanks for the follow up!

Some open questions below are marked with the string ###.

On 08/19/2015 04:17 PM, Thomas Gleixner wrote:
> On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
FUTEX_CMP_REQUEUE (since Linux 2.6.7)
   This  operation  first  checks  whether the location uaddr
   still contains the value  val3.   If  not,  the  operation
   fails  with  the  error  EAGAIN.  Otherwise, the operation
   wakes up a maximum of val waiters that are waiting on  the
   futex  at uaddr.  If there are more than val waiters, then
   the remaining waiters are removed from the wait  queue  of
   the  source  futex at uaddr and added to the wait queue of
   the target futex at uaddr2.  The val2  argument  specifies
   an  upper limit on the number of waiters that are requeued
   to the futex at uaddr2.

 .\" FIXME(Torvald) Is the following correct?  Or is just the decision
 .\" which threads to wake or requeue part of the atomic operation?

   The load from uaddr is  an  atomic  memory  access  (i.e.,
   using atomic machine instructions of the respective archi‐
   tecture).  This load, the comparison with  val3,  and  the
   requeueing  of  any  waiters  are performed atomically and
   totally ordered with respect to other  operations  on  the
   same futex word.
>>>
>>> It's atomic as the other atomic operations on the futex word. It's
>>> always performed with the proper lock(s) held in the kernel. That
>>> means any concurrent operation will serialize on that lock(s). User
>>> space has to make sure, that depending on the observed value no
>>> concurrent operations happen, but that's something the kernel cannot
>>> control.
>>
>> ???
>> Sorry, I'm not clear here. Is the current text correct then? Or is some
>> change needed.
> 
> I think we need some change here because the meaning of atomic is
> unclear. The basic rules of futexes are:
> 
>  - All modifying operations on the futex value have to be done with
>    atomic instructions, usually cmpxchg. That applies to both kernel
>    and user space.
> 
>    That's the atomicity at the futex value level.
> 
>  - In the kernel we have to create/modify/destroy state in order to
>    provide the blocking/requeueing etc.
> 
>    This state needs protection as well. So all operations related to
>    the kernel internal state are serialized on the hash bucket
>    locks. The hash buckets are a scalability mechanism to avoid
>    contention on a single lock protecting all kernel internal
>    state. For simplicity reasons you can just think of a global lock
>    protecting all kernel internal state.
> 
>    If the kernel creates/modifies state then it can be necessary to
>    either reread the futex value or modify it. That happens under the
>    locks as well.
> 
>    So in the case of requeue, we take the proper locks and perform the
>    comparison with val3 and the requeueing with the locks held.
> 
>    So that lock protection makes these operations 'atomic'. The
>    correct expression is 'serialized'.

###
So, here, I think I need some specific pointers on the precise text
changes that are required. Let's talk about this f2f. For convenience,
here's the relevant text once again quoted:

   FUTEX_CMP_REQUEUE (since Linux 2.6.7)
  This  operation  first  checks  whether the location uaddr
  still contains the value  val3.   If  not,  the  operation
  fails  with  the  error  EAGAIN.  Otherwise, the operation
  wakes up a maximum of val waiters that are waiting on  the
  futex  at uaddr.  If there are more than val waiters, then
  the remaining waiters are removed from the wait  queue  of
  the  source  futex at uaddr and added to the wait queue of
  the target futex at uaddr2.  The val2  argument  specifies
  an  upper limit on the number of waiters that are requeued
  to the futex at uaddr2.

  The load from uaddr is  an  atomic  memory  access  (i.e.,
  using atomic machine instructions of the respective archi‐
  tecture).  This load, the comparison with  val3,  and  the
  requeueing  of  any  waiters  are performed atomically and
  totally ordered with respect to other  operations  on  the
  same futex word.


 .\" FIXME We need some explanation in the following paragraph of *why*
 .\"   it is important to note that "the kernel will update the
 .\"   futex word's value prior to returning to user space". Can someone
 .\"   explain?

It is important to note that the kernel will update the futex word's value
prior 

Re: Next round: revised futex(2) man page for review

2015-08-25 Thread Darren Hart
On Thu, Aug 20, 2015 at 01:17:03AM +0200, Thomas Gleixner wrote:

...

> > >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> > >> .\"   made the observation that "EINVAL is returned if the non-pi 
> > >> .\"   to pi or op pairing semantics are violated."
> > >> .\"   Probably there needs to be a general statement about this
> > >> .\"   requirement, probably located at about this point in the page.
> > >> .\"   Darren (or someone else), care to take a shot at this?
> > > 
> > > Well, that's hard to describe because the kernel only has a limited
> > > way of detecting such mismatches. It only can detect it when there are
> > > non PI waiters on a futex and a PI function is called or vice versa.
> > 
> > Hmmm. Okay, I filed your comments away for reference, but
> > hopefully someone can help with some actual text.
> 
> I let Darren come up with something sensible :)

Heh, right, no pressure then...

I responded to Michael on this recently, copied here for reference:


FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such
as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match
(meaning it's the same futex word - shared memory, inode, etc.). This can't
happen if the stated policy of requeueing from non-pi to pi is followed as the
same word cannot be both non-pi and pi at the same time, requiring them to be
unique futex words.

FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex
word. Also, if nr_wake != 1.

But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must requeue uaddr to
the same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call.
FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it,
and they must agree on uaddr and uaddr2.
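
The pairing rules above can be sketched as a user-space sanity check. This is
only an illustration under stated assumptions: the type and function names are
hypothetical, and it compares plain pointers, so it cannot detect two different
mappings of the same shared futex word (the kernel detects that case by
comparing futex keys).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the addresses a waiter passed to FUTEX_WAIT_REQUEUE_PI. */
struct requeue_pi_pair {
    const uint32_t *uaddr;   /* non-PI futex word being waited on */
    const uint32_t *uaddr2;  /* PI futex word to be requeued to */
};

/* FUTEX_WAIT_REQUEUE_PI: source and target must be distinct futex words. */
static bool wait_requeue_pi_args_ok(const uint32_t *uaddr,
                                    const uint32_t *uaddr2)
{
    return uaddr != NULL && uaddr2 != NULL && uaddr != uaddr2;
}

/*
 * FUTEX_CMP_REQUEUE_PI: distinct words, nr_wake == 1, and the same
 * uaddr/uaddr2 pair that the earlier FUTEX_WAIT_REQUEUE_PI call used.
 */
static bool cmp_requeue_pi_args_ok(const struct requeue_pi_pair *waiter,
                                   const uint32_t *uaddr,
                                   const uint32_t *uaddr2, int nr_wake)
{
    assert(waiter != NULL);
    return wait_requeue_pi_args_ok(uaddr, uaddr2)
        && nr_wake == 1
        && uaddr == waiter->uaddr
        && uaddr2 == waiter->uaddr2;
}
```

A caller could run such a check before issuing the system call. A passing check
says nothing about whether the kernel will accept the call.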


Michael, are you still looking for something more from me, or is this FIXME now
complete?



-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-08-24 Thread Darren Hart
On Sat, Aug 08, 2015 at 08:57:35AM +0200, Michael Kerrisk (man-pages) wrote:

...

> >> .\" FIXME = End of adapted Hart/Guniguntala text =
> >>
> >>
> >>
> >> .\" FIXME We need some explanation in the following paragraph of *why*
> >> .\"   it is important to note that "the kernel will update the
> >> .\"   futex word's value prior to returning to user space". Can someone
> >> .\"   explain?
> >>
> >>It is important to note that the kernel will update the futex word's
> >>value prior to returning to user space.  Unlike the other futex
> >>operations described above, the PI futex operations are designed for
> >>the implementation of very specific IPC mechanisms.
> > 
> > If the kernel didn't perform the update prior to returning to userspace,
> > we could end up in an invalid state. Such as having an owner, but the
> > value being 0. Or having waiters, but not having FUTEX_WAITERS set.
> 
> So I've now reworked this passage to read:
> 
>It  is  important  to  note that the kernel will update the futex
>word's value prior to returning to user  space.   (This  prevents
>the possibility of the futex word's value ending up in an invalid
>state, such as having an owner but the value being 0,  or  having
>waiters but not having the FUTEX_WAITERS bit set.)
> 
> Okay?

Yes.

> 
> >> .\"
> >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> >> .\"   made the observation that "EINVAL is returned if the non-pi 
> >> .\"   to pi or op pairing semantics are violated."
> >> .\"   Probably there needs to be a general statement about this
> >> .\"   requirement, probably located at about this point in the page.
> >> .\"   Darren (or someone else), care to take a shot at this?
> > 
> > We can probably borrow from either the futex.c comments or the
> > futex-requeue-pi.txt in Documentation. Also, it is important to note
> > that the PI requeue operations require two distinct uaddrs (although
> > that is implied by requiring "non-pi to pi", as a futex cannot be both).
> > 
> > Or... perhaps something like:
> > 
> > Due to the kernel imposed futex word value policy, PI futex
> > operations have additional usage requirements:
> > 
> > FUTEX_WAIT_REQUEUE_PI must be paired with FUTEX_CMP_REQUEUE_PI
> > and be performed from a non-pi futex to a distinct pi futex.
> > Failing to do so will return EINVAL. 
> 
> For which operation does the EINVAL occur: FUTEX_WAIT_REQUEUE_PI or 
> FUTEX_CMP_REQUEUE_PI?

FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such
as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match
(meaning it's the same futex word - shared memory, inode, etc.). This can't
happen if the stated policy of requeueing from non-pi to pi is followed as the
same word cannot be both non-pi and pi at the same time, requiring them to be
unique futex words.

FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex
word. Also, if nr_wake != 1.

But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must requeue uaddr to
the same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call.
FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it,
and they must agree on uaddr and uaddr2.

...

> > And their PRIVATE counterparts of course (which is assumed as it is a
> > flag to the opcode).
> 
> Yes. But I don't think that needs to be called out explicitly here (?).


Agreed.

> 
> >> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> >> .\"   The following text is drawn from the Hart/Guniguntala paper
> >> .\"   (listed in SEE ALSO), but I have reworded some pieces
> >> .\"   significantly. Please check it.
> >>
> >>The PI futex operations described below  differ  from  the  other
> >>futex  operations  in  that  they impose policy on the use of the
> >>value of the futex word:
> >>
> >>*  If the lock is not acquired, the futex word's value  shall  be
> >>   0.
> >>
> >>*  If  the  lock is acquired, the futex word's value shall be the
> >>   thread ID (TID; see gettid(2)) of the owning thread.
> >>
> >>*  If the lock is owned and there are threads contending for  the
> >>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >>   word's value; in other words, this value is:
> >>
> >>   FUTEX_WAITERS | TID
> >>
> >>
> >>Note that a PI futex word never just has the value FUTEX_WAITERS,
> >>which is a permissible state for non-PI futexes.
> > 
> > The second clause is inappropriate. I don't know if that was yours or
> > mine, but non-PI futexes do not have a kernel defined value policy, so
> > ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> > permissible for non-PI futexes, and none have a kernel defined state.
> > 
>

Re: Next round: revised futex(2) man page for review

2015-08-24 Thread Darren Hart
On Thu, Aug 20, 2015 at 12:40:46AM +0200, Thomas Gleixner wrote:
> On Wed, 5 Aug 2015, Darren Hart wrote:
> > On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> > > .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> > > .\"   The following text is drawn from the Hart/Guniguntala paper
> > > .\"   (listed in SEE ALSO), but I have reworded some pieces
> > > .\"   significantly. Please check it.
> > > 
> > >The PI futex operations described below  differ  from  the  other
> > >futex  operations  in  that  they impose policy on the use of the
> > >value of the futex word:
> > > 
> > >*  If the lock is not acquired, the futex word's value  shall  be
> > >   0.
> > > 
> > >*  If  the  lock is acquired, the futex word's value shall be the
> > >   thread ID (TID; see gettid(2)) of the owning thread.
> > > 
> > >*  If the lock is owned and there are threads contending for  the
> > >   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> > >   word's value; in other words, this value is:
> > > 
> > >   FUTEX_WAITERS | TID
> > > 
> > > 
> > >Note that a PI futex word never just has the value FUTEX_WAITERS,
> > >which is a permissible state for non-PI futexes.
> > 
> > The second clause is inappropriate. I don't know if that was yours or
> > mine, but non-PI futexes do not have a kernel defined value policy, so
> > ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> > permissible for non-PI futexes, and none have a kernel defined state.
> 
> Depends. If the regular futex is configured as robust, then we have a
> kernel defined value policy as well.

Indeed, thanks for catching that.

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-08-19 Thread Thomas Gleixner
On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
> >>FUTEX_CMP_REQUEUE (since Linux 2.6.7)
> >>   This  operation  first  checks  whether the location uaddr
> >>   still contains the value  val3.   If  not,  the  operation
> >>   fails  with  the  error  EAGAIN.  Otherwise, the operation
> >>   wakes up a maximum of val waiters that are waiting on  the
> >>   futex  at uaddr.  If there are more than val waiters, then
> >>   the remaining waiters are removed from the wait  queue  of
> >>   the  source  futex at uaddr and added to the wait queue of
> >>   the target futex at uaddr2.  The val2  argument  specifies
> >>   an  upper limit on the number of waiters that are requeued
> >>   to the futex at uaddr2.
> >>
> >> .\" FIXME(Torvald) Is the following correct?  Or is just the decision
> >> .\" which threads to wake or requeue part of the atomic operation?
> >>
> >>   The load from uaddr is  an  atomic  memory  access  (i.e.,
> >>   using atomic machine instructions of the respective archi‐
> >>   tecture).  This load, the comparison with  val3,  and  the
> >>   requeueing  of  any  waiters  are performed atomically and
> >>   totally ordered with respect to other  operations  on  the
> >>   same futex word.
> > 
> > It's atomic as the other atomic operations on the futex word. It's
> > always performed with the proper lock(s) held in the kernel. That
> > means any concurrent operation will serialize on that lock(s). User
> > space has to make sure, that depending on the observed value no
> > concurrent operations happen, but that's something the kernel cannot
> > control.
> 
> ???
> Sorry, I'm not clear here. Is the current text correct then? Or is some
> change needed.

I think we need some change here because the meaning of atomic is
unclear. The basic rules of futexes are:

 - All modifying operations on the futex value have to be done with
   atomic instructions, usually cmpxchg. That applies to both kernel
   and user space.

   That's the atomicity at the futex value level.

 - In the kernel we have to create/modify/destroy state in order to
   provide the blocking/requeueing etc.

   This state needs protection as well. So all operations related to
   the kernel internal state are serialized on the hash bucket
   locks. The hash buckets are a scalability mechanism to avoid
   contention on a single lock protecting all kernel internal
   state. For simplicity reasons you can just think of a global lock
   protecting all kernel internal state.

   If the kernel creates/modifies state then it can be necessary to
   either reread the futex value or modify it. That happens under the
   locks as well.

   So in the case of requeue, we take the proper locks and perform the
   comparison with val3 and the requeueing with the locks held.
   
   So that lock protection makes these operations 'atomic'. The
   correct expression is 'serialized'.
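
The "atomicity at the futex value level" rule can be sketched in user space with
C11 atomics. This is an illustrative fast path for a PI-style lock word (0 when
free, owner TID when held), not code taken from glibc or the kernel. The
function names are hypothetical, and the slow path (FUTEX_LOCK_PI or
FUTEX_UNLOCK_PI) that a caller would fall back to when a fast path fails is
omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Fast-path acquire: atomically change the word from 0 to tid.
 * Fails if the word holds anything else (lock owned or waiters present).
 */
static bool pi_trylock_fast(_Atomic uint32_t *word, uint32_t tid)
{
    assert(tid != 0); /* 0 is reserved for "not acquired" */
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(word, &expected, tid);
}

/*
 * Fast-path release: atomically change the word from tid to 0.
 * Fails if any other bit (for example FUTEX_WAITERS) has been set,
 * in which case the kernel has to be asked to hand the lock over.
 */
static bool pi_unlock_fast(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;
    return atomic_compare_exchange_strong(word, &expected, 0);
}
```

Both functions modify the word only through a compare-and-exchange, so a
concurrent update by another thread or by the kernel makes the operation fail
rather than being silently overwritten.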
 
> >> .\" FIXME We need some explanation in the following paragraph of *why*
> >> .\"   it is important to note that "the kernel will update the
> >> .\"   futex word's value prior to returning to user space". Can someone
> >> .\"   explain?
> >>
> >>It is important to note that the kernel will update the futex word's
> >>value prior to returning to user space.  Unlike the other futex
> >>operations described above, the PI futex operations are designed for
> >>the implementation of very specific IPC mechanisms.
> > 
> > If there are multiple waiters on a pi futex then a wake pi operation
> > will wake the first waiter and hand over the lock to this waiter. This
> > includes handing over the rtmutex which represents the futex in the
> > kernel. The strict requirement is that the futex owner and the rtmutex
> > owner must be the same, except for the update period which is
> > serialized by the futex internal locking. That means the kernel must
> > update the user space value prior to returning to user space.

And as noted above: It must update while holding the proper locks.

> >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> >> .\"   made the observation that "EINVAL is returned if the non-pi 
> >> .\"   to pi or op pairing semantics are violated."
> >> .\"   Probably there needs to be a general statement about this
> >> .\"   requirement, probably located at about this point in the page.
> >> .\"   Darren (or someone else), care to take a shot at this?
> > 
> > Well, that's hard to describe because the kernel only has a limited
> > way of detecting such mismatches. It only can detect it when there are
> > non PI waiters on a futex and a PI function is called or vice versa.
> 
> Hmmm. Okay, I filed your comments away for reference, but
> hopefully someone can help with som

Re: Next round: revised futex(2) man page for review

2015-08-19 Thread Thomas Gleixner
On Wed, 5 Aug 2015, Darren Hart wrote:
> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> > .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> > .\"   The following text is drawn from the Hart/Guniguntala paper
> > .\"   (listed in SEE ALSO), but I have reworded some pieces
> > .\"   significantly. Please check it.
> > 
> >The PI futex operations described below  differ  from  the  other
> >futex  operations  in  that  they impose policy on the use of the
> >value of the futex word:
> > 
> >*  If the lock is not acquired, the futex word's value  shall  be
> >   0.
> > 
> >*  If  the  lock is acquired, the futex word's value shall be the
> >   thread ID (TID; see gettid(2)) of the owning thread.
> > 
> >*  If the lock is owned and there are threads contending for  the
> >   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >   word's value; in other words, this value is:
> > 
> >   FUTEX_WAITERS | TID
> > 
> > 
> >Note that a PI futex word never just has the value FUTEX_WAITERS,
> >which is a permissible state for non-PI futexes.
> 
> The second clause is inappropriate. I don't know if that was yours or
> mine, but non-PI futexes do not have a kernel defined value policy, so
> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> permissible for non-PI futexes, and none have a kernel defined state.

Depends. If the regular futex is configured as robust, then we have a
kernel defined value policy as well.

> > .\" FIXME I'm not quite clear on the meaning of the following sentence.
> > .\"   Is this trying to say that while blocked in a
> > .\"   FUTEX_WAIT_REQUEUE_PI, it could happen that another
> > .\"   task does a FUTEX_WAKE on uaddr that simply causes
> > .\"   a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
> > .\"   does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
> > .\"   operation? Does it remain blocked, or does it unblock?
> > .\"   In which case, what does user space see?
> > 
> >   The
> >   waiter   can  be  removed  from  the  wait  on  uaddr  via
> >   FUTEX_WAKE without requeueing on uaddr2.
> 
> Userspace should see the task wake and continue executing. This would
> effectively be a cancelation operation - which I didn't think was
> supported. Thomas?

We probably never intended to support it, but looking at the code it
works (did not try it though). It returns to user space with
-EWOULDBLOCK. So it basically behaves like any other spurious wakeup.
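
For readers following along, here is a minimal sketch of how user space usually copes with such wakeups: re-check the futex word in a loop and treat EAGAIN (the same errno value as EWOULDBLOCK on Linux) and EINTR as "look again". It uses plain FUTEX_WAIT rather than FUTEX_WAIT_REQUEUE_PI, and `wait_while_equal` is a made-up helper name, not something from the kernel or glibc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: block for as long as *word still holds `val`.
 * Returns 0 once the value differs, or -1 on an unexpected error. */
static int wait_while_equal(_Atomic uint32_t *word, uint32_t val)
{
    while (atomic_load(word) == val) {
        long rc = syscall(SYS_futex, word, FUTEX_WAIT, val, NULL, NULL, 0);

        /* A return of 0, EAGAIN (== EWOULDBLOCK on Linux) and EINTR all
         * mean the same thing here: go around and re-check the word. */
        if (rc == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }
    return 0;
}
```

The point of the loop is that the caller never trusts a wakeup on its own; only the value of the futex word decides whether to proceed.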
 
Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Next round: revised futex(2) man page for review

2015-08-07 Thread Michael Kerrisk (man-pages)
Hi Darren,

Some of my comments below will refer to the reply I just sent
to tglx (and the list) a few minutes ago.

On 08/06/2015 12:21 AM, Darren Hart wrote:
> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello all,
>>
> 
> Michael, thank you for your diligence in following up and collecting
> reviews. I've attempted to respond to what I was specifically called out
> in or where I had something specific to add in addition to other
> replies.

Thanks!

> After this, will you send another version (numbered for reference
> maybe?) with any remaining FIXMEs that haven't yet been addressed
> according to your accounting?

Yes, I'll be sending out another draft (probably after a short delay,
while I see what further responses come back on the mails I just sent.)
In any case, the latest version of the page can be found at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

>>Priority-inheritance futexes
>>Linux supports priority-inheritance (PI) futexes in order to han‐
>>dle priority-inversion problems that can be encountered with nor‐
>>mal  futex  locks.  Priority inversion is the problem that occurs
>>when a high-priority task is blocked waiting to  acquire  a  lock
>>held  by a low-priority task, while tasks at an intermediate pri‐
>>ority continuously preempt the low-priority task  from  the  CPU.
>>Consequently,  the  low-priority  task  makes  no progress toward
>>releasing the lock, and the high-priority task remains blocked.
>>
>>Priority inheritance is a mechanism for dealing with  the  prior‐
>>ity-inversion problem.  With this mechanism, when a high-priority
>>task becomes blocked by a lock held by a low-priority  task,  the
>>latter's priority is temporarily raised to that of the former, so
>>that it is not preempted by any intermediate level tasks, and can
>>thus  make  progress toward releasing the lock.  To be effective,
>>priority inheritance must be transitive, meaning that if a  high-
>>priority task blocks on a lock held by a lower-priority task that
>>is itself blocked by a lock held by another  intermediate-priority
>>task  (and  so  on, for chains of arbitrary length), then both of
>>those tasks (or more generally, all of the tasks in a lock chain)
>>have  their priorities raised to be the same as the high-priority
>>task.
>>
>> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
>> .\"   based on mail discussions with Darren Hart. Does it seem okay?
>>
>>From a user-space perspective, what makes a futex PI-aware  is  a
>>policy  agreement  between  user  space  and the kernel about the
>>value of the futex word (described in a moment), coupled with the
>>use  of  the  PI futex operations described below (in particular,
>>FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI).
> 
> Yes. Was this intended to be a complete opcode list? 

No. I'll remove that list, in case it's misunderstood that way.

> PI operations must
> use paired operations.
> 
> (FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI
> FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI

And now I've made that point explicitly in the page. See my comment 
lower down.

> And their PRIVATE counterparts of course (which is assumed as it is a
> flag to the opcode).

Yes. But I don't think that needs to be called out explicitly here (?).

>> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
>> .\"   The following text is drawn from the Hart/Guniguntala paper
>> .\"   (listed in SEE ALSO), but I have reworded some pieces
>> .\"   significantly. Please check it.
>>
>>The PI futex operations described below  differ  from  the  other
>>futex  operations  in  that  they impose policy on the use of the
>>value of the futex word:
>>
>>*  If the lock is not acquired, the futex word's value  shall  be
>>   0.
>>
>>*  If  the  lock is acquired, the futex word's value shall be the
>>   thread ID (TID; see gettid(2)) of the owning thread.
>>
>>*  If the lock is owned and there are threads contending for  the
>>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>>   word's value; in other words, this value is:
>>
>>   FUTEX_WAITERS | TID
>>
>>
>>Note that a PI futex word never just has the value FUTEX_WAITERS,
>>which is a permissible state for non-PI futexes.
> 
> The second clause is inappropriate. I don't know if that was yours or
> mine, but non-PI futexes do not have a kernel defined value policy, so
> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> permissible for non-PI futexes, and none have a kernel defined state.
> 
> Perhaps include a Note under the third bullet as:
> 
>  

Re: Next round: revised futex(2) man page for review

2015-08-07 Thread Michael Kerrisk (man-pages)
On 07/28/2015 11:03 PM, Thomas Gleixner wrote:
> On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> 
>> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>>
>>>> FUTEX_WAKE (since Linux 2.6.0)
>>>>    This  operation  wakes at most val of the waiters that are
>>>>    waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>>>>    address  uaddr.  Most commonly, val is specified as either
>>>>    1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
>>>>    ers).   No  guarantee  is provided about which waiters are
>>>>    awoken (e.g., a waiter with a higher  scheduling  priority
>>>>    is  not  guaranteed to be awoken in preference to a waiter
>>>>    with a lower priority).
>>>
>>> That's only correct up to Linux 2.6.21.
>>>
>>> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
>>> threads this takes the nice level into account. Threads with the same
>>> priority are woken in FIFO order.
>>
>> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
>> 'wrong'.
> 
> Indeed.
>  
>> Also, this code seems to use plist, which means it won't do the right
>> thing for SCHED_DEADLINE either.
>>
>> Do we want to go fix that?
> 
> I think so.

So, no change to this piece of text then?

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Next round: revised futex(2) man page for review

2015-08-07 Thread Michael Kerrisk (man-pages)
Hi Thomas,

Thank you for the comments below. This helps hugely:
more than 30 of my FIXMEs have now gone away!

I have a few open questions, which you can find
by searching for the string "???". If you would have
a chance to look at those, I'd appreciate it.

On 07/28/2015 10:23 PM, Thomas Gleixner wrote:
> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>>FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
>>   This   option   bit   can   be   employed  only  with  the
>>   FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations.
>>
>>   If this option is set, the kernel  treats  timeout  as  an
>>   absolute time based on CLOCK_REALTIME.
>>
>> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
>>   If  this  option  is not set, the kernel treats timeout as
>>   relative time, measured against the CLOCK_MONOTONIC clock.
> 
> That's correct.

Thanks.

>>The operation specified in futex_op is one of the following:
>>
>>FUTEX_WAIT (since Linux 2.6.0)
>>   This operation tests that the  value  at  the  futex  word
>>   pointed  to  by  the  address  uaddr  still  contains  the
>>   expected value  val,  and  if  so,  then  sleeps  awaiting
>>   FUTEX_WAKE  on  the  futex word.  The load of the value of
>>   the futex word is an atomic  memory  access  (i.e.,  using
>>   atomic  machine  instructions  of the respective architec‐
>>   ture).  This load, the comparison with the expected value,
>>   and starting to sleep are performed atomically and totally
>>   ordered with respect to other futex operations on the same
>>   futex  word.  If the thread starts to sleep, it is consid‐
>>   ered a waiter on this futex word.  If the futex value does
>>   not  match  val,  then the call fails immediately with the
>>   error EAGAIN.
>>
>>   The purpose of the comparison with the expected  value  is
>>   to  prevent  lost  wake-ups: If another thread changed the
>>   value of the futex word after the calling  thread  decided
>>   to block based on the prior value, and if the other thread
>>   executed a FUTEX_WAKE operation (or similar wake-up) after
>>   the  value  change  and  before this FUTEX_WAIT operation,
>>   then the latter will observe the value change and will not
>>   start to sleep.
>>
>>   If  the timeout argument is non-NULL, its contents specify
>>   a relative timeout for the wait, measured according to the
>> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
> 
> Yes.

Thanks.

> 
>>   CLOCK_MONOTONIC  clock.  (This interval will be rounded up
>>   to the system clock  granularity,  and  kernel  scheduling
>>   delays  mean  that  the blocking interval may overrun by a
>>   small amount.)
> 
>   The given wait time will be rounded up to the system
>   clock granularity and is guaranteed not to expire
>   early.
> 
> There are a gazillion reasons why it can expire late, but the
> guarantee is that it never expires prematurely.
> 
>>If timeout is NULL, the call blocks indef‐
>>   initely.
> 
> Right.

Thanks. Reworded as you suggest. 
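
A small sketch of how the "never expires early" guarantee can be observed from user space, assuming a futex word that nobody wakes. The helper name `timed_wait_ns` is hypothetical; it issues a FUTEX_WAIT with a relative timeout and measures the elapsed time on CLOCK_MONOTONIC.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: FUTEX_WAIT with a relative timeout in milliseconds.
 * Returns the elapsed time in nanoseconds if the call ended with
 * ETIMEDOUT, or -1 if it ended for any other reason. */
static int64_t timed_wait_ns(uint32_t *word, uint32_t expected, long timeout_ms)
{
    struct timespec rel = {
        .tv_sec  = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long rc = syscall(SYS_futex, word, FUTEX_WAIT, expected, &rel, NULL, 0);
    int err = errno;                     /* save before any other call */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (rc != -1 || err != ETIMEDOUT)
        return -1;
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

The measured interval may be arbitrarily longer than requested, but with the guarantee described above it should not be shorter.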

>>   The arguments uaddr2 and val3 are ignored.
>>
>>
>>FUTEX_WAKE (since Linux 2.6.0)
>>   This  operation  wakes at most val of the waiters that are
>>   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>>   address  uaddr.  Most commonly, val is specified as either
>>   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
>>   ers).   No  guarantee  is provided about which waiters are
>>   awoken (e.g., a waiter with a higher  scheduling  priority
>>   is  not  guaranteed to be awoken in preference to a waiter
>>   with a lower priority).
> 
> That's only correct up to Linux 2.6.21.
> 
> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> threads this takes the nice level into account. Threads with the same
> priority are woken in FIFO order.

So, this got picked up in a little subthread by Peter Zijlstra. I'll
reply there.

>>   The arguments timeout, uaddr2, and val3 are ignored.
>  
>>
>>FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
>>   This operation creates a file descriptor that  is  associ‐
>>   ated  with  the futex at uaddr.  The caller must close the
>>   returned file descriptor after use.  When another  process
>>   or  thread  performs  a  FUTEX_WAKE on the futex word, the
>>   file  descriptor  indicates   as   being   readable   with
>>

Re: Next round: revised futex(2) man page for review

2015-08-05 Thread Darren Hart
On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello all,
> 

Michael, thank you for your diligence in following up and collecting
reviews. I've attempted to respond to what I was specifically called out
in or where I had something specific to add in addition to other
replies.

After this, will you send another version (numbered for reference
maybe?) with any remaining FIXMEs that haven't yet been addressed
according to your accounting?

...

>Priority-inheritance futexes
>Linux supports priority-inheritance (PI) futexes in order to han‐
>dle priority-inversion problems that can be encountered with nor‐
>mal  futex  locks.  Priority inversion is the problem that occurs
>when a high-priority task is blocked waiting to  acquire  a  lock
>held  by a low-priority task, while tasks at an intermediate pri‐
>ority continuously preempt the low-priority task  from  the  CPU.
>Consequently,  the  low-priority  task  makes  no progress toward
>releasing the lock, and the high-priority task remains blocked.
> 
>Priority inheritance is a mechanism for dealing with  the  prior‐
>ity-inversion problem.  With this mechanism, when a high-priority
>task becomes blocked by a lock held by a low-priority  task,  the
>latter's priority is temporarily raised to that of the former, so
>that it is not preempted by any intermediate level tasks, and can
>thus  make  progress toward releasing the lock.  To be effective,
>priority inheritance must be transitive, meaning that if a  high-
>priority task blocks on a lock held by a lower-priority task that
>is itself blocked by a lock held by another  intermediate-priority
>task  (and  so  on, for chains of arbitrary length), then both of
>those tasks (or more generally, all of the tasks in a lock chain)
>have  their priorities raised to be the same as the high-priority
>task.
> 
> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
> .\"   based on mail discussions with Darren Hart. Does it seem okay?
> 
>From a user-space perspective, what makes a futex PI-aware  is  a
>policy  agreement  between  user  space  and the kernel about the
>value of the futex word (described in a moment), coupled with the
>use  of  the  PI futex operations described below (in particular,
>FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI).

Yes. Was this intended to be a complete opcode list? PI operations must
use paired operations.

(FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI
FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI

And their PRIVATE counterparts of course (which is assumed as it is a
flag to the opcode).
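
To make the first pairing concrete, here is a rough sketch of a PI lock built on FUTEX_LOCK_PI and FUTEX_UNLOCK_PI. It assumes the usual convention that the uncontended case is handled in user space with a compare-and-swap between 0 and the caller's TID, and that the kernel is entered only on contention. The names `pi_lock` and `pi_unlock` are made up for illustration, and real code would also need to handle errors such as EINTR or EDEADLK.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: acquire a PI futex. Returns 0 on success. */
static int pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    /* Fast path: 0 -> TID without entering the kernel. */
    if (atomic_compare_exchange_strong(word, &expected, tid))
        return 0;

    /* Slow path: the kernel sets FUTEX_WAITERS, blocks us, and applies
     * priority inheritance to the current owner. */
    return (int)syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Hypothetical helper: release a PI futex taken with pi_lock(). */
static int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    /* Fast path: TID -> 0 succeeds only if FUTEX_WAITERS is not set. */
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;

    /* Slow path: the paired operation hands the lock to a waiter. */
    return (int)syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

Note that the unlock side never falls back to a plain FUTEX_WAKE; a lock taken through the PI operations is released through the paired PI operation.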

> 
> .\" FIXME XXX = Start of adapted Hart/Guniguntala text =
> .\"   The following text is drawn from the Hart/Guniguntala paper
> .\"   (listed in SEE ALSO), but I have reworded some pieces
> .\"   significantly. Please check it.
> 
>The PI futex operations described below  differ  from  the  other
>futex  operations  in  that  they impose policy on the use of the
>value of the futex word:
> 
>*  If the lock is not acquired, the futex word's value  shall  be
>   0.
> 
>*  If  the  lock is acquired, the futex word's value shall be the
>   thread ID (TID; see gettid(2)) of the owning thread.
> 
>*  If the lock is owned and there are threads contending for  the
>   lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>   word's value; in other words, this value is:
> 
>   FUTEX_WAITERS | TID
> 
> 
>Note that a PI futex word never just has the value FUTEX_WAITERS,
>which is a permissible state for non-PI futexes.

The second clause is inappropriate. I don't know if that was yours or
mine, but non-PI futexes do not have a kernel defined value policy, so
==FUTEX_WAITERS cannot be a "permissible state" as any value is
permissible for non-PI futexes, and none have a kernel defined state.

Perhaps include a Note under the third bullet as:

  Note: It is invalid for a PI futex word to have no owner and
FUTEX_WAITERS set.
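
The value policy plus this note can be written down as a small classifier. This is only an illustration of the rules quoted above: the enum and the function `pi_classify` are hypothetical names, while FUTEX_WAITERS and FUTEX_TID_MASK come from <linux/futex.h>. The FUTEX_OWNER_DIED bit used by robust futexes is deliberately ignored in this sketch.

```c
#include <linux/futex.h>
#include <stdint.h>

/* Hypothetical names for the states implied by the PI value policy. */
enum pi_state {
    PI_UNLOCKED,          /* word == 0                           */
    PI_OWNED,             /* word == TID                         */
    PI_OWNED_CONTENDED,   /* word == FUTEX_WAITERS | TID         */
    PI_INVALID            /* FUTEX_WAITERS set, but no owner TID */
};

/* Classify a PI futex word. FUTEX_OWNER_DIED is not examined here. */
static enum pi_state pi_classify(uint32_t word)
{
    uint32_t tid = word & FUTEX_TID_MASK;
    int waiters = (word & FUTEX_WAITERS) != 0;

    if (tid == 0)
        return waiters ? PI_INVALID : PI_UNLOCKED;
    return waiters ? PI_OWNED_CONTENDED : PI_OWNED;
}
```

For example, `pi_classify(FUTEX_WAITERS)` yields `PI_INVALID`, which is the case the proposed note rules out.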

> 
>With this policy in place, a user-space application can acquire a
>not-acquired lock or release a lock that no other threads try  to

"that no other threads try to acquire" seems out of place. I think
"atomic instructions" is sufficient to express how contention is
handled.

>acquire using atomic instructions executed in user space (e.g., a
>compare-and-swap operation such as cmpxchg on the  x86  architec‐
>ture).   Acquiring  a  lock simply consists of using compare-and-
>swap to atomically set the futex word's value to the caller's TID
>if  its  previous  value  was 0.  Releasing a lock req

Re: Next round: revised futex(2) man page for review

2015-07-30 Thread Michael Kerrisk (man-pages)
On 07/29/2015 06:21 AM, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
>> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
>>> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>>
>> ...
>>
>>>> FUTEX_REQUEUE (since Linux 2.6.0)
>>>>  .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
>>>>  .\" in general, or is this comment implicitly speaking about the
>>>>  .\" condvar (?) use case? If the latter we might want to weaken the
>>>>  .\" advice below a little.
>>>>  .\" [Anyone else have input on this?]
>>>
>>> The condvar use case exposes the flaw nicely, but that's pretty much
>>> true for everything which wants a sane requeue operation.
>>
>> In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken and
>> should not be used in new code) and someone argued strongly against... stating
>> that there were legitimate uses for it. Of course I'm struggling to find the
>> thread and the reference at the moment - immensely useful, I know.
>>
>> I'll continue trying to find it and see if it can be useful here. I believe
>> Torvald was on the thread as well.
>>
> 
> Found it on libc-alpha, here it is for reference:
> 
>   From: Rich Felker 
>   Date: Wed, 29 Oct 2014 22:43:17 -0400
>   To: Darren Hart 
>   Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
>   Michael Kerrisk
>   Subject: Re: Add futex wrapper to glibc?
> 
>   On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
>   > > We are IMO at the stage where futex is stable, few things are
>   > > changing, and with documentation in place, I would consider adding a
>   > > futex wrapper.
>   > 
>   > Yes, at least for the defined OP codes. New OPs may be added of
>   > course, but that isn't a concern for supporting what exists today, and
>   > doesn't break compatibility.
>   > 
>   > I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
>   > broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
>   > wrapper is one way to encourage developers to do the right thing
>   > (don't expose the bad op in the header).
> 
>   You're mistaken here. There are plenty of valid ways to use
>   FUTEX_REQUEUE - for example if the calling thread is requeuing the
>   target(s) to a lock that the calling thread owns. Just because it
>   doesn't meet the needs of the way glibc was using it internally
>   doesn't mean it's useless for other applications.
> 
>   In any case, I don't think there's a proposal to intercept/modify the
>   commands to futex, just to pass them through (and possibly do a
>   cancellable syscall for some of them).
> 
>   Rich
> 
> 
>>>
>>>>    Avoid using this operation.  It is broken for its intended
>>>>    purpose.  Use FUTEX_CMP_REQUEUE instead.
>>>>
>>>>    This operation performs the same task as
>>>>    FUTEX_CMP_REQUEUE, except that no check is made using  the
>>>>    value in val3.  (The argument val3 is ignored.)

Thanks, Darren, that's really helpful! I've removed the statement in the man
page that FUTEX_REQUEUE is broken.

By the way, Darren. There were a couple of FIXMEs in the page where you are
explicitly mentioned by name. Could you take a look at those? Specifically,
the large block of text starting at:

>> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
>> .\"   based on mail discussions with Darren Hart. Does it seem okay?

   (tglx looked at this and blessed it, but I'd like you also to check.)

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Next round: revised futex(2) man page for review

2015-07-29 Thread Thomas Gleixner
On Tue, 28 Jul 2015, Darren Hart wrote:
> Found it on libc-alpha, here it is for reference:
> 
>   From: Rich Felker 
>   Date: Wed, 29 Oct 2014 22:43:17 -0400
>   To: Darren Hart 
>   Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
>   Michael Kerrisk
>   Subject: Re: Add futex wrapper to glibc?
> 
>   On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
>   > > We are IMO at the stage where futex is stable, few things are
>   > > changing, and with documentation in place, I would consider adding a
>   > > futex wrapper.
>   > 
>   > Yes, at least for the defined OP codes. New OPs may be added of
>   > course, but that isn't a concern for supporting what exists today, and
>   > doesn't break compatibility.
>   > 
>   > I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
>   > broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
>   > wrapper is one way to encourage developers to do the right thing
>   > (don't expose the bad op in the header).
> 
>   You're mistaken here. There are plenty of valid ways to use
>   FUTEX_REQUEUE - for example if the calling thread is requeuing the
>   target(s) to a lock that the calling thread owns. Just because it
>   doesn't meet the needs of the way glibc was using it internally
>   doesn't mean it's useless for other applications.
> 
>   In any case, I don't think there's a proposal to intercept/modify the
>   commands to futex, just to pass them through (and possibly do a
>   cancellable syscall for some of them).

Fair enough. Did not think about the requeue to futex held by the
caller case. In that case FUTEX_REQUEUE works as advertised.
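
For reference, a minimal sketch of the two requeue calls side by side. The wrapper names `futex_requeue` and `futex_cmp_requeue` are hypothetical; the argument layout follows futex(2), where the requeue limit travels in the slot otherwise used for the timeout pointer. The only difference between the two is that FUTEX_CMP_REQUEUE first checks that the word at the source address still holds the expected value and fails with EAGAIN if it does not.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: wake up to nr_wake waiters on `from` and move up
 * to nr_requeue of the remaining ones to `to`, with no value check. */
static long futex_requeue(uint32_t *from, uint32_t *to,
                          int nr_wake, int nr_requeue)
{
    return syscall(SYS_futex, from, FUTEX_REQUEUE, nr_wake,
                   (unsigned long)nr_requeue, to, 0);
}

/* Hypothetical wrapper: same as above, but the kernel first compares
 * *from with `expected` and fails with EAGAIN on a mismatch. */
static long futex_cmp_requeue(uint32_t *from, uint32_t *to,
                              int nr_wake, int nr_requeue,
                              uint32_t expected)
{
    return syscall(SYS_futex, from, FUTEX_CMP_REQUEUE, nr_wake,
                   (unsigned long)nr_requeue, to, expected);
}
```

In the case Rich describes, the caller owns the lock at `to`, so the unchecked variant is used on the assumption that the target cannot change underneath the requeue.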

Thanks,

tglx


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Darren Hart
On Tue, Jul 28, 2015 at 09:11:41PM -0700, Darren Hart wrote:
> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> > On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
> 
> ...
> 
> > >FUTEX_REQUEUE (since Linux 2.6.0)
> > > .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > > .\" in general, or is this comment implicitly speaking about the
> > > .\" condvar (?) use case? If the latter we might want to weaken the
> > > .\" advice below a little.
> > > .\" [Anyone else have input on this?]
> > 
> > The condvar use case exposes the flaw nicely, but that's pretty much
> > true for everything which wants a sane requeue operation.
> 
> In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken and
> should not be used in new code) and someone argued strongly against... stating
> that there were legitimate uses for it. Of course I'm struggling to find the
> thread and the reference at the moment - immensely useful, I know.
> 
> I'll continue trying to find it and see if it can be useful here. I believe
> Torvald was on the thread as well.
> 

Found it on libc-alpha, here it is for reference:

From: Rich Felker 
Date: Wed, 29 Oct 2014 22:43:17 -0400
To: Darren Hart 
Cc: Carlos O'Donell, Roland McGrath, Torvald Riegel, GLIBC Devel,
Michael Kerrisk
Subject: Re: Add futex wrapper to glibc?

On Wed, Oct 29, 2014 at 06:59:15PM -0700, Darren Hart wrote:
> > We are IMO at the stage where futex is stable, few things are
> > changing, and with documentation in place, I would consider adding a
> > futex wrapper.
> 
> Yes, at least for the defined OP codes. New OPs may be added of
> course, but that isn't a concern for supporting what exists today, and
> doesn't break compatibility.
> 
> I wonder though... can we not wrap FUTEX_REQUEUE? It's fundamentally
> broken.  FUTEX_CMP_REQUEUE should *always* be used instead. The glibc
> wrapper is one way to encourage developers to do the right thing
> (don't expose the bad op in the header).

You're mistaken here. There are plenty of valid ways to use
FUTEX_REQUEUE - for example if the calling thread is requeuing the
target(s) to a lock that the calling thread owns. Just because it
doesn't meet the needs of the way glibc was using it internally
doesn't mean it's useless for other applications.

In any case, I don't think there's a proposal to intercept/modify the
commands to futex, just to pass them through (and possibly do a
cancellable syscall for some of them).

Rich


> > 
> > >   Avoid using this operation.  It is broken for its intended
> > >   purpose.  Use FUTEX_CMP_REQUEUE instead.
> > > 
> > >   This operation performs the same task as
> > >   FUTEX_CMP_REQUEUE, except that no check is made using  the
> > >   value in val3.  (The argument val3 is ignored.)
> > > 
> 
> -- 
> Darren Hart
> Intel Open Source Technology Center

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Darren Hart
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:

...

> >FUTEX_REQUEUE (since Linux 2.6.0)
> > .\" FIXME(Torvald) Is there some indication that FUTEX_REQUEUE is broken
> > .\" in general, or is this comment implicitly speaking about the
> > .\" condvar (?) use case? If the latter we might want to weaken the
> > .\" advice below a little.
> > .\" [Anyone else have input on this?]
> 
> The condvar use case exposes the flaw nicely, but that's pretty much
> true for everything which wants a sane requeue operation.

In an earlier discussion I argued this point (that FUTEX_REQUEUE is broken and
should not be used in new code) and someone argued strongly against... stating
that there were legitimate uses for it. Of course I'm struggling to find the
thread and the reference at the moment - immensely useful, I know.

I'll continue trying to find it and see if it can be useful here. I believe
Torvald was on the thread as well.

> 
> >   Avoid using this operation.  It is broken for its intended
> >   purpose.  Use FUTEX_CMP_REQUEUE instead.
> > 
> >   This operation performs the same task as
> >   FUTEX_CMP_REQUEUE, except that no check is made using  the
> >   value in val3.  (The argument val3 is ignored.)
> > 

-- 
Darren Hart
Intel Open Source Technology Center


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 22:45 +0200, Peter Zijlstra wrote:
> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.

Ick, I don't look forward to seeing nice futex plists converted into
rbtrees. As opposed to, eg. rtmutexes, there are a few caveats:

- Dealing with the top_waiter in rtmutexes is always easy, but in
futexes we need to deal with keys, so caching the leftmost won't work as
nicely.

- This will bloat things like futex_wake, where O(logN) is not suited
for FIFO iteration. And iterating linked lists is, in essence, all that
we really do when calling futex(2).

I have to wonder about the extra overhead added by these points.  I do
understand the dl concern, nonetheless.



Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Thomas Gleixner
On Tue, 28 Jul 2015, Peter Zijlstra wrote:

> On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:
> 
> > >FUTEX_WAKE (since Linux 2.6.0)
> > >   This  operation  wakes at most val of the waiters that are
> > >   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> > >   address  uaddr.  Most commonly, val is specified as either
> > >   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
> > >   ers).   No  guarantee  is provided about which waiters are
> > >   awoken (e.g., a waiter with a higher  scheduling  priority
> > >   is  not  guaranteed to be awoken in preference to a waiter
> > >   with a lower priority).
> > 
> > That's only correct up to Linux 2.6.21.
> > 
> > Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> > threads this takes the nice level into account. Threads with the same
> > priority are woken in FIFO order.
> 
> Maybe don't mention the effects of SCHED_OTHER, order by nice value is
> 'wrong'.

Indeed.
 
> Also, this code seems to use plist, which means it won't do the right
> thing for SCHED_DEADLINE either.
> 
> Do we want to go fix that?

I think so.
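
Whatever wake-up ordering ends up being documented, the caller-visible contract of FUTEX_WAKE is the count. A minimal sketch follows, with hypothetical wrapper names `wake_one` and `wake_all`; each returns the number of waiters actually woken, which is 0 when nobody is waiting.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: wake at most one waiter on `word`. */
static long wake_one(uint32_t *word)
{
    return syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Hypothetical wrapper: wake every waiter on `word`. */
static long wake_all(uint32_t *word)
{
    return syscall(SYS_futex, word, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```

Portable callers are probably best off not depending on which waiter `wake_one` picks, given that the ordering differs between kernel versions and scheduling classes as discussed above.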

Thanks,

tglx


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Peter Zijlstra
On Tue, Jul 28, 2015 at 10:23:51PM +0200, Thomas Gleixner wrote:

> >FUTEX_WAKE (since Linux 2.6.0)
> >   This  operation  wakes at most val of the waiters that are
> >   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
> >   address  uaddr.  Most commonly, val is specified as either
> >   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
> >   ers).   No  guarantee  is provided about which waiters are
> >   awoken (e.g., a waiter with a higher  scheduling  priority
> >   is  not  guaranteed to be awoken in preference to a waiter
> >   with a lower priority).
> 
> That's only correct up to Linux 2.6.21.
> 
> Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
> threads this takes the nice level into account. Threads with the same
> priority are woken in FIFO order.

Maybe don't mention the effects of SCHED_OTHER, order by nice value is
'wrong'.

Also, this code seems to use plist, which means it won't do the right
thing for SCHED_DEADLINE either.

Do we want to go fix that?


Re: Next round: revised futex(2) man page for review

2015-07-28 Thread Thomas Gleixner
On Mon, 27 Jul 2015, Michael Kerrisk (man-pages) wrote:
>FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
>   This   option   bit   can   be   employed  only  with  the
>   FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI operations.
> 
>   If this option is set, the kernel  treats  timeout  as  an
>   absolute time based on CLOCK_REALTIME.
> 
> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?
>   If  this  option  is not set, the kernel treats timeout as
>   relative time, measured against the CLOCK_MONOTONIC clock.

That's correct.
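
A short sketch of the absolute-timeout case, assuming FUTEX_WAIT_BITSET with FUTEX_BITSET_MATCH_ANY as the wait operation. The helpers `realtime_after_ms` and `wait_until_realtime` are hypothetical names; the deadline is a point on the CLOCK_REALTIME timeline, not an interval.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: absolute CLOCK_REALTIME time `ms` from now. */
static struct timespec realtime_after_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec  += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {     /* carry into seconds */
        ts.tv_sec  += 1;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

/* Hypothetical helper: wait on `word` until woken or until the absolute
 * CLOCK_REALTIME deadline passes. Returns 0 or an errno value. */
static int wait_until_realtime(uint32_t *word, uint32_t expected,
                               const struct timespec *deadline)
{
    long rc = syscall(SYS_futex, word,
                      FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                      expected, deadline, NULL, FUTEX_BITSET_MATCH_ANY);
    return rc == -1 ? errno : 0;
}
```

Because the deadline is absolute, a caller that retries after EINTR can pass the same timespec again without recomputing a remaining interval.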

>The operation specified in futex_op is one of the following:
> 
>FUTEX_WAIT (since Linux 2.6.0)
>   This operation tests that the  value  at  the  futex  word
>   pointed  to  by  the  address  uaddr  still  contains  the
>   expected value  val,  and  if  so,  then  sleeps  awaiting
>   FUTEX_WAKE  on  the  futex word.  The load of the value of
>   the futex word is an atomic  memory  access  (i.e.,  using
>   atomic  machine  instructions  of the respective architec‐
>   ture).  This load, the comparison with the expected value,
>   and starting to sleep are performed atomically and totally
>   ordered with respect to other futex operations on the same
>   futex  word.  If the thread starts to sleep, it is consid‐
>   ered a waiter on this futex word.  If the futex value does
>   not  match  val,  then the call fails immediately with the
>   error EAGAIN.
> 
>   The purpose of the comparison with the expected  value  is
>   to  prevent  lost  wake-ups: If another thread changed the
>   value of the futex word after the calling  thread  decided
>   to block based on the prior value, and if the other thread
>   executed a FUTEX_WAKE operation (or similar wake-up) after
>   the  value  change  and  before this FUTEX_WAIT operation,
>   then the latter will observe the value change and will not
>   start to sleep.
> 
>   If  the timeout argument is non-NULL, its contents specify
>   a relative timeout for the wait, measured according to the
> .\" FIXME XXX I added CLOCK_MONOTONIC below. Okay?

Yes.

>   CLOCK_MONOTONIC  clock.  (This interval will be rounded up
>   to the system clock  granularity,  and  kernel  scheduling
>   delays  mean  that  the blocking interval may overrun by a
>   small amount.)

The given wait time will be rounded up to the system
clock granularity and is guaranteed not to expire
early.

There are a gazillion reasons why it can expire late, but the
guarantee is that it never expires prematurely.
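
A small sketch of how that guarantee could be checked from user space, assuming Linux and <linux/futex.h>. The helper timed_wait_ms() is a made-up name for this example only; it measures the blocking interval against CLOCK_MONOTONIC, the clock the relative timeout is measured against.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: FUTEX_WAIT with a relative timeout given in
 * milliseconds.  The time spent in the call, measured on CLOCK_MONOTONIC,
 * is stored in *elapsed_ns.  Returns the raw syscall result and leaves
 * errno as set by the futex call. */
static long timed_wait_ms(uint32_t *uaddr, uint32_t expected,
                          long timeout_ms, int64_t *elapsed_ns)
{
    struct timespec rel = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };
    struct timespec start, end;
    long ret;
    int saved_errno;

    clock_gettime(CLOCK_MONOTONIC, &start);
    ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &rel, NULL, 0);
    saved_errno = errno;            /* clock_gettime() must not clobber it */
    clock_gettime(CLOCK_MONOTONIC, &end);

    *elapsed_ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
                  + (end.tv_nsec - start.tv_nsec);
    errno = saved_errno;
    return ret;
}
```

If nobody wakes the caller, the call should fail with ETIMEDOUT and the measured interval should be at least the requested timeout; it may be longer by an arbitrary amount.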

> If timeout is NULL, the call blocks indef‐
>   initely.

Right.
 
>   The arguments uaddr2 and val3 are ignored.
> 
> 
>FUTEX_WAKE (since Linux 2.6.0)
>   This  operation  wakes at most val of the waiters that are
>   waiting (e.g., inside FUTEX_WAIT) on the futex word at the
>   address  uaddr.  Most commonly, val is specified as either
>   1 (wake up a single waiter) or INT_MAX (wake up all  wait‐
>   ers).   No  guarantee  is provided about which waiters are
>   awoken (e.g., a waiter with a higher  scheduling  priority
>   is  not  guaranteed to be awoken in preference to a waiter
>   with a lower priority).

That's only correct up to Linux 2.6.21.

Since 2.6.22 we have a priority ordered wakeup. For SCHED_OTHER
threads this takes the nice level into account. Threads with the same
priority are woken in FIFO order.
 
>   The arguments timeout, uaddr2, and val3 are ignored.
 
> 
>FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
>   This operation creates a file descriptor that  is  associ‐
>   ated  with  the futex at uaddr.  The caller must close the
>   returned file descriptor after use.  When another  process
>   or  thread  performs  a  FUTEX_WAKE on the futex word, the
>   file  descriptor  indicates   as   being   readable   with
>   select(2), poll(2), and epoll(7)
> 
>   The  file  descriptor  can  be used to obtain asynchronous
>   notifications:  if  val  is  nonzero,  then  when  another
>   process  or  thread executes a FUTEX_WAKE, the caller will
>   receive the signal number that was passed in val.
> 
>   The arguments timeout, uaddr2 and val3 are ignored.
> 
> .\" FIXME(Torvald) We never define "upped".  Maybe just remove the
> .\"  following sentence?
>   To prevent race

Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
On 07/28/2015 07:52 PM, Davidlohr Bueso wrote:
> On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
>> Maybe you still have some further improvements for the paragraph?
> 
> Nah, this is fine enough. Looks good.

Okay. Thanks. I added a Reviewed-by: for you.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-28 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 09:44 +0200, Michael Kerrisk (man-pages) wrote:
> Maybe you still have some further improvements for the paragraph?

Nah, this is fine enough. Looks good.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-07-28 Thread Michael Kerrisk (man-pages)
Hi David,

On 07/28/2015 05:16 AM, Davidlohr Bueso wrote:
> On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote:
>> Hi David,
>>
>> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
>>> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
>>>
>>>> The condition is represented by the futex word, which is an address in
>>>> memory supplied to the futex() system call, and the value at this
>>>> memory location.  (While the virtual addresses for the same memory in
>>>> separate processes may not be equal, the kernel maps them internally so
>>>> that the same memory mapped in different locations will correspond for
>>>> futex() calls.)
>>>>
>>>> When executing a futex operation that requests to block a thread, the
>>>> kernel will only block if the futex word has the value that the calling
>>>
>>> Given the use of "word", you should probably state right away that
>>> futexes are only 32bit.
>>
>> So, I made the opening sentence here:
>>
>>The  condition  is  represented  by  the  futex word, which is an
>>address in memory supplied to the futex() system  call,  and  the
>>32-bit  value  at  this  memory  location. 
>>
>> Okay?
> 
> I think we can still improve :)
> 
> I've re-read the whole first paragraphs, and have a few comments that
> touch upon this specific wording. Let's see. You have:
> 
>>The futex() system call provides a method for waiting until a certain
>>condition becomes true.  It is typically used as a blocking construct
>>in the context of shared-memory synchronization: The program implements
>>the majority of the synchronization in user space, and uses one of the
>>operations of the system call when it is likely that it has to block
>>for a longer time until the condition becomes true.  The program uses
>>another operation of the system call to wake anyone waiting for a
>>particular condition.
> 
> I've rephrased the next paragraph. How about adding this to follow?
> 
>A futex is in essence a 32-bit user-space address. All futex operations and
>conditions are governed by this variable, from now on referred to as 'futex
>word'. As such, a futex is identified by the address in shared memory, which
>may or may not be shared between different processes. For virtual memory, the
>kernel will internally handle and resolve the latter. This futex word, along
>with the value at its memory location, are supplied to the futex() system
>call.
> 
> Feel free to reword however you think is better.


I agree with you that that second paragraph is a bit broken. But, like Heinrich,
I'm confused by this term "32-bit ... address".

I've rewritten the paragraph as:

   A futex is a 32-bit value—referred to below as a futex word—whose
   address is supplied to the futex()  system  call.   (Futexes  are
   32 bits in size on all platforms, including 64-bit systems.)  All
   futex operations are governed by this value.  In order to share a
   futex  between  processes,  the  futex  is  placed in a region of
   shared memory, created using (for example) mmap(2)  or  shmat(2).
   (Thus the futex word may have different virtual addresses in dif‐
   ferent processes, but these addresses all refer to the same loca‐
   tion in physical memory.)
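
To make the sharing point concrete, here is a minimal sketch under stated assumptions: Linux, <linux/futex.h>, and the GCC/Clang __atomic builtins. The function name shared_futex_demo() is hypothetical. The futex word lives in a MAP_SHARED anonymous mapping created with mmap(2), a child created with fork(2) changes the word and wakes the parent, and FUTEX_PRIVATE_FLAG is deliberately not used because the futex is shared between processes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thin wrapper for the operations used here (no timeout, no uaddr2). */
static long futex_call(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Hypothetical demo: returns 0 if the parent observed the child's update
 * of the shared futex word, -1 otherwise. */
static int shared_futex_demo(void)
{
    uint32_t *word = mmap(NULL, sizeof(*word), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pid_t pid;
    int ok;

    if (word == MAP_FAILED)
        return -1;
    *word = 0;

    pid = fork();
    if (pid < 0) {
        munmap(word, sizeof(*word));
        return -1;
    }
    if (pid == 0) {
        /* Child: change the value first, then wake one waiter. */
        __atomic_store_n(word, 1, __ATOMIC_SEQ_CST);
        futex_call(word, FUTEX_WAKE, 1);
        _exit(0);
    }

    /* Parent: sleep only while the word still holds 0.  EAGAIN means the
     * child already changed it; EINTR means a signal interrupted the wait. */
    while (__atomic_load_n(word, __ATOMIC_SEQ_CST) == 0) {
        if (futex_call(word, FUTEX_WAIT, 0) == -1 &&
            errno != EAGAIN && errno != EINTR)
            break;
    }

    ok = (__atomic_load_n(word, __ATOMIC_SEQ_CST) == 1);
    waitpid(pid, NULL, 0);
    munmap(word, sizeof(*word));
    return ok ? 0 : -1;
}
```

The mapping is inherited across fork(2) here, so both processes happen to use the same virtual address; with shmat(2) or separately mapped files the addresses could differ while still naming the same futex.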

Maybe you still have some further improvements for the paragraph?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Aw: Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Tue, 2015-07-28 at 07:44 +0200, Heinrich Schuchardt wrote:
> Hello David,
> 
> >> A futex is in essence a 32-bit user-space address.
> I know what a 32 bit integer is.
> I am not aware of 32 bit addresses on my 64 bit operating system.

Well I am referring to in the context of a user-space address, such as a
32-bit lock ('int'), but yes, my text is misleading. In fact we
obviously need to cast to the word size for doing gup_fast, among other
tasks.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 07/28/2015 04:52 AM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
>> SEE ALSO
>>get_robust_list(2), restart_syscall(2), futex(7)
> 
> For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which
> is a common entry point.

Thanks. Added.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Mon, 2015-07-27 at 13:10 +0200, Michael Kerrisk (man-pages) wrote:
> Hi David,
> 
> On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
> > On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> > 
> >>The condition is represented by the futex word, which is an address in
> >>memory supplied to the futex() system call, and the value at this
> >>memory location.  (While the virtual addresses for the same memory in
> >>separate processes may not be equal, the kernel maps them internally so
> >>that the same memory mapped in different locations will correspond for
> >>futex() calls.)
> >>
> >>When executing a futex operation that requests to block a thread, the
> >>kernel will only block if the futex word has the value that the calling
> > 
> > Given the use of "word", you should probably state right away that
> > futexes are only 32bit.
> 
> So, I made the opening sentence here:
> 
>The  condition  is  represented  by  the  futex word, which is an
>address in memory supplied to the futex() system  call,  and  the
>32-bit  value  at  this  memory  location. 
> 
> Okay?

I think we can still improve :)

I've re-read the whole first paragraphs, and have a few comments that
touch upon this specific wording. Let's see. You have:

>The  futex()  system call provides a method for waiting until a certain
>condition becomes true.  It is typically used as a  blocking  construct
>in the context of shared-memory synchronization: The program implements
>the majority of the synchronization in user  space,  and  uses  one  of
>the operations of the system call when it is likely that it has to block
>for a longer time until the condition becomes true.  The  program  uses
>another  operation of the system call to wake anyone waiting for a par‐
>ticular condition.

I've rephrased the next paragraph. How about adding this to follow?

   A futex is in essence a 32-bit user-space address. All futex operations and
   conditions are governed by this variable, from now on referred to as 'futex
   word'. As such, a futex is identified by the address in shared memory, which
   may or may not be shared between different processes. For virtual memory, the
   kernel will internally handle and resolve the latter. This futex word, along
   with the value at its memory location, are supplied to the futex() system
   call.

Feel free to reword however you think is better.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-07-27 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> SEE ALSO
>get_robust_list(2), restart_syscall(2), futex(7)

For pi futexes, I also suggest pthread_mutexattr_getprotocol(3), which
is a common entry point.



Re: Next round: revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 07/27/2015 04:17 PM, Heinrich Schuchardt wrote:
> instruction. A thread maybe unable
> 
> to << missing word
> 
> acquire a lock because it is
> already acquired by another thread. It then may pass the lock's
> flag as futex word and the value representing the acquired state
> as the expected value to a futex() wait operation.

Thanks, Heinrich. Fixed.
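
For readers who want to see the pattern in that sentence spelled out, here is a deliberately simple sketch, assuming Linux, <linux/futex.h>, and the GCC/Clang __atomic builtins. The sketch_* names are hypothetical, and this is not how glibc implements its mutexes. The lock's flag is the futex word (0 = not acquired, 1 = acquired), and the "acquired" value is passed as the expected value to the wait operation.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Try to move the flag from 0 (not acquired) to 1 (acquired) entirely in
 * user space.  Returns nonzero on success. */
static int sketch_trylock(uint32_t *flag)
{
    uint32_t expected = 0;

    return __atomic_compare_exchange_n(flag, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static void sketch_lock(uint32_t *flag)
{
    while (!sketch_trylock(flag)) {
        /* The lock is held by someone else: pass the flag as the futex
         * word and 1 as the expected value.  The kernel blocks only if
         * the flag still reads 1, so a concurrent unlock is not missed. */
        syscall(SYS_futex, flag, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
    }
}

static void sketch_unlock(uint32_t *flag)
{
    __atomic_store_n(flag, 0, __ATOMIC_RELEASE);
    /* Always wake one waiter.  This costs a system call on every unlock;
     * real implementations track contention to avoid it. */
    syscall(SYS_futex, flag, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}
```

In the uncontended case sketch_lock() never enters the kernel; only a failed compare-and-exchange leads to a futex() wait.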

Cheers,

Michael






Next round: revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hello all,

From a draft sent out in March, I got a few useful comments that
I've now incorporated into this draft. And I got some complaints
from people who did not want to read groff source. My point
was that there are a bunch of FIXMEs in the page source that I
wanted people to look at... Anyway, this time, I will take
a different tack, interspersing the FIXMEs in a rendered 
version of the page. I'd greatly appreciate help with those FIXMEs.

The current page source can be found in a branch at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

===

As becomes quickly obvious upon reading it, the current futex(2) 
man page is in a sorry state, lacking many important details, and
also the various additions that have been made to the interface
over the last years. I've been working on revising it, first
of all based on input I got in response to a request for help
last year (http://thread.gmane.org/gmane.linux.kernel/1703405), 
especially taking Thomas Gleixner's input 
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) 
into account. I also got some further offlist input from Darren
Hart, Torvald Riegel, and Davidlohr Bueso that has been
incorporated into the revised draft. Other than that, I got
some useful info out of Ulrich Drepper's paper (cited at the
end of the page) and one or two web pages (cited in the page
source).

The page has now increased in size by a factor of about 5, but
is far from complete. In particular, as I reworked the page, 
there were many details that I was not 100% certain of, and I
have added FIXME markers to the page source. In addition,
Torvald added some text, and a few more FIXMEs. Some of
the FIXMEs are trivial, as in: I'd like confirmation that
I have correctly captured a technical detail. Others are more 
substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be
improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to
resolve as many of the FIXMEs (and any other problems that people
see) as possible in the existing text. I need help with that. 
(And I know that dealing with that help, if I get it, will in
itself be quite a task, which is why I have
been delaying it for many weeks now, as my time has been
rather limited recently.)

So, please take a look at the page below. At this point,
I would most especially appreciate help with the FIXMEs.

Cheers,

Michael



FUTEX(2)                  Linux Programmer's Manual                 FUTEX(2)

NAME
   futex - fast user-space locking

SYNOPSIS
   #include <linux/futex.h>
   #include <sys/time.h>

   int futex(int *uaddr, int futex_op, int val,
             const struct timespec *timeout,   /* or: uint32_t val2 */
             int *uaddr2, int val3);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The  futex()  system  call  provides a method for waiting until a
   certain condition becomes true.  It is typically used as a block‐
   ing  construct  in  the context of shared-memory synchronization:
   The program implements the majority  of  the  synchronization  in
   user  space,  and  uses  one of the operations of the system call
   when it is likely that it has to block for a  longer  time  until
   the  condition  becomes true.  The program uses another operation
   of the system call to wake anyone waiting for a particular condi‐
   tion.

   The  condition  is  represented  by  the  futex word, which is an
   address in memory supplied to the futex() system  call,  and  the
   32-bit  value  at  this  memory  location.   (While  the  virtual
   addresses for the same physical memory address in  separate  pro‐
   cesses  may be different, the same physical address may be shared
   by the processes using mmap(2).)

   When executing a futex operation that requests to block a thread,
   the  kernel  will block only if the futex word has the value that
   the calling thread supplied as expected value.  The load from the
   futex  word,  the  comparison  with  the  expected value, and the
   actual blocking will happen atomically and totally  ordered  with
   respect  to  concurrently  executing futex operations on the same
   futex word.  Thus, the futex word is used to connect the synchro‐
   nization in user space with the implementation of blocking by the
   kernel; similar to an atomic compare-and-exchange operation  that
   potentially  changes  shared  memory,  blocking via a futex is an
   atomic compare-and-block operation.

   One example use of futexes is implementing locks.  The  state  of
   the  lock  (i.e., acquired or not acquired) can be represented as
   an atomically accessed flag in shared memory.  In the uncontended
   case,  a  thread  can access or modify th

Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hi Peter,

On 03/28/2015 01:03 PM, Peter Zijlstra wrote:
> On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote:
>>FUTEX_WAIT (since Linux 2.6.0)
>>   This operation tests that the value at the futex word pointed to
>>   by the address uaddr still contains the expected value val, and
>>   if so, then sleeps awaiting FUTEX_WAKE on the futex word.  The
>>   load of the value of the futex word is an atomic memory access
>>   (i.e., using atomic machine instructions of the respective
>>   architecture).  This load, the comparison with the expected
>>   value, and starting to sleep are performed atomically and
>>   totally ordered with respect to other futex operations on the
>>   same futex word.  If the thread starts to sleep, it is considered
>>   a waiter on this futex word.  If the futex value does not
>>   match val, then the call fails immediately with the error
>>   EAGAIN.
>>
>>   The purpose of the comparison with the expected value is to prevent
>>   lost wake-ups: If another thread changed the value of the
>>   futex word after the calling thread decided to block based on
>>   the prior value, and if the other thread executed a FUTEX_WAKE
>>   operation (or similar wake-up) after the value change and before
>>   this FUTEX_WAIT operation, then the latter will observe the
>>   value change and will not start to sleep.
>>
>>   If the timeout argument is non-NULL, its contents specify a relative
>>   timeout for the wait, measured according to the
>>   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
>>   system clock granularity, and kernel scheduling delays mean that
>>   the blocking interval may overrun by a small amount.)  If timeout
>>   is NULL, the call blocks indefinitely.
> 
> Would it not be better to only state that the wait will not return
> before the timeout -- unless woken -- and not bother with clock
> granularity and scheduling delays?

Many of the pages that talk about system calls that have timeouts
carry similar language, since people are often confused about what a
timeout means (e.g., believing that it's an upper limit rather than a
minimum, or that it's precise to some very small granularity). Why do
you think the language here is a problem?

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 04/15/2015 12:28 PM, Torvald Riegel wrote:
> On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote:
>> On Sat, 28 Mar 2015, Peter Zijlstra wrote:
>>> On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
>>>> So, please take a look at the page below. At this point,
>>>> I would most especially appreciate help with the FIXMEs.
>>>
>>> For people who cannot read that troff gibberish (me)..
>>
>> Ditto :)
>>  
>>> NOTES
>>>Glibc does not provide a wrapper for this system call; call it using
>>>syscall(2).
>>
>> You might mention that pthread_mutex, pthread_condvar interfaces are
>> high level wrappers for the syscall and recommended to be used for
>> normal use cases. IIRC unnamed semaphores are implemented with futexes
>> as well.
> 
> If we add this, I'd rephrase it to something like that there are
> high-level programming abstractions such as the pthread_condvar
> interfaces or semaphores that are implemented using the syscall and that
> are typically a better fit for normal use cases.  I'd consider only the
> condvars as something like a wrapper, or targeting a similar use case.

I added this under NOTES:

   Various higher-level programming abstractions are implemented via
   futexes, including POSIX threads mutexes and condition variables,
   as well as POSIX semaphores.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hello Pavel,

On 04/27/2015 10:37 PM, Pavel Machek wrote:
> Hi!
> 
>>   The FUTEX_WAIT_OP operation is equivalent to execute the following
>>   code atomically and totally ordered with respect to other
>>   futex operations on any of the two supplied futex words:
> 
> "to executing"?

Yep. Fixed.

>>   The operation and comparison that are to be performed are
>>   encoded in the bits of the argument val3.  Pictorially, the
>>   encoding is:
>>
>>   +---+---+-----------+-----------+
>>   |op |cmp|   oparg   |  cmparg   |
>>   +---+---+-----------+-----------+
>>     4   4       12          12    <== # of bits
>>
> 
> :-)
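
As a cross-check of the diagram, the encoding can be sketched in a few lines of C. The function name encode_val3() is hypothetical. The operation and comparison constants come from <linux/futex.h>, which also provides a FUTEX_OP() macro for the same purpose, and names the operation that consumes this value FUTEX_WAKE_OP.

```c
#include <linux/futex.h>
#include <stdint.h>

/* Hypothetical helper that packs the four fields in the order shown in
 * the diagram, from the most significant bits down:
 *   op (4 bits) | cmp (4 bits) | oparg (12 bits) | cmparg (12 bits) */
static uint32_t encode_val3(uint32_t op, uint32_t cmp,
                            uint32_t oparg, uint32_t cmparg)
{
    return ((op     & 0xfu)   << 28) |
           ((cmp    & 0xfu)   << 24) |
           ((oparg  & 0xfffu) << 12) |
            (cmparg & 0xfffu);
}
```

For example, encode_val3(FUTEX_OP_ADD, FUTEX_OP_CMP_GT, 1, 0) describes "add 1 to the second futex word, and test whether its old value was greater than 0". Note that the FUTEX_OP() macro takes its arguments in a different order: op, oparg, cmp, cmparg.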
> 
>> RETURN VALUE
>>In the event of an error, all operations return -1 and set errno to
>>indicate the cause of the error.  The return value on success depends
>>on the operation, as described in the following list:
> 
> Did you say (at the beginning) that there is no glibc wrapper?

Yes, this could be clearer. I changed it to

RETURN VALUE
   In the event of an error (and assuming that futex()  was  invoked
   via  syscall(2)), all operations return -1 and set errno to indi‐
   cate the cause of the error.

>>EINVAL The operation in futex_op is one of those that employs a timeout,
>>   but the supplied timeout argument was invalid (tv_sec was
>>   less than zero, or tv_nsec was not less than 1000,000,000).
> 
> 1,000...?

Fixed.
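
To illustrate both points (the -1/errno convention when going through syscall(2), and the EINVAL case for a bad timeout), here is a minimal sketch assuming Linux and <linux/futex.h>. The helper name wait_errno() is made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: perform FUTEX_WAIT with the given timeout and
 * return the resulting errno value, or 0 if the call succeeded.
 * syscall(2) reports failure by returning -1 and setting errno. */
static int wait_errno(uint32_t *uaddr, uint32_t expected,
                      const struct timespec *timeout)
{
    long ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                       timeout, NULL, 0);

    return ret == -1 ? errno : 0;
}
```

Passing a timespec with tv_nsec equal to 1,000,000,000, or with a negative tv_sec, should make wait_errno() return EINVAL without blocking.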

Thanks for the comments!

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hi David,

On 03/31/2015 04:45 PM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:
> 
>>The condition is represented by the futex word, which is an address in
>>memory supplied to the futex() system call, and the value at this
>>memory location.  (While the virtual addresses for the same memory in
>>separate processes may not be equal, the kernel maps them internally so
>>that the same memory mapped in different locations will correspond for
>>futex() calls.)
>>
>>When executing a futex operation that requests to block a thread, the
>>kernel will only block if the futex word has the value that the calling
> 
> Given the use of "word", you should probably state right away that
> futexes are only 32bit.

So, I made the opening sentence here:

   The  condition  is  represented  by  the  futex word, which is an
   address in memory supplied to the futex() system  call,  and  the
   32-bit  value  at  this  memory  location. 

Okay?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
On 03/31/2015 03:48 AM, Rusty Russell wrote:
> "Michael Kerrisk (man-pages)"  writes:
>> When executing a futex operation that requests to block a thread,
>> the kernel will only block if the futex word has the value that the
>> calling thread supplied as expected value.
>> The load from the futex word, the comparison with
>> the expected value,
>> and the actual blocking will happen atomically and totally
>> ordered with respect to concurrently executing futex operations
>> on the same futex word,
>> such as operations that wake threads blocked on this futex word.
>> Thus, the futex word is used to connect the synchronization in user spac
> 
> Missing 'e' in "space".

Already fixed.

>> .\" FIXME Please confirm that the following is correct:
>> No guarantee is provided about which waiters are awoken
>> (e.g., a waiter with a higher scheduling priority is not guaranteed
>> to be awoken in preference to a waiter with a lower priority).
> 
> This is true.

Thanks! FIXME removed.

Cheers,

Michael



> I didn't read the rest, as that stuff was all written by others.
> Documenting them is pretty heroic; good job!
> 
> Thanks,
> Rusty.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-07-27 Thread Michael Kerrisk (man-pages)
Hi David,

On 03/31/2015 10:36 PM, Davidlohr Bueso wrote:
> On Sat, 2015-03-28 at 13:03 +0100, Peter Zijlstra wrote:
>>>   If the timeout argument is non-NULL, its contents specify a relative
>>>   timeout for the wait, measured according to the
>>>   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
>>>   system clock granularity, and kernel scheduling delays mean that
>>>   the blocking interval may overrun by a small amount.)  If timeout
>>>   is NULL, the call blocks indefinitely.
>>
>> Would it not be better to only state that the wait will not return
>> before the timeout -- unless woken -- and not bother with clock
>> granularity and scheduling delays?
> 
> Yeah, similarly we also have this:
> 
>  FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
>   This option bit can be employed with all futex  operations.   It
>   tells  the  kernel  that  the  futex  is process-private and not
>   shared with another process (i.e., it is  only  being  used  for
>   synchronization  between  threads  of  the  same process).  This
>   allows the kernel to choose the fast  path  for  validating  the
>   user-space address and avoids expensive VMA lookups, taking ref‐
>   erence counts on file backing store, and so on.
> 
> This to me reads a bit too much into the kernel (fastpath, refcnt,
> vmas). Why not just mention that it avoids overhead in the kernel or
> something? I don't recall any manpage mentioning such details, but I
> could be wrong. 

Thanks. Agreed. I changed this to

This allows the kernel to make some additional performance optimizations.

> In any case its a nit, the whole doc is pretty good and
> I hope you can merge it soon and then just increment ;)

I ran out of time and energy at a certain point. And also got a little
disheartened that I got more people complaining about groff markup
than actually looked at the FIXMEs in the page source :-).
I'll try to reboot the process.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Revised futex(2) man page for review

2015-04-27 Thread Pavel Machek
Hi!

>   The FUTEX_WAIT_OP operation is equivalent to execute the following
>   code atomically and totally ordered with respect to other
>   futex operations on any of the two supplied futex words:

"to executing"?

>   The  operation  and  comparison  that  are  to  be performed are
>   encoded in the bits of  the  argument  val3.   Pictorially,  the
>   encoding is:
> 
>   +---+---+-----------+-----------+
>   |op |cmp|   oparg   |  cmparg   |
>   +---+---+-----------+-----------+
>     4   4       12          12    <== # of bits
> 

:-)

> RETURN VALUE
>In the event of an error, all operations return -1  and  set  errno  to
>indicate  the  cause of the error.  The return value on success depends
>on the operation, as described in the following list:

Did you say (at the beginning) that there is no glibc wrapper?

>EINVAL The operation in futex_op is one of those that employs  a  time‐
>   out,  but  the supplied timeout argument was invalid (tv_sec was
>   less than zero, or tv_nsec was not less than 1000,000,000).

1,000...?
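
For illustration, the rule in that EINVAL entry could be checked in user
space along these lines. timeout_is_valid() is a hypothetical helper, and
the rejection of a negative tv_nsec is an assumption that goes beyond the
quoted text.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Return true if *ts would pass the check described in the EINVAL entry:
 * tv_sec must not be negative and tv_nsec must be less than
 * 1,000,000,000. Rejecting a negative tv_nsec is an extra assumption.
 * timeout_is_valid() is a hypothetical name, not an existing API.
 */
static bool timeout_is_valid(const struct timespec *ts)
{
    return ts->tv_sec >= 0 &&
           ts->tv_nsec >= 0 &&
           ts->tv_nsec < 1000000000L;
}
```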

> NOTES
>Glibc does not provide a wrapper for this system call;  call  it  using
>syscall(2).
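
A minimal sketch of what "call it using syscall(2)" tends to look like in
practice, assuming a Linux system where SYS_futex is defined and where
struct timespec matches the kernel's layout. sys_futex() is a hypothetical
wrapper name; glibc exports no such function.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Thin wrapper that forwards its arguments to the raw system call.
 * Returns -1 and sets errno on error, as syscall(2) does.
 * sys_futex() is a hypothetical name chosen for this example.
 */
static long sys_futex(uint32_t *uaddr, int futex_op, uint32_t val,
                      const struct timespec *timeout,
                      uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}
```

With this in place, sys_futex(&word, FUTEX_WAKE, 1, NULL, NULL, 0) wakes
at most one waiter, and sys_futex(&word, FUTEX_WAIT, 1, NULL, NULL, 0)
fails immediately with EAGAIN if word does not contain 1.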

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Revised futex(2) man page for review

2015-04-15 Thread Torvald Riegel
On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote:
> On Sat, 28 Mar 2015, Peter Zijlstra wrote:
> > On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> > > So, please take a look at the page below. At this point,
> > > I would most especially appreciate help with the FIXMEs.
> > 
> > For people who cannot read that troff gibberish (me)..
> 
> Ditto :)
>  
> > NOTES
> >Glibc does not provide a wrapper for this system call;  call  it  using
> >syscall(2).
> 
> You might mention that pthread_mutex, pthread_condvar interfaces are
> high level wrappers for the syscall and recommended to be used for
> normal use cases. IIRC unnamed semaphores are implemented with futexes
> as well.

If we add this, I'd rephrase it to something like that there are
high-level programming abstractions such as the pthread_condvar
interfaces or semaphores that are implemented using the syscall and that
are typically a better fit for normal use cases.  I'd consider only the
condvars as something like a wrapper, or targeting a similar use case.



Re: Revised futex(2) man page for review

2015-04-14 Thread Thomas Gleixner
On Sat, 28 Mar 2015, Peter Zijlstra wrote:
> On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> > So, please take a look at the page below. At this point,
> > I would most especially appreciate help with the FIXMEs.
> 
> For people who cannot read that troff gibberish (me)..

Ditto :)
 
> NOTES
>Glibc does not provide a wrapper for this system call;  call  it  using
>syscall(2).

You might mention that pthread_mutex, pthread_condvar interfaces are
high level wrappers for the syscall and recommended to be used for
normal use cases. IIRC unnamed semaphores are implemented with futexes
as well.

Thanks,

tglx


Re: Revised futex(2) man page for review

2015-03-31 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 13:03 +0100, Peter Zijlstra wrote:
> >   If the timeout argument is non-NULL, its contents specify a rel‐
> >   ative   timeout   for   the  wait,  measured  according  to  the
> >   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
> >   system clock granularity, and kernel scheduling delays mean that
> >   the blocking interval may overrun by a small amount.)  If  time‐
> >   out is NULL, the call blocks indefinitely.
> 
> Would it not be better to only state that the wait will not return
> before the timeout -- unless woken -- and not bother with clock
> granularity and scheduling delays?

Yeah, similarly we also have this:

 FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
  This option bit can be employed with all futex  operations.   It
  tells  the  kernel  that  the  futex  is process-private and not
  shared with another process (i.e., it is  only  being  used  for
  synchronization  between  threads  of  the  same process).  This
  allows the kernel to choose the fast  path  for  validating  the
  user-space address and avoids expensive VMA lookups, taking ref‐
  erence counts on file backing store, and so on.

This to me reads a bit too much into the kernel (fastpath, refcnt,
vmas). Why not just mention that it avoids overhead in the kernel or
something? I don't recall any manpage mentioning such details, but I
could be wrong. In any case it's a nit, the whole doc is pretty good and
I hope you can merge it soon and then just increment ;)
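
For readers who have not used the flag, here is a small sketch of how it
is combined with an operation. It assumes a Linux system where SYS_futex
is defined; futex_wake_private() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Wake up to nr_wake waiters on a futex word that is only shared between
 * threads of the calling process. The flag is OR-ed into the operation.
 * futex_wake_private() is a hypothetical name used for illustration.
 */
static long futex_wake_private(uint32_t *uaddr, int nr_wake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                   nr_wake, NULL, NULL, 0);
}
```

The return value is the number of waiters woken, so it is 0 when nobody
is blocked on the word.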

Thanks,
Davidlohr




Re: Revised futex(2) man page for review

2015-03-31 Thread Davidlohr Bueso
On Sat, 2015-03-28 at 12:47 +0100, Peter Zijlstra wrote:

>The condition is represented by the futex word, which is an address  in
>memory  supplied to the futex() system call, and the value at this mem‐
>ory location.  (While the virtual addresses for the same memory in sep‐
>arate  processes  may  not be equal, the kernel maps them internally so
>that the same memory mapped in different locations will correspond  for
>futex() calls.)
> 
>When  executing  a futex operation that requests to block a thread, the
>kernel will only block if the futex word has the value that the calling

Given the use of "word", you should probably state right away that
futexes are only 32bit.

Thanks,
Davidlohr



Re: Revised futex(2) man page for review

2015-03-30 Thread Rusty Russell
"Michael Kerrisk (man-pages)"  writes:
> When executing a futex operation that requests to block a thread,
> the kernel will only block if the futex word has the value that the
> calling thread supplied as expected value.
> The load from the futex word, the comparison with
> the expected value,
> and the actual blocking will happen atomically and totally
> ordered with respect to concurrently executing futex operations
> on the same futex word,
> such as operations that wake threads blocked on this futex word.
> Thus, the futex word is used to connect the synchronization in user spac

Missing 'e' in "space".

> .\" FIXME Please confirm that the following is correct:
> No guarantee is provided about which waiters are awoken
> (e.g., a waiter with a higher scheduling priority is not guaranteed
> to be awoken in preference to a waiter with a lower priority).

This is true.

I didn't read the rest, as that stuff was all written by others.
Documenting them is pretty heroic; good job!

Thanks,
Rusty.


Re: Revised futex(2) man page for review

2015-03-28 Thread Peter Zijlstra
On Sat, Mar 28, 2015 at 12:47:25PM +0100, Peter Zijlstra wrote:
>FUTEX_WAIT (since Linux 2.6.0)
>   This operation tests that the value at the futex word pointed to
>   by the address uaddr still contains the expected value val,  and
>   if  so,  then sleeps awaiting FUTEX_WAKE on the futex word.  The
>   load of the value of the futex word is an atomic  memory  access
>   (i.e.,  using  atomic  machine  instructions  of  the respective
>   architecture).  This load,  the  comparison  with  the  expected
>   value,  and  starting  to  sleep  are  performed  atomically and
>   totally ordered with respect to other futex  operations  on  the
>   same  futex  word.  If the thread starts to sleep, it is consid‐
>   ered a waiter on this futex word.  If the futex value  does  not
>   match  val,  then  the  call  fails  immediately  with the error
>   EAGAIN.
> 
>   The purpose of the comparison with the expected value is to pre‐
>   vent  lost  wake-ups: If another thread changed the value of the
>   futex word after the calling thread decided to  block  based  on
>   the  prior  value, and if the other thread executed a FUTEX_WAKE
>   operation (or similar wake-up) after the value change and before
>   this  FUTEX_WAIT  operation,  then  the  latter will observe the
>   value change and will not start to sleep.
> 
>   If the timeout argument is non-NULL, its contents specify a rel‐
>   ative   timeout   for   the  wait,  measured  according  to  the
>   CLOCK_MONOTONIC clock.  (This interval will be rounded up to the
>   system clock granularity, and kernel scheduling delays mean that
>   the blocking interval may overrun by a small amount.)  If  time‐
>   out is NULL, the call blocks indefinitely.

Would it not be better to only state that the wait will not return
before the timeout -- unless woken -- and not bother with clock
granularity and scheduling delays?
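
To make the "will not return before the timeout, unless woken" reading
concrete, here is a sketch that times a FUTEX_WAIT with a relative
timeout. It assumes a Linux system where SYS_futex is defined and struct
timespec matches the kernel's layout; timed_wait() and struct wait_result
are hypothetical names. A signal or a spurious wake-up could still end the
wait early, which this sketch does not handle.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Outcome of one timed wait (hypothetical type for this example). */
struct wait_result {
    long ret;           /* return value of the system call     */
    int err;            /* errno if ret == -1, otherwise 0     */
    int64_t elapsed_ns; /* time spent, per CLOCK_MONOTONIC     */
};

/*
 * Block on *uaddr for at most timeout_ms milliseconds, provided the word
 * still contains `expected`, and measure how long the call took.
 */
static struct wait_result timed_wait(uint32_t *uaddr, uint32_t expected,
                                     long timeout_ms)
{
    struct timespec rel = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };
    struct timespec t0, t1;
    struct wait_result r;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    r.ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &rel, NULL, 0);
    r.err = (r.ret == -1) ? errno : 0; /* save errno before other calls */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    r.elapsed_ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
                   (t1.tv_nsec - t0.tv_nsec);
    return r;
}
```

With no other thread issuing a wake-up, timed_wait(&word, 0, 20) on a word
holding 0 is expected to fail with ETIMEDOUT after at least 20 ms.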


Re: Revised futex(2) man page for review

2015-03-28 Thread Peter Zijlstra
On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

For people who cannot read that troff gibberish (me)..

---
FUTEX(2)   Linux Programmer's Manual  FUTEX(2)




NAME
   futex - fast user-space locking

SYNOPSIS
   #include <linux/futex.h>
   #include <sys/time.h>

   int futex(int *uaddr, int futex_op, int val,
             const struct timespec *timeout,   /* or: u32 val2 */
             int *uaddr2, int val3);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The  futex()  system call provides a method for waiting until a certain
   condition becomes true.  It is typically used as a  blocking  construct
   in the context of shared-memory synchronization: The program implements
   the majority of the synchronization in user  space,  and  uses  one  of
   the operations of the system call when it is likely that it has to block
   for a longer time until the condition becomes true.  The  program  uses
   another  operation of the system call to wake anyone waiting for a par‐
   ticular condition.

   The condition is represented by the futex word, which is an address  in
   memory  supplied to the futex() system call, and the value at this mem‐
   ory location.  (While the virtual addresses for the same memory in sep‐
   arate  processes  may  not be equal, the kernel maps them internally so
   that the same memory mapped in different locations will correspond  for
   futex() calls.)

   When  executing  a futex operation that requests to block a thread, the
   kernel will only block if the futex word has the value that the calling
   thread  supplied  as expected value.  The load from the futex word, the
   comparison with the expected value, and the actual blocking will happen
   atomically  and  totally ordered with respect to concurrently executing
   futex operations on the same futex word, such as operations  that  wake
   threads  blocked  on  this futex word.  Thus, the futex word is used to
   connect the synchronization in user spac  with  the  implementation  of
   blocking by the kernel; similar to an atomic compare-and-exchange oper‐
   ation that potentially changes shared memory, blocking via a  futex  is
   an atomic compare-and-block operation.  See NOTES for a detailed speci‐
   fication of the synchronization semantics.

   One example use of futexes is implementing locks.   The  state  of  the
   lock  (i.e.,  acquired or not acquired) can be represented as an atomi‐
   cally accessed flag in shared  memory.   In  the  uncontended  case,  a
   thread  can  access  or modify the lock state with atomic instructions,
   for example atomically changing it from not acquired to acquired  using
   an atomic compare-and-exchange instruction.  If a thread cannot acquire
   a lock because it is already acquired by another thread, it can request
   to block if and only if the lock is still acquired by using  the  lock's
   flag as futex word and expecting a value that represents  the  acquired
   state.   When  releasing the lock, a thread has to first reset the lock
   state to not acquired and then execute the futex operation  that  wakes
   one  thread blocked on the futex word that is the lock's flag (this can
   be further optimized to avoid unnecessary wake-ups).   See  futex(7)
   for more detail on how to use futexes.
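
The lock described in the preceding paragraph could be sketched roughly as
follows. This is the unoptimized variant that always issues a wake-up on
release. It assumes a Linux system where SYS_futex is defined and where
_Atomic uint32_t has the same size and representation as a plain 32-bit
integer; struct simple_lock and both functions are hypothetical names.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock flag used as the futex word: 0 = not acquired, 1 = acquired. */
struct simple_lock {
    _Atomic uint32_t state;
};

static void simple_lock_acquire(struct simple_lock *l)
{
    uint32_t expected = 0;

    /* Uncontended case: atomically change 0 -> 1 in user space. */
    while (!atomic_compare_exchange_strong(&l->state, &expected, 1)) {
        /*
         * Contended case: ask the kernel to block, but only if the word
         * still holds 1 (acquired). If it changed in the meantime the
         * call fails with EAGAIN and the loop simply retries.
         */
        syscall(SYS_futex, &l->state, FUTEX_WAIT, 1, NULL, NULL, 0);
        expected = 0;
    }
}

static void simple_lock_release(struct simple_lock *l)
{
    /* First reset the state to "not acquired" ... */
    atomic_store(&l->state, 0);
    /* ... then wake one thread blocked on the futex word, if any. */
    syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

A thread would bracket its critical section with simple_lock_acquire(&l)
and simple_lock_release(&l), with l initialized to { 0 }.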

   Besides  the basic wait and wake-up futex functionality, there are fur‐
   ther futex operations aimed at supporting more complex use cases.  Also
   note  that  no  explicit initialization or destruction are necessary to
   use futexes; the kernel maintains a futex  (i.e.,  the  kernel-internal
   implementation  artifact)  only  while  operations  such as FUTEX_WAIT,
   described below, are being performed on a particular futex word.

   Arguments
   The uaddr argument points to the futex word.  On all platforms, futexes
   are  four-byte  integers  that must be aligned on a four-byte boundary.
   The operation to perform on the futex  is  specified  in  the  futex_op
   argument; val is a value whose meaning and purpose depends on futex_op.

   The  remaining  arguments (timeout, uaddr2, and val3) are required only
   for certain of the futex operations  described  below.   Where  one  of
   these arguments is not required, it is ignored.

   For several blocking operations, the timeout argument is a pointer to a
   timespec structure that specifies a timeout for  the  operation.   How‐
   ever,   notwithstanding the prototype shown above, for some operations,
   this argument is instead a four-byte integer whose  meaning  is  deter‐
 

Re: Revised futex(2) man page for review

2015-03-28 Thread Michael Kerrisk (man-pages)
On 03/28/2015 09:53 AM, Michael Kerrisk (man-pages) wrote:
> Hello all,
[...]
> So, please take a look at the page below. At this point,
> I would most especially appreciate help with the FIXMEs.

One more point I should have added. The revised page
currently sits in a Git branch, here:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Revised futex(2) man page for review

2015-03-28 Thread Michael Kerrisk (man-pages)
Hello all,

As becomes quickly obvious upon reading it, the current futex(2) 
man page is in a sorry state, lacking many important details, and
also the various additions that have been made to the interface
over the last years. I've been working on revising it, first
of all based on input I got in response to a request for help
last year (http://thread.gmane.org/gmane.linux.kernel/1703405), 
especially taking Thomas Gleixner's input 
(http://thread.gmane.org/gmane.linux.kernel/1703405/focus=2952) 
into account. I also got some further offlist input from Darren
Hart, Torvald Riegel, and Davidlohr Bueso that has been
incorporated into the revised draft. Other than that, I got
some useful info out of Ulrich Drepper's paper (cited at the
end of the page) and one or two web pages (cited in the page
source).

The page has now increased in size by a factor of about 5, but
is far from complete. In particular, as I reworked the page, 
there were many details that I was not 100% certain of, and I
have added FIXME markers to the page source. In addition,
Torvald added some text, and a few more FIXMEs. Some of
the FIXMEs are trivial, as in: I'd like confirmation that
I have correctly captured a technical detail. Others are more 
substantial, probably requiring the addition of further text.

I appreciate that there are probably other things that can be
improved in the page. (Torvald and Darren have some ideas.)
However, before growing the page any further, I would like to
resolve as many of the FIXMEs (and any other problems that people
see) as possible in the existing text. I need help with that. 
(And I know that dealing with that help, if I get it, will in
itself be quite a task, which is why I have been delaying it
for many weeks now, as my time has been rather limited recently.)

So, please take a look at the page below. At this point,
I would most especially appreciate help with the FIXMEs.

Cheers,

Michael

=
.\" Page by b.hubert
.\" and Copyright (C) 2015, Thomas Gleixner 
.\" and Copyright (C) 2015, Michael Kerrisk 
.\"
.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
.\" may be freely modified and distributed
.\" %%%LICENSE_END
.\"
.\" Niki A. Rahimi (LTC Security Development, narah...@us.ibm.com)
.\" added ERRORS section.
.\"
.\" Modified 2004-06-17 mtk
.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
.\"
.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
.\"   2015-01-23:
.\"   http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
.\"
.\" FIXME Do we need add some text regarding Torvald Riegel's 2015-01-24 mail
.\"   at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
.\"
.TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual"
.SH NAME
futex \- fast user-space locking
.SH SYNOPSIS
.nf
.sp
.B "#include <linux/futex.h>"
.B "#include <sys/time.h>"
.sp
.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
.BI "  const struct timespec *" timeout , \
" \fR  /* or: \fBu32 \fIval2\fP */ 
.BI "  int *" uaddr2 ", int " val3 );
.fi

.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
.PP
The
.BR futex ()
system call provides a method for waiting until a certain condition becomes
true.
It is typically used as a blocking construct in the context of
shared-memory synchronization: The program implements the majority of
the synchronization in user space, and uses one of the operations of
the system call when it is likely that it has to block for
a longer time until the condition becomes true.
The program uses another operation of the system call to wake
anyone waiting for a particular condition.

The condition is represented by the futex word, which is an address
in memory supplied to the
.BR futex ()
system call, and the value at this memory location.
(While the virtual addresses for the same memory in separate
processes may not be equal,
the kernel maps them internally so that the same memory mapped
in different locations will correspond for
.BR futex ()
calls.)

When executing a futex operation that requests to block a thread,
the kernel will only block if the futex word has the value that the
calling thread supplied as expected value.
The load from the futex word, the comparison with
the expected value,
and the actual blocking will happen atomically and totally
ordered with respect to concurrently executing futex operations
on the same futex word,
such as operations that wake threads blocked on this futex word.
Thus, the futex word is used to connect the synchronization in user spac
with the implementation of blocking by the kernel; similar to an atomic
compare-and-exchange operation that potentially changes shared memory,
blocking via a futex is an atomic compare-and-block operation.
See NOTES for
a detailed specification of the synchronization semantics.

One example use of futexes is implementing locks.
The state of the lock (i.e.,
acquired or not acquired) can be represented as an atom