Re: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme

2017-01-13 Thread Torvald Riegel
On Thu, 2017-01-12 at 10:39 +0100, Torvald Riegel wrote:
> On Sat, 2016-12-24 at 17:01 +0100, Torvald Riegel wrote:
> > === Robust mutexes have bugs, in both glibc and the kernel
> > 
> > I've been reviewing the implementation of robust mutexes in both glibc
> > and the kernel support code recently because there are several bug
> > reports, for example:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1401665
> > https://sourceware.org/bugzilla/show_bug.cgi?id=19402
> > https://sourceware.org/bugzilla/show_bug.cgi?id=14485
> > 
> > This review revealed a bunch of bugs.  I have committed/proposed patches
> > that fix all glibc-only bugs that I am aware of:
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html

These patches are now all committed to glibc master.  Thus, you can
start fixing/testing now and have a high probability that you won't run
into glibc bugs.



Re: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme

2017-01-12 Thread Torvald Riegel
On Sat, 2016-12-24 at 17:01 +0100, Torvald Riegel wrote:
> TLDR:
> (1) The Linux kernel fails to wake futex waiters when a userspace thread
> that releases a mutex is killed between modifying the futex word and the
> subsequent FUTEX_WAKE syscall.
> (2) POSIX makes certain requirements regarding when destruction of a
> mutex is safe that are violated by the current glibc/kernel
> synchronization scheme, which can lead to corruption of userspace
> memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
> robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
> the mutex, try to detect this by interpreting (current position in)
> userspace code, or we must add a new futex recovery syscall if we want
> to preserve the userspace-only mutex-release fastpath.
> 
> [Please keep me in CC because I'm not subscribed to LKML.]
> 
> 
> === Robust mutexes have bugs, in both glibc and the kernel
> 
> I've been reviewing the implementation of robust mutexes in both glibc
> and the kernel support code recently because there are several bug
> reports, for example:
> https://bugzilla.redhat.com/show_bug.cgi?id=1401665
> https://sourceware.org/bugzilla/show_bug.cgi?id=19402
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
> 
> This review revealed a bunch of bugs.  I have committed/proposed patches
> that fix all glibc-only bugs that I am aware of:
> https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html
> These fixes are meant to be easy to backport, so they don't contain as
> much cleanup and explanation of the glibc code as would be ideal.
> 
> I plan to follow these up with more cleanup and adding of documentation,
> which should also describe the glibc/kernel synchronization scheme in
> more detail.  I believe I have found all the bugs in robust mutexes
> (when including the kernel-side bugs described in this email), but I'll
> make another review pass when preparing this cleanup+documentation
> patch.
> 
> 
> === Background
> 
> A brief reminder of how glibc and the kernel attempt to synchronize with
> each other:
> Each thread maintains a list of robust mutexes that it has acquired, and
> a single pointer to a futex word that it is about to acquire or release
> (called op_pending).  The list is a concurrent list (to some extent)
> because userspace can be interrupted by the post-crash cleanup code run
> by the kernel (ie, this is similar to how one would synchronize with a
> signal handler).
> 
> To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
> does the following:
> (L1)  atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
> (L2)  <"thread-safe" dequeue of mutex->futex from robust mutex list>
> (L3)  fw = atomic_exchange(&mutex->futex, 0, memory_order_release);
> (L4)  if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1);
> (L5)  atomic_store (&op_pending, NULL, memory_order_relaxed);
> (I'm not showing the signal fences required; for simplicity, assume that
> there is a signal fence after each line.)
> 
> In case of a robust PI mutex, L3 and L4 are done by the kernel in
> FUTEX_UNLOCK_PI.
> 
> 
> === Lost wakeups caused by the kernel
> 
> If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
> set, then userspace will have set the futex word to 0 but not yet have
> run FUTEX_WAKE.
> op_pending still points to the futex word, but handle_futex_death in
> kernel/futex.c only calls futex_wake if the TID bits in the futex word
> are equal to the TID of the killed thread.  The result of that is a lost
> wake-up for one of the threads that set or shared FUTEX_WAITERS.
> 
> This can be fixed by running futex_wake conservatively, even if the
> value of *op_pending does not match the TID of the killed thread.  This
> can lead to spurious wakeups (eg, if the thread was killed between L4
> and L5), but userspace code has to, in the general case, be prepared for
> spurious futex wake-ups anyway if using POSIX mutexes:
> https://lkml.org/lkml/2014/11/27/472
> Also, the same spurious wake-ups can happen under slightly different
> executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
> to timeouts on another thread that set it).
> 
> 
> === POSIX' mutex destruction requirements are violated
> 
> In a nutshell, the mutex destruction requirements mean that there must
> be a single atomic action that releases a mutex; after a mutex is
> released, the thread releasing the mutex must not access the mutex'
> memory anymore (but it is still allowed to run a FUTEX_WAKE on it).

Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme

2016-12-24 Thread Torvald Riegel
TLDR:
(1) The Linux kernel fails to wake futex waiters when a userspace thread
that releases a mutex is killed between modifying the futex word and the
subsequent FUTEX_WAKE syscall.
(2) POSIX makes certain requirements regarding when destruction of a
mutex is safe that are violated by the current glibc/kernel
synchronization scheme, which can lead to corruption of userspace
memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
the mutex, try to detect this by interpreting (current position in)
userspace code, or we must add a new futex recovery syscall if we want
to preserve the userspace-only mutex-release fastpath.

[Please keep me in CC because I'm not subscribed to LKML.]


=== Robust mutexes have bugs, in both glibc and the kernel

I've been reviewing the implementation of robust mutexes in both glibc
and the kernel support code recently because there are several bug
reports, for example:
https://bugzilla.redhat.com/show_bug.cgi?id=1401665
https://sourceware.org/bugzilla/show_bug.cgi?id=19402
https://sourceware.org/bugzilla/show_bug.cgi?id=14485

This review revealed a bunch of bugs.  I have committed/proposed patches
that fix all glibc-only bugs that I am aware of:
https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html
These fixes are meant to be easy to backport, so they don't contain as
much cleanup and explanation of the glibc code as would be ideal.

I plan to follow these up with more cleanup and adding of documentation,
which should also describe the glibc/kernel synchronization scheme in
more detail.  I believe I have found all the bugs in robust mutexes
(when including the kernel-side bugs described in this email), but I'll
make another review pass when preparing this cleanup+documentation
patch.


=== Background

A brief reminder of how glibc and the kernel attempt to synchronize with
each other:
Each thread maintains a list of robust mutexes that it has acquired, and
a single pointer to a futex word that it is about to acquire or release
(called op_pending).  The list is a concurrent list (to some extent)
because userspace can be interrupted by the post-crash cleanup code run
by the kernel (ie, this is similar to how one would synchronize with a
signal handler).

To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
does the following:
(L1)  atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
(L2)  <"thread-safe" dequeue of mutex->futex from robust mutex list>
(L3)  fw = atomic_exchange(&mutex->futex, 0, memory_order_release);
(L4)  if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1);
(L5)  atomic_store (&op_pending, NULL, memory_order_relaxed);
(I'm not showing the signal fences required; for simplicity, assume that
there is a signal fence after each line.)

In case of a robust PI mutex, L3 and L4 are done by the kernel in
FUTEX_UNLOCK_PI.


=== Lost wakeups caused by the kernel

If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
set, then userspace will have set the futex word to 0 but not yet have
run FUTEX_WAKE.
op_pending still points to the futex word, but handle_futex_death in
kernel/futex.c only calls futex_wake if the TID bits in the futex word
are equal to the TID of the killed thread.  The result of that is a lost
wake-up for one of the threads that set or shared FUTEX_WAITERS.

This can be fixed by running futex_wake conservatively, even if the
value of *op_pending does not match the TID of the killed thread.  This
can lead to spurious wakeups (eg, if the thread was killed between L4
and L5), but userspace code has to, in the general case, be prepared for
spurious futex wake-ups anyway if using POSIX mutexes:
https://lkml.org/lkml/2014/11/27/472
Also, the same spurious wake-ups can happen under slightly different
executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
to timeouts on another thread that set it).


=== POSIX' mutex destruction requirements are violated

In a nutshell, the mutex destruction requirements mean that there must
be a single atomic action that releases a mutex; after a mutex is
released, the thread releasing the mutex must not access the mutex'
memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see
https://lkml.org/lkml/2014/11/27/472 for additional background.

Currently, the kernel can potentially modify the mutex' memory until L5
has executed.  This means that getting killed while releasing a robust
mutex is not guaranteed to be a single atomic action.

For example, assume that Thread1 has acquired the mutex and releases it,
but is killed right before L5; if the program is built in such a way
that only Thread2 will acquire the mutex after Thread1 has released it,
then POSIX allows Thread2 to acquire the mutex, destroy it, and reuse its
memory, even though the kernel may still modify that memory when cleaning
up after the killed Thread1.

Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

2016-05-02 Thread Torvald Riegel
On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:
> On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
>  wrote:
> > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> >  wrote:
> >>
> >> hash_long/ptr really shouldn't care about the low bits being zero at
> >> all, because it should mix in all the bits (using a prime multiplier
> >> and taking the high bits).
> >
> > Looking at this assertion, it doesn't actually pan out very well at all.
> >
> > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
> >
> > The 64-bit hash seems to be quite horribly bad with lots of values.
> 
> Ok, I have tried to research this a bit more. There's a lot of
> confusion here that causes the fact that hash_64() sucks donkey balls.
> 
> The basic method for the hashing method by multiplication is fairly
> sane. It's well-documented, and attributed to Knuth as the comment
> above it says.

Section 2.2 of this article might also be of interest:

M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A Reliable
Randomized Algorithm for the Closest-Pair Problem. Journal of Algorithms,
25(1):19–51, 1997.

I don't know much about hash functions, but my understanding of this is
that one can do good hashing of words by multiplying with a random
number and using the most-significant bits of the lower-significant word
of the result.  Different random numbers may work better than others for
the input data (and some might be really awful), but the paper seems to
claim that one *can* find a good random number for all input data.  In
practice, this might mean that re-hashing with a different random number
after seeing too many conflicts in a hash table should eventually lead
to a good hashing (ie, because the *class* of such hash functions is
universal).



Re: [RFC patch 4/7] futex: Add support for attached futexes

2016-04-05 Thread Torvald Riegel
On Sun, 2016-04-03 at 06:30 -0500, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar  wrote:
> >
> > So an ABI distinction and offloading the decision to every single
> > application that wants to use it and hardcode it into actual application
> > source code via an ABI is pretty much the _WORST_ way to go about it
> > IMHO...
> >
> > So how about this: don't add any ABI details, but make futexes 
> > auto-attached on
> > NUMA systems (and obviously PREEMPT_RT systems)?
> 
> I agree.
> 
> Do *not* make this a visible new ABI.

From a glibc perspective, I agree that this shouldn't require an
extension of the ABI unless it's really the only possible way to solve
this.

For "special" mutex kinds such as PI mutexes, the change in the
interface might be justifiable -- but for ordinary mutexes, there's no
good place to add the attach/detach calls in each thread: An
implementation of, say, C11 mutexes cannot easily estimate whether it
should use attached futexes, and it would have to track whether a
particular mutex has been attached to by the current thread; this might
just move the overhead of tracking and caching associations from the
kernel to userspace.

> You will find that people will make exactly the wrong choices - either
> not using it (because the futex is deep in a standard library!) when
> they want to, or using it when they shouldn't (because the futex is
> deep in a standard library, and the library writer knows *his* code is
> so important that it should get a special faster futex).

There are cases where userspace might know that it should use a
"special" futex.  Consider an MCS-lock implementation in glibc by which
all pthreads, C, and C++ mutexes are backed: the lock nodes that threads
would be spinning on would be per-thread and exist for the whole
lifetime of the thread; attach and detach would be simple to do, and
there would be a limited number of these in the system.
A Java VM's park/unpark implementation might be another candidate.
However, these use cases are pretty specific (eg, there's only a single
waiter), so any kernel support for these might be more special too.



Re: futex(3) man page, final draft for pre-release review

2015-12-18 Thread Torvald Riegel
On Tue, 2015-12-15 at 14:41 -0800, Davidlohr Bueso wrote:
> On Tue, 15 Dec 2015, Michael Kerrisk (man-pages) wrote:
> 
> >   When executing a futex operation that requests to block a thread,
> >   the kernel will block only if the futex word has the  value  that
> >   the  calling  thread  supplied  (as  one  of the arguments of the
> >   futex() call) as the expected value of the futex word.  The load‐
> >   ing  of the futex word's value, the comparison of that value with
> >   the expected value, and the actual blocking  will  happen  atomi‐
> >
> >FIXME: for next line, it would be good to have an explanation of
> >"totally ordered" somewhere around here.
> >
> >   cally  and totally ordered with respect to concurrently executing
> >   futex operations on the same futex word.
> 
> So there are two things here regarding ordering. One is the most obvious
> which is ordered due to the taking/dropping the hb spinlock.

I suppose that this means what is described in the manpage already?
That is, that futex operations (ie, the syscalls) are atomic wrt each
other and in a strict total order?

> Secondly, its
> the cases which Peter brought up a while ago that involves atomic futex ops
> futex_atomic_*(), which do not have clearly defined semantics, and you get
> inconsistencies with certain archs (tile being the worst iirc).

OK.  So, from a user's POV, this is about the semantics of the kernel's
accesses to the futex word.  I agree that specifying this more clearly
would be helpful.

First, there are the comparisons of the futex words used in, for
example, FUTEX_WAIT.  They should use an atomic load within the
conceptual critical sections that make up futex operations.  This load
itself doesn't need to establish any ordering, so it can be equivalent
to a C11 memory_order_relaxed load.  Are there any objections to that?

Second, We have the write accesses in FUTEX_[TRY]LOCK_PI and
FUTEX_UNLOCK_PI.  We already specify those as atomic and within the
conceptual critical sections of the futex operation.  In addition, they
should establish ordering themselves, so C11 have memory_order_acquire /
memory_order_release semantics.  Specifying this would be good.  Any
objections to these semantics?

Third, we have the atomic read-modify-write operation that is part of
FUTEX_WAKE_OP (ie, AFAIU, the case you pointed at specifically).  I
don't have a strong opinion on what it should be, because I think
userspace can enforce the orderings it needs on its own (eg, if I
interpret Peter Zijlstra's example correctly, userspace can add
appropriate fences before the CPU0/futex_unlock and after the
CPU2/futex_load calls).  FUTEX_WAKE_OP accesses no other userspace
memory location, so there's no ordering relation to other accesses to
userspace memory that userspace cannot affect.
OTOH, legacy userspace may have assumed strong semantics, so making the
read-modify-write have memory_order_seq_cst semantics is probably a safe
bet.  Futex operations typically shouldn't be on the fast paths anyway.

> But anyway, the important thing users need to know about is that the atomic
> futex operation must be totally ordered wrt any other user tasks that are 
> trying
> to access that address.

I'm not sure what you mean precisely.  One can't order the whole futex
operations totally wrt memory accesses by userspace because they'd need
to synchronize to do that, and thus userspace would have to either hook
into the kernel's synchronization or use HTM or such.

> This is not necessarily the case for kernel ops. Peter
> illustrates this nicely with lock stealing example; 
> (see https://lkml.org/lkml/2015/8/26/596).
> 
> Internally, I believe we decided that making it fully ordered (as opposed to
> making use of implicit barriers for ACQUIRE/RELEASE), so you'd endup having
> an MB ll/sc MB kind of setup.

OK.  So, any objections to documenting that the read-modify-write op in
FUTEX_WAKE_OP has memory_order_seq_cst semantics?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: futex(3) man page, final draft for pre-release review

2015-12-18 Thread Torvald Riegel
On Tue, 2015-12-15 at 13:18 -0800, Darren Hart wrote:
> On Tue, Dec 15, 2015 at 02:43:50PM +0100, Michael Kerrisk (man-pages) wrote:
> > 
> >When executing a futex operation that requests to block a thread,
> >the kernel will block only if the futex word has the  value  that
> >the  calling  thread  supplied  (as  one  of the arguments of the
> >futex() call) as the expected value of the futex word.  The load‐
> >ing  of the futex word's value, the comparison of that value with
> >the expected value, and the actual blocking  will  happen  atomi‐
> > 
> > FIXME: for next line, it would be good to have an explanation of
> > "totally ordered" somewhere around here.
> > 
> >cally  and totally ordered with respect to concurrently executing
> 
> Totally ordered with respect to futex operations refers to semantics of the
> ACQUIRE/RELEASE operations and how they impact ordering of memory reads and
> writes. The kernel futex operations are protected by spinlocks, which ensure
> that all operations are serialized with respect to one another.
> 
> This is a lot to attempt to define in this document. Perhaps a reference to
> linux/Documentation/memory-barriers.txt as a footnote would be sufficient? Or
> perhaps for this manual, "serialized" would be sufficient, with a footnote
> regarding "totally ordered" and a pointer to the memory-barrier documentation?

I'd strongly prefer to document the semantics for users here.  And I
don't think users use the kernel's memory model -- instead, if we assume
that most users will call futex ops from C or C++, then the best we have
is the C11 / C++11 memory model.  Therefore, if we want to expand that,
we should specify semantics in terms of as-if equivalence to C11 pseudo
code.  I had proposed that in the past but, IIRC, Michael didn't want to
add a C11 "dependency" in the semantics back then, at least for the
initial release.

Here's what I wrote back then (atomic_*_relaxed() is like C11
atomic_*(..., memory_order_relaxed), lock/unlock have normal C11 mutex
semantics):



For example, we could say that futex_wait is, in terms of
synchronization semantics, *as if* we'd execute a piece of C11 code.
Here's a part of the docs for a glibc-internal futex wrapper that I'm
working on; this is futex_wait ... :

/* Atomically wrt other futex operations, this blocks iff the value at
   *FUTEX matches the expected value.  This is semantically equivalent to: 
 l = <lock associated with futex word> (FUTEX);
 wait_flag = <wait flag associated with futex word> (FUTEX);
 lock (l);
 val = atomic_load_relaxed (FUTEX);
 if (val != expected) { unlock (l); return EAGAIN; }
 atomic_store_relaxed (wait_flag, 1);
 unlock (l);
 // Now block; can time out in futex_time_wait (see below)
 while (atomic_load_relaxed(wait_flag));

   Note that no guarantee of a happens-before relation between a woken
   futex_wait and a futex_wake is documented; however, this does not matter
   in practice because we have to consider spurious wake-ups (see below),
   and thus would not be able to reason which futex_wake woke us anyway.


... and this is futex_wake:

/* Atomically wrt other futex operations, this unblocks the specified
   number of processes, or all processes blocked on this futex if there are
   fewer than the specified number.  Semantically, this is equivalent to:
 l = <lock associated with futex word> (futex);
 lock (l);
 for (res = 0; processes_to_wake > 0; processes_to_wake--, res++) {
   if (<no process blocked on futex>) break;
   wf = <wait flag of next blocked process> (futex);
   // No happens-before guarantee with woken futex_wait (see above)
   atomic_store_relaxed (wf, 0);
 }
 return res;

This allows a programmer to really infer the guarantees he/she can get
from a futex in terms of synchronization, without the docs having to use
prose to describe that.  This should also not constrain the kernel in
terms of how to implement it, because it is a conceptual as-if relation
(e.g., the kernel won't spin-wait the whole time, and we might want to
make this clear for the PI case).

Of course, there are several as-if representations we could use, and we
might want to be a bit more pseudo-code-ish to make this also easy to
understand for people not familiar with C11 (e.g., using mutex + condvar
with some relaxation of condvar guarantees).

=

I will go through the discussion pointed out by Davidlohr next.



Re: futex(3) man page, final draft for pre-release review

2015-12-18 Thread Torvald Riegel
On Wed, 2015-12-16 at 16:54 +0100, Michael Kerrisk (man-pages) wrote:
> Hello Darren,
> 
> On 12/15/2015 10:18 PM, Darren Hart wrote:
> > On Tue, Dec 15, 2015 at 02:43:50PM +0100, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>When executing a futex operation that requests to block a thread,
> >>the kernel will block only if the futex word has the  value  that
> >>the  calling  thread  supplied  (as  one  of the arguments of the
> >>futex() call) as the expected value of the futex word.  The load‐
> >>ing  of the futex word's value, the comparison of that value with
> >>the expected value, and the actual blocking  will  happen  atomi‐
> >>
> >> FIXME: for next line, it would be good to have an explanation of
> >> "totally ordered" somewhere around here.
> >>
> >>cally  and totally ordered with respect to concurrently executing
> > 
> > Totally ordered with respect to futex operations refers to semantics of the
> > ACQUIRE/RELEASE operations and how they impact ordering of memory reads and
> > writes. The kernel futex operations are protected by spinlocks, which ensure
> > that all operations are serialized with respect to one another.
> > 
> > This is a lot to attempt to define in this document. Perhaps a reference to
> > linux/Documentation/memory-barriers.txt as a footnote would be sufficient?
> > Or perhaps for this manual, "serialized" would be sufficient, with a
> > footnote regarding "totally ordered" and a pointer to the memory-barrier
> > documentation?
> 
> I think I'll just settle for writing serialized in the man page, and be 
> done with it :-).

I'd prefer if you'd not just use "serialized" :)  Eventually, I'd prefer
if we can explain the semantics for the user in terms of the terminology
and semantics of the memory model of the programming language that users
will likely use to call futex ops (ie, C11 / C++11).



Re: futex(3) man page, final draft for pre-release review

2015-12-15 Thread Torvald Riegel
On Tue, 2015-12-15 at 14:43 +0100, Michael Kerrisk (man-pages) wrote:
> Hello all,
> 
> After much too long a time, the revised futex man page *will*
> go out in the next man pages release (it has been merged
> into master).
> 
> There are various places where the page could still be improved,
> but it is much better (and more than 5 times longer) than the
> existing page.

This looks good to me; I just saw minor things (see below).  Thank you
for all the work you put into this (and to everybody who contributed)!

>When executing a futex operation that requests to block a thread,
>the kernel will block only if the futex word has the  value  that
>the  calling  thread  supplied  (as  one  of the arguments of the
>futex() call) as the expected value of the futex word.  The load‐
>ing  of the futex word's value, the comparison of that value with
>the expected value, and the actual blocking  will  happen  atomi‐
> 
> FIXME: for next line, it would be good to have an explanation of
> "totally ordered" somewhere around here.
> 
>cally  and totally ordered with respect to concurrently executing
>futex operations on the same futex word.  Thus, the futex word is
>used to connect the synchronization in user space with the imple‐
>mentation of blocking by the kernel.  Analogously  to  an  atomic
>compare-and-exchange  operation  that  potentially changes shared
>memory, blocking via a futex is an atomic compare-and-block oper‐
>ation.

Maybe -- should we just say that it refers to the mathematical notion of
a total order (or, technically, a strict total order in this case)?
Though I would hope that everyone using futexes is roughly aware of the
differences between partial and total orders.

>FUTEX_TRYLOCK_PI (since Linux 2.6.18)
>   This operation tries to acquire the futex at uaddr.  It is

s/futex/lock/ to make it consistent with FUTEX_LOCK_PI.

>   invoked when a user-space atomic acquire did  not  succeed
>   because the futex word was not 0.
> 
> 
> FIXME(Next sentence) The wording "The trylock in kernel" below 
> needs clarification. Suggestions?
> 
>   The trylock in kernel might succeed because the futex word
>   contains stale state (FUTEX_WAITERS and/or
>   FUTEX_OWNER_DIED).   This can happen when the owner of the
>   futex died.  User space cannot handle this condition in  a
>   race-free  manner,  but  the  kernel  can  fix this up and
>   acquire the futex.
> 
>   The uaddr2, val, timeout, and val3 arguments are ignored.

What about "The acquisition of the lock might succeed if performed by the
kernel in cases when the futex word contains stale state...".

>FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
>   Wait  on  a  non-PI  futex  at  uaddr  and  potentially be
>   requeued (via a FUTEX_CMP_REQUEUE_PI operation in  another
>   task)  onto  a  PI futex at uaddr2.  The wait operation on
>   uaddr is the same as for FUTEX_WAIT.
> 
>   The waiter can be removed from the wait on  uaddr  without
>   requeueing on uaddr2 via a FUTEX_WAKE operation in another
>   task.  In this case, the  FUTEX_WAIT_REQUEUE_PI  operation
>   returns with the error EWOULDBLOCK.

This should be EAGAIN, I suppose, or the enumeration of errors should
include EWOULDBLOCK.

Torvald



Re: [c++std-parallel-1641] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Torvald Riegel
On Thu, 2015-05-21 at 13:42 -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
>  wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
> 
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).

It must not speculate volatile accesses.  It could speculate
non-volatiles even if those are atomic and observable by other threads
but that would require further work/checks on all potential observers of
those (ie, to still satisfy as-if).  Thus, compilers are unlikely to do
such speculation, I'd say.

The as-if rule (ie, equality of observable behavior (ie, volatiles, ...)
to the abstract machine) makes all this clear.

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers. Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms. Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.

I believe the issues around the existing specification of mo_consume
were pointed out by compiler folks.  It's a complex problem, and I'm
all for more explanations, but I did get the impression that the
compiler writers in ISO C++ Study Group 1 do have a good understanding
of the problem.

> I personally think the whole "abstract machine" model of the C
> language is a mistake. It would be much better to talk about things in
> terms of actual code generation and actual issues. Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.

As someone working for a toolchain team, I don't see how the
abstract-machine-based specification is a problem at all, nor have I
seen compiler writers struggling with it.  It does give precise rules
for code generation.

The level of abstraction is a good thing for most programs because for
those, the primary concern is that the observable behavior and end
result are produced -- how that happens is secondary and a QoI matter.  In
contrast, if you specify on the level of code generation, you'd have to
foresee how code generation might look in the future, including predict
future optimizations and all that.  That doesn't look future-proof to
me.

I do realize that this may be less than ideal for cases when one would
want to use a C compiler more like a convenient assembler.  But that
case isn't the 99%, I believe.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [c++std-parallel-1611] Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Torvald Riegel
On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote: 
>   http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

I have been discussing Section 7.9 with Paul during last week.

While I think that 7.9 helps narrow down the problem somewhat, I'm still
concerned that it effectively requires compilers to either track
dependencies or conservatively prevent optimizations like value
speculation and specialization based on that.  Neither is a good thing
for a compiler.


7.9 adds requirements that dependency chains stop if the program itself
informs the compiler about the value of something in the dependency
chain (e.g., as shown in Figure 33).  Also, if a correct program that
does not have undefined behavior must use a particular value, this is
also considered as "informing" the compiler about that value.  For
example:
  int arr[2];
  int* x = foo.load(mo_consume);
  if (x > arr)     // implies same object/array, so x is in arr[]
    int r1 = *x;   // compiler knows x == arr + 1
The program, assuming no undefined behavior, first tells the compiler
that x should be within arr, and then the comparison tells the compiler
that x!=arr, so x==arr+1 must hold because there are just two elements
in the array.

Having these requirements is generally good, but we don't yet know how
to specify this properly.  For example, I suppose we'd need to also say
that the compiler cannot assume to know anything about a value returned
from an mo_consume load; otherwise, nothing prevents a compiler from
using knowledge about the stores that the mo_consume load can read from
(see Section 7.2).

Also, a compiler is still required to not do value-speculation or
optimizations based on that.  For example, this program:

void op(type *p)
{
  foo /= p->a;
  bar = p->b;
}
void bar()
{
  pointer = ppp.load(mo_consume);
  op(pointer);
}

... must not be transformed into this program, even if the compiler
knows that global_var->a == 1:

void op(type *p) { /* unchanged */ }
void bar()
{
  pointer = ppp.load(mo_consume);
  if (pointer != global_var) {
    op(pointer);
  } else { // specialization for global_var
    // compiler knows global_var->a == 1;
    // compiler uses global_var directly, inlines, optimizes:
    bar = global_var->b;
  }
}

The compiler could optimize out the division if pointer==global_var, but
it must not access field b directly through global_var.  This would be
pretty awkward; the compiler may work based on an optimized expression
in the specialization (ie, create code that assumes global_var instead
of pointer), but it would still have to carry around and use the
non-optimized expression.


This wouldn't be as bad if it were easily constrained to code sequences
that really need the dependencies.  However, 7.9 does not effectively
contain dependencies to only the code that really needs them, IMO.
Unless the compiler proves otherwise, it has to assume that a load from
a pointer carries a dependency.  Proving that is often very hard because
it requires points-to analysis; 7.9 restricts this to intra-thread
analysis but that is still nontrivial.
Michael Matz had a similar concern (in terms of what it results in).


Given that mo_consume is useful but a very specialized feature, I
wouldn't be optimistic that 7.9 would actually be supported by many
compilers.  The trade-off between having to track dependencies or having
to disallow optimizations is a bad one to make.  The simple way out for
a compiler would be to just emit mo_acquire instead of mo_consume and be
done with it all -- and this might be the most practical decision overall,
or the default general-purpose implementation.  At least I haven't heard
any compiler implementer say that they think it's obviously worth
implementing.

I also don't think 7.9 is ready for ISO standardization yet (or any of
the other alternatives mentioned in the paper).  Standardizing a feature
that we're not sure whether it will actually be implemented is not a
good thing to do; it's too costly for all involved parties (compiler
writers *and* users).


IMO, the approach outlined in Section 7.7 is still the most promising
contender in the long run.  It avoids the perhaps more pervasive changes
that a type-system-based approach as the one in Section 7.2 might result
in, yet still informs the compiler where dependencies are actually used
and which chain of expressions would be involved in that.  Tracking is
probably simplified, as dependencies are never open-ended and
potentially leaking into various other regions of code.  It seems easier
to specify in a standard because we just need the programmer to annotate
the intent and the rest is compiler QoI.  It would require users to
annotate their use of dependencies, but they don't need to follow
further rules; performance tuning of the code so it actually makes use
of dependencies is mostly a compiler QoI thing, and if the compiler
can't maintain a dependency, it can issue warnings and thus make the
tuning interactive for the user.


Re: Revised futex(2) man page for review

2015-04-15 Thread Torvald Riegel
On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote:
> On Sat, 28 Mar 2015, Peter Zijlstra wrote:
> > On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> > > So, please take a look at the page below. At this point,
> > > I would most especially appreciate help with the FIXMEs.
> > 
> > For people who cannot read that troff gibberish (me)..
> 
> Ditto :)
>  
> > NOTES
> >Glibc does not provide a wrapper for this system call;  call  it  
> > using
> >syscall(2).
> 
> You might mention that pthread_mutex, pthread_condvar interfaces are
> high level wrappers for the syscall and recommended to be used for
> normal use cases. IIRC unnamed semaphores are implemented with futexes
> as well.

If we add this, I'd rephrase it to say something like: there are
high-level programming abstractions, such as the pthread condvar
interfaces or semaphores, that are implemented using the syscall and
that are typically a better fit for normal use cases.  I'd consider only
the condvars as something like a wrapper, or as targeting a similar use
case.
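For the man page's NOTES section, the raw usage being described can be sketched as follows (a hedged sketch: the constants and the syscall(2) calling convention are the standard Linux ones from <linux/futex.h>, the wrapper name is mine):

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static uint32_t futex_word = 0;

/* Glibc provides no futex() wrapper, so raw use goes through syscall(2).
   Wake up to 'n' waiters blocked on 'uaddr'; the kernel returns the
   number of waiters actually woken (0 if nobody is blocked). */
static long futex_wake(uint32_t *uaddr, int n)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, n, NULL, NULL, 0);
}
```

In contrast, the pthread abstractions hide both this syscall and the userspace fast path.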



Re: futex(2) man page update help request

2015-01-24 Thread Torvald Riegel
On Sat, 2015-01-24 at 12:35 +0100, Thomas Gleixner wrote:
> So we should never see -EINTR in the case of a spurious wakeup here.
> 
> But, here is the not so good news:
> 
>  I did some archaeology. The restart handling of futex_wait() got
>  introduced in kernel 2.6.22, so anything older than that will have
>  the spurious -EINTR issues.
> 
> futex_wait_pi() always had the restart handling and glibc folks back
> then (2006) requested that it should never return -EINTR, so it
> unconditionally restarts the syscall whether a signal had been
> delivered or not.
> 
> So kernels >= 2.6.22 should never return -EINTR spuriously. If that
> happens it's a bug and needs to be fixed.

Thanks for looking into this.

Michael, can you include the above in the documentation please?  This is
useful for userspace code like glibc that expects a minimum kernel
version.  Thanks!
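For illustration, a waiter that wants the "EINTR only on a real signal" behavior on kernels >= 2.6.22 typically still restarts defensively; a minimal sketch (wrapper name and return convention are mine):

```c
#include <errno.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static uint32_t demo_word = 1;  /* futex word for the demo below */

/* Block until woken while *uaddr == val, restarting the syscall on
   EINTR so callers never observe a (pre-2.6.22, possibly spurious)
   EINTR.  Returns 0 on wake-up, or the errno value otherwise
   (EAGAIN if *uaddr != val on entry). */
static int futex_wait_restart(uint32_t *uaddr, uint32_t val)
{
    long s;
    do {
        s = syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
    } while (s == -1 && errno == EINTR);
    return s == 0 ? 0 : errno;
}
```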



Re: futex(2) man page update help request

2015-01-24 Thread Torvald Riegel
On Sat, 2015-01-24 at 11:05 +0100, Thomas Gleixner wrote:
> On Fri, 23 Jan 2015, Torvald Riegel wrote:
> 
> > On Fri, 2015-01-16 at 16:46 -0800, Darren Hart wrote:
> > > On 1/16/15, 12:54 PM, "Michael Kerrisk (man-pages)"
> > >  wrote:
> > > 
> > > >Color me stupid, but I can't see this in futex_requeue(). Where is that
> > > >check that is "independent of the requeue type (normal/pi)"?
> > > >
> > > >When I look through futex_requeue(), all the likely looking sources
> > > >of EINVAL are governed by a check on the 'requeue_pi' argument.
> > > 
> > > 
> > > Right, in the non-PI case, I believe there are valid use cases: move to
> > > the back of the FIFO, for example (OK, maybe the only example?).
> > 
> > But we never guarantee a futex is a FIFO, or do we?  If we don't, then
> > such a requeue could be implemented as a no-op by the kernel, which
> > would sort of invalidate the use case.
> > 
> > (And I guess we don't want to guarantee FIFO behavior for futexes.)
> 
> The (current) behaviour is:
> 
> real-time threads:   FIFO per priority level
> sched-other threads: FIFO independent of nice level
> 
> The wakeup is priority ordered. Highest priority level first.

OK.

But, just to be clear, do I correctly understand that you do not want to
guarantee FIFO behavior in the specified futex semantics?  I think there
are cases where being able to *rely* on FIFO (now and on all future
kernels) would be helpful for users (e.g., on POSIX/C++11 condvars and I
assume in other ordered-wakeup cases too) -- but at the same time, this
would constrain future futex implementations.



Re: futex(2) man page update help request

2015-01-23 Thread Torvald Riegel
On Thu, 2015-01-15 at 16:10 +0100, Michael Kerrisk (man-pages) wrote:
> [Adding a few people to CC that have expressed interest in the 
> progress of the updates of this page, or who may be able to
> provide review feedback. Eventually, you'll all get CCed on
> the new draft of the page.]
> 
> Hello Thomas,
> 
> On 05/15/2014 04:14 PM, Thomas Gleixner wrote:
> > On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:
> >> And that universe would love to have your documentation of 
> >> FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),
> > 
> > I give you almost the full treatment, but I leave REQUEUE_PI to
> > Darren and FUTEX_WAKE_OP to Jakub. :)
> 
> Thank you for the great effort you put into compiling the
> text below, and apologies for my long delay in following up.
> 
> I've integrated almost all of your suggestions into the 
> manual page. I will shortly send out a new draft of the
> page that contains various FIXMEs for points that remain 
> unclear.

Michael, thanks for working on the draft!  I'll review the draft closely
once you've sent it (or have I missed it?).

There are a few things that I'd like to see covered.

First, we should discuss that users, unless they control all code in the
respective process, need to expect futexes to be affected by spurious
futex_wake calls; see https://lkml.org/lkml/2014/11/27/472 for
background and Linus' choice (AFAIU) to just document this.
So, for example, a return code of 0 for FUTEX_WAIT can mean either being
woken up by a FUTEX_WAKE intended for this futex, or a stale one
intended for another futex used by, for example, glibc internally.
(Note that as explained in this thread, this isn't just a glibc
artifact, but a result of the general futex design mixed with
destruction requirements for mutexes and other constructs in C++11 and
POSIX.)
It might also be necessary to further consider this when documenting the
errors, because it does affect how to handle them. See this for a glibc
perspective:
https://sourceware.org/ml/libc-alpha/2014-09/msg00381.html
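One consequence worth spelling out in the documentation: a robust waiter cannot treat a 0 return as proof of anything and must re-check the condition it is waiting for.  A sketch in C (my illustration, not glibc's actual code; the word is pre-signaled so the single-threaded demo returns immediately):

```c
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* futex word: nonzero means "event happened"; pre-set for the demo */
static _Atomic unsigned event = 1;

/* A 0 return from FUTEX_WAIT is not trusted: the loop re-reads the
   futex word, so a stale FUTEX_WAKE aimed at a previous user of this
   memory merely costs one extra iteration. */
static void wait_for_event(void)
{
    unsigned v;
    while ((v = atomic_load_explicit(&event, memory_order_relaxed)) == 0)
        syscall(SYS_futex, &event, FUTEX_WAIT, v, NULL, NULL, 0);
}
```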

Second, the current documentation for EINTR is that it can happen due to
receiving a signal *or* due to a spurious wake-up.  This is difficult to
handle when implementing POSIX semaphores, because they require that
EINTR is returned from SEM_WAIT if and only if the interruption was due
to a signal.  Thus, if FUTEX_WAIT returns EINTR, the semaphore
implementation can't return EINTR from sem_wait; see this for more
comments, including some discussion why use cases relying on the POSIX
requirement around EINTR are borderline timing-dependent:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/sem_waitcommon.c;h=96848d7ac5b6f8f1f3099b422deacc09323c796a;hb=HEAD#l282
Others have commented that aio_suspend has a similar issue; if EINTR
wouldn't in fact be returned spuriously, the POSIX-implementation-side
would get easier.

Third, I think it would be useful to -- somewhere -- explain which
behavior the futex operations would have conceptually when expressed by
C11 code.  We currently say that they wake up, sleep, etc, and which
values they return.  But we never say how to properly synchronize with
them on the userspace side.  The C11 memory model is probably the best
model to use on the userspace side, so that's why I'm arguing for this.
Basically, I think we need to (1) tell people that they should use
memory_order_relaxed accesses to the futex variable (ie, the memory
location associated with the whole futex construct on the kernel side --
or do we have another name for this?), and (2) give some conceptual
guarantees for the kernel-side synchronization so that one can use this
to derive how to use them correctly in userspace.
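As a concrete sketch of point (1), here is what the signaling side could look like in C11 atomics (illustrative only; the release fence before the relaxed store is one possible way to establish the happens-before needed for the payload -- the precise conceptual guarantees are exactly what I'm arguing we should document):

```c
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic unsigned fword;  /* the futex word */
static int payload;             /* data published under the flag */

static long publish_and_wake(int value)
{
    payload = value;
    /* ordering comes from the fence; the futex word itself is
       accessed with memory_order_relaxed */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&fword, 1, memory_order_relaxed);
    return syscall(SYS_futex, &fword, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```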

The man pages might not be the right place for this, and maybe we just
need a revision of "Futexes are tricky".  If you have other suggestions
for where to document this, or on the content, let me know.  (I'm also
willing to spend time on this :) ).


Torvald






Re: futex(2) man page update help request

2015-01-23 Thread Torvald Riegel
On Fri, 2015-01-16 at 16:46 -0800, Darren Hart wrote:
> On 1/16/15, 12:54 PM, "Michael Kerrisk (man-pages)"
>  wrote:
> 
> >Color me stupid, but I can't see this in futex_requeue(). Where is that
> >check that is "independent of the requeue type (normal/pi)"?
> >
> >When I look through futex_requeue(), all the likely looking sources
> >of EINVAL are governed by a check on the 'requeue_pi' argument.
> 
> 
> Right, in the non-PI case, I believe there are valid use cases: move to
> the back of the FIFO, for example (OK, maybe the only example?).

But we never guarantee a futex is a FIFO, or do we?  If we don't, then
such a requeue could be implemented as a no-op by the kernel, which
would sort of invalidate the use case.

(And I guess we don't want to guarantee FIFO behavior for futexes.)




Re: POSIX mutex destruction requirements vs. futexes

2014-12-01 Thread Torvald Riegel
On Mon, 2014-12-01 at 10:31 -0800, Linus Torvalds wrote:
> On Mon, Dec 1, 2014 at 4:05 AM, Torvald Riegel  wrote:
> > On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote:
> >>
> >> > (1)  Allow spurious wake-ups from FUTEX_WAIT.
> >>
> >> because afaik that is what we actually *do* today (we'll wake up
> >> whoever re-used that location in another thread), and it's mainly
> >> about the whole documentation issue. No?
> >
> > If that's what the kernel prefers to do, this would be fine with me.
> 
> I think it's more of a "can we even do anything else"?
> 
> The kernel doesn't even see the reuse of the futex, or the fast path
> parts. Not seeing the fast path is obviously by design, and not seeing
> the reuse is partly due to pthreads interfaces (I guess people should
> generally call mutex_destroy, but I doubt people really do, and even
> if they did, how would user space actually handle the nasty race
> between "pthread_unlock -> stale futex_wake" "pthread_mutex_destroy()"
> anyway?).

User space could count stale (or, pending) futex_wake calls and spin in
pthread_mutex_destroy() until this count is zero.  However, that would
increase contention significantly, and we must spin, not block in
pthread_mutex_destroy() at least for process-shared mutexes, because
there's no place to put a futex for this destruction-time blocking that
is not subject to the same issue again.  (For process-private futexes,
this could be a memory location that is only ever used by glibc.)
There might even be more issues related to unmapping memory of
process-shared mutexes based on reference-counting in the critical
sections.

The additional contention of counting stale futex_wake's worries me
most.  You only need to count when threads actually use futexes to
block, and perhaps glibc's mutex implementation could be changed to
spin-wait aggressively, and perhaps we could add some explicit handover
to other lock owners or something similar to avoid issuing a futex_wake
unless really necessary -- but this seems like quite a lot of hoops to
jump through to work around a relatively small aspect of the current
futexes.
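For concreteness, the counting scheme described above might look roughly like this (a purely hypothetical sketch -- glibc does not implement it, and the contention cost is exactly the concern raised):

```c
#include <stdatomic.h>

static _Atomic int pending_wakes;  /* slow-path wakes in flight */

static void unlock_slow_path_wake(void)
{
    atomic_fetch_add_explicit(&pending_wakes, 1, memory_order_relaxed);
    /* ... issue FUTEX_WAKE on the mutex futex word here ... */
    atomic_fetch_sub_explicit(&pending_wakes, 1, memory_order_release);
}

static void mutex_destroy_spin(void)
{
    /* must spin, not block: for process-shared mutexes there is no
       safe place for a second futex to block on */
    while (atomic_load_explicit(&pending_wakes, memory_order_acquire) != 0)
        ;
}
```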

> So the thing is, I don't see how we could possibly change the existing
> FUTEX_WAKE behavior.
> 
> And introducing a new "debug mode" that explicitly adds spurious
> events might as well be done in user space with a wrapper about
> FUTEX_WAKE or something.

User space could introduce a wrapper (e.g., glibc could provide a futex
wrapper that allows spurious wake-ups on a return of 0) -- but glibc
can't prevent users from using the futex syscall directly rather than
through the wrapper.  Or should it try to intercept direct, non-wrapper
uses of the futex syscall in some way?

That's why I included a "debug mode" -- I'd rather call it a "new" mode
than a debug mode, because it's not really about debugging -- in the
list of options (ie, options (2) and (3)).  This would allow glibc (and
other libraries) to use the futex variants with the new semantics in the
most natural way.  And yet it would not create situations in which calls to
the old futex variant *appear* to allow spurious wake-ups (due to
glibc's or other libraries' fault -- libstdc++ is in the same boat, for
example).

> Because as far as the *kernel* is concerned, there is no "spurious
> wake" event.  It's not spurious. It's real and exists, and the wakeup
> was really done by the user. The fact that it's really a stale wakeup
> for a previous allocation of a pthread mutex data structure is
> something that is completely and fundamentally invisible to the
> kernel.
> 
> No?

I agree, and that's why I mentioned that it may seem odd to fix this on
the level of the kernel interface.  However, it just seems the best
approach when considering practice in kernel and user space, the
by-design futex properties, and the realities of what POSIX requires.

> So even from a documentation standpoint, it's really not about "kernel
> interfaces" being incorrectly documented, as about a big honking
> warning about internal pthread_mutex() implementation in a library,
> and the impact that library choice has on allocation re-use, and how
> that means that even if the FUTEX_WAKE is guaranteed to not be
> spurious from a *kernel* standpoint, certain use models will create
> their own spurious wake events in very subtle ways.

I agree it's not the kernel's fault, but that doesn't solve the dilemma.
It's one aspect of the futex design -- whether spurious wake-ups are
allowed for a return of 0 -- that makes it hard to use futexes for
POSIX, C++11 (and C11, most likely) synchronization primitives without
user space violating the k

Re: POSIX mutex destruction requirements vs. futexes

2014-12-01 Thread Torvald Riegel
On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote:
> On Thu, Nov 27, 2014 at 6:27 AM, Torvald Riegel  wrote:
> > This means that in the second example above, futex_wake can be
> > concurrent with whatever happens on the mutex' memory location after the
> > mutex has been destroyed.  Examples are:
> >   * The memory is unmapped.  futex_wake will return an error.  OK.
> >   * The memory is reused, but not for a futex.  No thread will get
> > woken.  OK.
> >   * The memory is reused for another glibc mutex.  The slow-path
> > futex wake will now hit another, unrelated futex -- but the
> > mutex implementation is robust to such spurious wake-ups anyway,
> > because it can always happen when a mutex is acquired and
> > released more than once.  OK.
> >   * The memory is reused for another futex in some custom data
> > structure that expects there is just one wait/wake cycle, and
> > relies on  FUTEX_WAIT returning 0 to mean that this is caused by
> > the matching FUTEX_WAKE call by *this* data structure.  Not OK,
> > because now the delayed slow-path wake-up introduces a spurious
> > wake-up in an unrelated futex.
> >
> > Thus, introducing spurious wake-ups is the core issue.
> 
> So my gut feeling is that we should just try to see if we can live
> with spurious wakeups, ie your:
> 
> > (1)  Allow spurious wake-ups from FUTEX_WAIT.
> 
> because afaik that is what we actually *do* today (we'll wake up
> whoever re-used that location in another thread), and it's mainly
> about the whole documentation issue. No?

If that's what the kernel prefers to do, this would be fine with me.  In
this case, I'd follow up with Michael Kerrisk to get this documented.  I
guess something along the lines of this would work:
  "Return 0 if the process was woken by a FUTEX_WAKE call, or
  spuriously.
  (Note that these spurious wake-ups can result from futex operations by
  other threads or processes.)"
A more detailed explanation outside of the manpages (ie, the rationale
for this wording) would also be useful for users I suppose.

However, I've heard others being more concerned over this change in
semantics.  Therefore, I'll describe the pros/cons that I see, just so
that we're all on the same page on what the decision for (1) would mean.

The change in return-value-of-0 semantics that (1) is about can violate
the futex semantics that existing code like the following example might
assume:

Initially f==0

Thread 1:
  atomic_store_relaxed(&f, 1);
  futex_wake(&f, 1, ...);

Thread 2:
  do {
err = futex_wait(&f, 0, NULL);
// handle fatal errors like ENOSYS, EINVAL, ...
  } while (err == EINTR);
  // err must have been EWOULDBLOCK or 0

For this to work according to the specs, the program itself must not
have pending FUTEX_WAKE's to f (e.g., by having a barrier after the
operations by both threads).  The program can use glibc pthreads
synchronization operations, because they don't specify anywhere that
they cause concurrent FUTEX_WAKE's after the synchronization data
structures have been destroyed and their memory has been reused.

For this to work in practice, the program would need to not use
reference-counting to trigger destruction of pthread synchronization
data structures (or do something similar that causes a mutex unlock that
has not returned to the caller to be concurrent with destruction).

Thus, there can be correct programs out there that never are affected by
spurious wake-ups, but now would have to change if they want to be
correct after the change in documentation.  I would bet that those cases
are the minority, but I don't have any data to back this up or even make
a guess about how small a minority they would be.

OTOH, in practice, they would *not* become buggy after the documentation
change, *unless* the kernel should decide to, in the future, also
introduce spurious wake-ups due to other reasons and return 0 instead of
EINTR.  (That's why I put the note in the suggested docs change above.)

One could also argue that the current spec actually allows the spurious
wake-ups we are concerned with.  The manpage currently states:
  "Returns 0 if the process was woken by a FUTEX_WAKE call."
The spurious wake-ups *are* due to other FUTEX_WAKE calls, just not due
to the calls that the piece of code assumed they might come from.  But
that's probably not quite obvious :)

It also seems a little odd that consequences of the design of how
futexes are used in practice (ie, fast/slow path split) and the POSIX/C
++11 requirements on destruction bleed into the semantics of kernel
operations.  However, allowing spurious wake-ups for a return value of 0
might be the simplest way to

POSIX mutex destruction requirements vs. futexes

2014-11-27 Thread Torvald Riegel
TLDR: POSIX and C++11 make implementation requirements for the
destruction of mutexes that, if we want to continue to use the userspace
fastpath / kernel slowpath approach of non-PI futexes, conflict with
spurious wake-ups not being allowed for FUTEX_WAIT calls that return 0.
The best solutions seem to be to either allow spurious wake-ups, or
create new variants of FUTEX_WAIT and/or FUTEX_WAKE that allow spurious
wake-ups.


Using reference-counting in critical sections to decide when the mutex
protecting the critical section can be destroyed has been recently
discussed on LKML.   For example, something like this is supposed to
work:

int free = 0;

mutex_lock(&s->lock);
if (--s->refcount == 0)
    free = 1;
mutex_unlock(&s->lock);
if (free)
    kfree(s);

The Austin Group has clarified that pthreads mutexes have to support
such uses:
http://austingroupbugs.net/view.php?id=811
C++11 makes essentially the same requirement (see 30.4.1.2.1p5;
destruction is allowed if no thread owns the mutex; it is not required
that all unlock calls have returned before destruction starts).
C11 is silent on this particular aspect of mutex semantics, but it
usually follows POSIX and/or C++ semantics.

The destruction requirement doesn't just apply in cases of explicit
reference counting.  Another example would be:

Thread1:
  pthread_mutex_lock(&m);
  pthread_create(..., Thread2, ...);
  pthread_mutex_unlock(&m);

Thread2:
  pthread_mutex_lock(&m);
  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  // reuse the memory of mutex m for something else;
  // a new, unrelated futex happens to reside at the same address as
  // m's internal futex

Here, the program has ensured in a different way that no thread owns the
mutex when it is destructed.

This requirement is tough to implement for glibc -- or with futexes in
general -- because what one would like to do in a mutex unlock
implementation based on futexes is the following, roughly:

lock():
  while (1) {
// fast path: assume uncontended lock
    if (atomic_compare_exchange_acquire(&futex, NOT_ACQUIRED, ACQUIRED)
        == SUCCESS)
      return;
// slow path: signal that there is a slow-path waiter and block
prev = atomic_exchange(&futex, ACQUIRED_AND_WAITERS);
if (prev == NOT_ACQUIRED) return;
futex_wait(&futex, ACQUIRED_AND_WAITERS, ...);
  }

unlock():
  // fast path unlock
  prev = atomic_exchange_release(&futex, NOT_ACQUIRED);
  // slow path unlock
  if (prev == ACQUIRED_AND_WAITERS)
futex_wake(&futex, ...);

Having the fast path is good for performance.  It enables spinning on
acquisition, and avoids a syscall when releasing the mutex.  The latter
is good in the uncontended case but also when there is contention,
because another spinning thread may be able to acquire the mutex more
quickly than if we'd require the kernel to wake a blocked thread -- thus
increasing scalability.
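The pseudocode above can be rendered as compilable C11 (a sketch under an assumed 0 = NOT_ACQUIRED, 1 = ACQUIRED, 2 = ACQUIRED_AND_WAITERS encoding; syscall error handling omitted):

```c
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef _Atomic unsigned futex_t;

static futex_t demo_lock;  /* for the single-threaded demo below */

static void futex_lock(futex_t *f)
{
    unsigned exp = 0;
    /* fast path: uncontended acquisition, no syscall */
    if (atomic_compare_exchange_strong_explicit(f, &exp, 1,
            memory_order_acquire, memory_order_relaxed))
        return;
    /* slow path: mark waiters present, then block */
    while (atomic_exchange_explicit(f, 2, memory_order_acquire) != 0)
        syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void futex_unlock(futex_t *f)
{
    /* fast path: release without a syscall; wake only if waiters
       were signaled -- this futex_wake is the call that can outlive
       the critical section, as discussed below */
    if (atomic_exchange_explicit(f, 0, memory_order_release) == 2)
        syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```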

However, because another thread can acquire the mutex right after the
fast path of the unlock(), the next steps of this thread can then be
concurrent with the slow path of the unlock(), in particular the
futex_wake call.  All we need to make such a pending, concurrent
futex_wake call possible is that at some time in the past, we were in
the ACQUIRED_AND_WAITERS state.

This means that in the second example above, futex_wake can be
concurrent with whatever happens on the mutex' memory location after the
mutex has been destroyed.  Examples are:
  * The memory is unmapped.  futex_wake will return an error.  OK.
  * The memory is reused, but not for a futex.  No thread will get
woken.  OK.
  * The memory is reused for another glibc mutex.  The slow-path
futex wake will now hit another, unrelated futex -- but the
mutex implementation is robust to such spurious wake-ups anyway,
because it can always happen when a mutex is acquired and
released more than once.  OK.
  * The memory is reused for another futex in some custom data
structure that expects there is just one wait/wake cycle, and
relies on  FUTEX_WAIT returning 0 to mean that this is caused by
the matching FUTEX_WAKE call by *this* data structure.  Not OK,
because now the delayed slow-path wake-up introduces a spurious
wake-up in an unrelated futex.

Thus, introducing spurious wake-ups is the core issue.  This is not
restricted to pthreads mutexes
(https://sourceware.org/bugzilla/show_bug.cgi?id=13690) but also is an
issue for semaphores
(https://sourceware.org/bugzilla/show_bug.cgi?id=12674; I have a patch
that I'll send upstream soon that fixes the memory access issues under
concurrent destruction -- the futex issue remains) and barriers
(https://sourceware.org/bugzilla/show_bug.cgi?id=13065).
In general, this is an issue for all futex uses that rely on this same
fast-path / slow-path pattern and a similar destruction scheme.


There are ways to solve this in userspace only and with the existing

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o  Do not use the results from the boolean "&&" and "||" 
> > > > > > > > when
> > > > > > > > +   dereferencing.  For example, the following (rather 
> > > > > > > > improbable)
> > > > > > > > +   code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1)
> > > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   The reason this is buggy is that "&&" and "||" are 
> > > > > > > > often compiled
> > > > > > > > +   using branches.  While weak-memory machines such as ARM 
> > > > > > > > or PowerPC
> > > > > > > > +   do order stores after such branches, they can speculate 
> > > > > > > > loads,
> > > > > > > > +   which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o  Do not use the results from relational operators ("==", 
> > > > > > > > "!=",
> > > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For 
> > > > > > > > example,
> > > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int flip_index = 0;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1)
> > > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   As before, the reason this is buggy is that relational 
> > > > > > > > operators
> > > > > > > > +   are often compiled using branches.  And as before, 
> > > > > > > > although
> > > > > > > > +   weak-memory machines such as ARM or PowerPC do order 
> > > > > > > > stores
> > > > > > > > +   after such branches, but can speculate loads, which can 
> > > > > > > > again
> > > > > > > > +   result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > (unless
> > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > values that
> > > > > > > flip_index can have).

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +o  Do not use the results from the boolean "&&" and "||" 
> > > > > > > when
> > > > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > > > + code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int force_zero_index = 1;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1)
> > > > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + The reason this is buggy is that "&&" and "||" are often 
> > > > > > > compiled
> > > > > > > + using branches.  While weak-memory machines such as ARM or 
> > > > > > > PowerPC
> > > > > > > + do order stores after such branches, they can speculate loads,
> > > > > > > + which can result in misordering bugs.
> > > > > > > +
> > > > > > > +o  Do not use the results from relational operators ("==", 
> > > > > > > "!=",
> > > > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > + the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int flip_index = 0;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1)
> > > > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + As before, the reason this is buggy is that relational operators
> > > > > > > + are often compiled using branches.  And as before, although
> > > > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > + after such branches, but can speculate loads, which can again
> > > > > > > + result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > (unless
> > > > > > there are further constraints due to the type of r1 and the values 
> > > > > > that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
> On 3 March 2014 20:44, Torvald Riegel  wrote:
> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney 
> >> >>  wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> >  wrote:
> >> >> >> > >
> >> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> >> > > where that other pointer was properly fetched using one
> >> >> >> > > of the RCU primitives.  Here it doesn't matter which 
> >> >> >> > > pointer
> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() 
> >> >> >> > > for
> >> >> >> > > that other pointer happened after the last update to the
> >> >> >> > > pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and 
> >> >> >> > saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >p = atomic_read(consume);
> >> >> >> >X;
> >> >> >> >q = atomic_read(consume);
> >> >> >> >Y;
> >> >> >> >if (p == q)
> >> >> >> > data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the 
> >> >> >> > compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this 
> >> >> >> > would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the 
> >> >> >> only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU 
> >> >> >> can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> >> say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>   if (p == NULL)
> >> >> >>   deal_with_it();
> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >> >>   p->b = 43;
> >> >> >>   p->c = 44;
> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>   p->b = 143;
> >> >> >>   p->c = 144;
> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > +   code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int force_zero_index = 1;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   The reason this is buggy is that "&&" and "||" are often 
> > > > > > compiled
> > > > > > +   using branches.  While weak-memory machines such as ARM or 
> > > > > > PowerPC
> > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > +   which can result in misordering bugs.
> > > > > > +
> > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +   the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int flip_index = 0;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > +   are often compiled using branches.  And as before, although
> > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +   after such branches, but can speculate loads, which can again
> > > > > > +   result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values 
> > > > > that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.  This assumes that p->a only returns
> > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > through the program.  ;-)
> > 
> > The other thing that can happen is that a vdp can get handed off to
> > another synchronization mechanism, for example, to reference counting:
> > 
> >

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > + code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int force_zero_index = 1;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > + The reason this is buggy is that "&&" and "||" are often 
> > > > > compiled
> > > > > + using branches.  While weak-memory machines such as ARM or 
> > > > > PowerPC
> > > > > + do order stores after such branches, they can speculate loads,
> > > > > + which can result in misordering bugs.
> > > > > +
> > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > + the following (quite strange) code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int flip_index = 0;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > + As before, the reason this is buggy is that relational operators
> > > > > + are often compiled using branches.  And as before, although
> > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > + after such branches, but can speculate loads, which can again
> > > > > + result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.

I currently don't see a reason why that should be disallowed.  If we
have allowed an implicit conversion to non-vdp, I believe that should
follow.  ?: could be somewhat special, in that the type depends on the
2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
disallowed if we don't provide for implicit casts from non-vdp to vdp.
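To make the ?: cases concrete, here is a sketch in the made-up syntax from
earlier in the thread ("vdp" standing for a value_dep_preserving qualifier;
this is purely hypothetical and not accepted by any real compiler):

```c
/* Hypothetical "vdp" (value_dep_preserving) qualifier -- illustration only. */
int *vdp p = ...;          /* vdp pointers, e.g. results of mo_consume loads */
int *vdp q = ...;
int *r = ...;              /* ordinary, non-vdp pointer */

int *vdp a = cond ? p : q; /* OK: both arms are vdp */
int *b     = cond ? p : q; /* OK: implicit conversion vdp -> non-vdp */
int *vdp c = cond ? r : q; /* disallowed: would require an implicit
                              non-vdp -> vdp conversion for r */
```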

> This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)

That's a good question.  For the scheme I had in mind, I'm not concerned
about vdps running wild because one needs t

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > 
> > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > +o  Do not use the results from the boolean "&&" and "||" when
> > > + dereferencing.  For example, the following (rather improbable)
> > > + code is buggy:
> > > +
> > > + int a[2];
> > > + int index;
> > > + int force_zero_index = 1;
> > > +
> > > + ...
> > > +
> > > + r1 = rcu_dereference(i1)
> > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > +
> > > + The reason this is buggy is that "&&" and "||" are often compiled
> > > + using branches.  While weak-memory machines such as ARM or PowerPC
> > > + do order stores after such branches, they can speculate loads,
> > > + which can result in misordering bugs.
> > > +
> > > +o  Do not use the results from relational operators ("==", "!=",
> > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > + the following (quite strange) code is buggy:
> > > +
> > > + int a[2];
> > > + int index;
> > > + int flip_index = 0;
> > > +
> > > + ...
> > > +
> > > + r1 = rcu_dereference(i1)
> > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > +
> > > + As before, the reason this is buggy is that relational operators
> > > + are often compiled using branches.  And as before, although
> > > + weak-memory machines such as ARM or PowerPC do order stores
> > > + after such branches, but can speculate loads, which can again
> > > + result in misordering bugs.
> > 
> > Those two would be allowed by the wording I have recently proposed,
> > AFAICS.  r1 != flip_index would result in two possible values (unless
> > there are further constraints due to the type of r1 and the values that
> > flip_index can have).
> 
> And I am OK with the value_dep_preserving type providing more/better
> guarantees than we get by default from current compilers.
> 
> One question, though.  Suppose that the code did not want a value
> dependency to be tracked through a comparison operator.  What does
> the developer do in that case?  (The reason I ask is that I have
> not yet found a use case in the Linux kernel that expects a value
> dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney  
> >> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >  wrote:
> >> >> > >
> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> > > where that other pointer was properly fetched using one
> >> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> >> > > that other pointer happened after the last update to the
> >> >> > > pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >p = atomic_read(consume);
> >> >> >X;
> >> >> >q = atomic_read(consume);
> >> >> >Y;
> >> >> >if (p == q)
> >> >> > data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>   if (p == NULL)
> >> >>   deal_with_it();
> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >>   p->b = 43;
> >> >>   p->c = 44;
> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>   p->b = 143;
> >> >>   p->c = 144;
> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>   if (p == q) {
> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >>   r2 = p->c; /* Could get 44 on weakly order system. */
> >> >>   }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a 
> >> >> lock
> >> >> in the structure and use it to guard updates and accesses to those 
> >> >> fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to docum

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> +o  Do not use the results from the boolean "&&" and "||" when
> + dereferencing.  For example, the following (rather improbable)
> + code is buggy:
> +
> + int a[2];
> + int index;
> + int force_zero_index = 1;
> +
> + ...
> +
> + r1 = rcu_dereference(i1)
> + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> + The reason this is buggy is that "&&" and "||" are often compiled
> + using branches.  While weak-memory machines such as ARM or PowerPC
> + do order stores after such branches, they can speculate loads,
> + which can result in misordering bugs.
> +
> +o  Do not use the results from relational operators ("==", "!=",
> + ">", ">=", "<", or "<=") when dereferencing.  For example,
> + the following (quite strange) code is buggy:
> +
> + int a[2];
> + int index;
> + int flip_index = 0;
> +
> + ...
> +
> + r1 = rcu_dereference(i1)
> + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> + As before, the reason this is buggy is that relational operators
> + are often compiled using branches.  And as before, although
> + weak-memory machines such as ARM or PowerPC do order stores
> + after such branches, but can speculate loads, which can again
> + result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).

I don't think the wording is flawed.  We could raise the requirement
from having more than one possible value left for r1 to having more than
N values left (for some N > 1), but the fundamental problem remains: a
compiler could still try to enumerate the remaining values in a (big)
switch statement.

Instead, I think that this indicates that the value_dep_preserving type
modifier would be useful: It would tell the compiler that it shouldn't
transform this into a branch in this case, yet allow that optimization
for all other code.
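As a minimal sketch of the transformation at issue (plain C; the actual
misordering only manifests on weakly ordered hardware, so this merely shows
the two code shapes, not the bug): the first function feeds the comparison
result into the load's address arithmetic, while the second is what a
compiler may emit instead.  Both return identical values, but in the
branchy form each arm loads from a fixed address, so the load no longer
carries a value dependency on r1.

```c
int a[2] = {10, 20};
int flip_index = 0;

/* Data-dependent form: the index arithmetic depends on r1. */
int load_dependent(int r1) {
    return a[r1 != flip_index];
}

/* What a compiler may emit instead: the comparison becomes a branch,
 * and each arm loads a fixed address -- same values, no dependency. */
int load_branchy(int r1) {
    if (r1 != flip_index)
        return a[1];
    return a[0];
}
```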



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote:
> Your proposal looks quite promising at first glance.  But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results
> 
> Fair enough?

Sure.  Thanks for doing the cross-check!



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>  wrote:
> >
> > 3.  The comparison was against another RCU-protected pointer,
> > where that other pointer was properly fetched using one
> > of the RCU primitives.  Here it doesn't matter which pointer
> > you use.  At least as long as the rcu_assign_pointer() for
> > that other pointer happened after the last update to the
> > pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>p = atomic_read(consume);
>X;
>q = atomic_read(consume);
>Y;
>if (p == q)
> data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.

The wording I proposed would make the p dereference have a value
dependency unless X and Y would somehow restrict p and q.  The reasoning
is that if the atomic loads return potentially more than one value, then
even if we find out that two such loads did return the same value, we
still don't know what the exact value was.
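A sketch of the interchangeability in question (plain C, hypothetical helper
names): once "p == q" has been established, the compiler may perform the
dereference through either pointer, since the sequential results are
identical -- only the dependency ordering distinguishes the two forms.

```c
struct val { int v; };

/* Source names p for the dereference... */
int deref_source_p(struct val *p, struct val *q) {
    if (p == q)
        return p->v;
    return -1;
}

/* ...but after the comparison succeeds, using q is equally valid
 * to the compiler: the loaded value is the same. */
int deref_source_q(struct val *p, struct val *q) {
    if (p == q)
        return q->v;
    return -1;
}
```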



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == &variable) .. use p"), and you would forbid the
> "global value range optimization case" that Paul bright up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.
> 
> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).

The other example that comes to mind would be feedback-directed JIT
compilation.  I don't think that's widely used today, and it might never
be for the kernel -- but *in the standard*, we at least have to consider
what the future might bring.

> The good thing is that afaik, value-prediction is largely not used in
> real life, afaik. There are lots of papers on it, but I don't think
> anybody actually does it (although I can easily see some
> specint-specific optimization pattern that is build up around it).
> 
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume). So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers. But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.

I think one would need to show that the source is *not even indirectly*
a mo_consume load.  With the wording I proposed, value dependencies
don't break when storing to / loading from memory locations.

Thus, if a compiler ends up at a memory load after walking SSA, it needs
to prove that the load cannot read a value that (1) was produced by a
store sequenced-before the load and (2) might carry a value dependency
(e.g., by being a mo_consume load) that the value prediction in question
would break.  This, in general, requires alias analysis.
Deciding whether a prediction would break a value dependency has to
consider what later stages in a compiler would be doing, including LTO
or further rounds of inlining/optimizations.  OTOH, if the compiler can
treat an mo_consume load as returning all possible values (eg, by
ignoring all knowledge about it), then it can certainly do so with other
memory loads too.

So, I think that the constraints due to value dependencies can matter in
practice.  However, the impact on optimizations of
non-mo_consume-related code is hard to estimate -- I don't see a huge
amount of impact right now, but I also wouldn't want to predict that
this can't change in the future.
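To illustrate why value prediction breaks a value dependency, here is a
hand-written stand-in (plain C; "predicted" plays the role of a
profile-derived guess, and none of this is real compiler output) for the
kind of code PGO-driven value prediction could emit:

```c
struct node { int v; };
struct node *predicted;           /* stands in for a profile-derived guess */

/* Plain form: the load address value-depends on p. */
int read_plain(struct node *p) {
    return p->v;
}

/* Value-predicted form: the guess check is a branch, so on the hit
 * path the load depends only on the prediction, not on p -- the value
 * dependency has become a control dependency. */
int read_value_predicted(struct node *p) {
    if (p == predicted)
        return predicted->v;
    return p->v;                  /* misprediction fallback */
}
```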

> Of course, on more strongly ordered CPU's, even that "register
> argument" limitation goes away.
> 
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
> 
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> > operations on data of this type are preserving value dependencies:
> 
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering. So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
> 
> Anyway, the reason I don't believe any type m

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Torvald Riegel
On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
>  wrote:
> >
> > Good points.  How about the following replacements?
> >
> > 3.  Adding or subtracting an integer to/from a chained pointer
> > results in another chained pointer in that same pointer chain.
> > The results of addition and subtraction operations that cancel
> > the chained pointer's value (for example, "p-(long)p" where "p"
> > is a pointer to char) are implementation defined.
> >
> > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > applied to a chained pointer and an integer for the purposes
> > of alignment and pointer translation results in another
> > chained pointer in that same pointer chain.  Other uses
> > of bitwise operators on chained pointers (for example,
> > "p|~0") are implementation defined.
> 
> Quite frankly, I think all of this language that is about the actual
> operations is irrelevant and wrong.
> 
> It's not going to help compiler writers, and it sure isn't going to
> help users that read this.
> 
> Why not just talk about "value chains" and that any operations that
> restrict the value range severely end up breaking the chain. There is
> no point in listing the operations individually, because every single
> operation *can* restrict things. Listing individual operations and
> dependencies is just fundamentally wrong.

[...]

> The *only* thing that matters for all of them is whether they are
> "value-preserving", or whether they drop so much information that the
> compiler might decide to use a control dependency instead. That's true
> for every single one of them.
> 
> Similarly, actual true control dependencies that limit the problem
> space sufficiently that the actual pointer value no longer has
> significant information in it (see the above example) are also things
> that remove information to the point that only a control dependency
> remains. Even when the value itself is not modified in any way at all.

I agree that just considering syntactic properties of the program seems
to be insufficient.  Making it instead depend on whether there is a
"semantic" dependency due to a value being "necessary" to compute a
result seems better.  However, whether a value is "necessary" might not
be obvious, and I understand Paul's argument that he does not want to
have to reason about all potential compiler optimizations.  Thus, I
believe we need to specify when a value is "necessary".

I have a suggestion for a somewhat different formulation of the feature
that you seem to have in mind, which I'll discuss below.  Excuse the
verbosity of the following, but I'd rather like to avoid
misunderstandings than save a few words.


What we'd like to capture is that a value originating from a mo_consume
load is "necessary" for a computation (e.g., it "cannot" be replaced
with value predictions and/or control dependencies); if that's the case
in the program, we can reasonably assume that a compiler implementation
will transform this into a data dependency, which will then lead to
ordering guarantees by the HW.

However, we need to specify when a value is "necessary".  We could say
that this is implementation-defined, and use a set of litmus tests
(e.g., like those discussed in the thread) to roughly carve out what a
programmer could expect.  This may even be practical for a project like
the Linux kernel that follows strict project-internal rules and pays a
lot of attention to what the particular implementations of compilers
expected to compile the kernel are doing.  However, I think this
approach would be too vague for the standard and for many other
programs/projects.


One way to understand "necessary" would be to say that if a mo_consume
load can result in more than V different values, then the actual value
is "unknown", and thus "necessary" to compute anything based on it.
(But this is flawed, as discussed below.)

However, how big should V be?  If it's larger than 1, atomic bool cannot
be used with mo_consume, which seems weird.
If V is 1, then Linus' litmus tests work (but Paul's doesn't; see
below), but the compiler must not try to predict more than one value.
This holds for any choice of V, so there always is an *additional*
constraint on code generation for operations that are meant to take part
in such "value dependencies".  The bigger V might be, the less likely it
should be for this to actually constrain a particular compiler's
optimizations (e.g., while it might be meaningful to use value
prediction for two or three values, it's probably not for 1000s).
Nonetheless, if we don't want to predict the future, we need to specify
V.  Given that we always have some constraint for code generation
anyway, and given that V > 1 might be an arbitrary-looking constraint
and disallows use on atomic bool, I believe V should be 1.

Furthermore, there is a problem in saying "a load can 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Wed, 2014-02-26 at 18:43 +, Joseph S. Myers wrote:
> On Wed, 26 Feb 2014, Torvald Riegel wrote:
> 
> > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
> > > On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> > > 
> > > > This needs to be as follows:
> > > > 
> > > > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > > > {
> > > > return i - i;
> > > > }
> > > > 
> > > > Otherwise dependencies won't get carried through it.
> > > 
> > > C11 doesn't have attributes at all (and no specification regarding calls 
> > > and dependencies that I can see).  And the way I read the C++11 
> > > specification of carries_dependency is that specifying carries_dependency 
> > > is purely about increasing optimization of the caller: that if it isn't 
> > > specified, then the caller doesn't know what dependencies might be 
> > > carried.  "Note: The carries_dependency attribute does not change the 
> > > meaning of the program, but may result in generation of more efficient 
> > > code. - end note".
> > 
> > I think that this last sentence can be kind of misleading, especially
> > when looking at it from an implementation point of view.  How
> > dependencies are handled (ie, preserving the syntactic dependencies vs.
> > emitting barriers) must be part of the ABI, or things like
> > [[carries_dependency]] won't work as expected (or lead to inefficient
> > code).  Thus, in practice, all compiler vendors on a platform would have
> > to agree to a particular handling, which might end up in selecting the
> > easy-but-conservative implementation option (ie, always emitting
> > mo_acquire when the source uses mo_consume).
> 
> Regardless of the ABI, my point is that if a program is valid, it is also 
> valid when all uses of [[carries_dependency]] are removed.  If a function 
> doesn't use [[carries_dependency]], that means "dependencies may or may 
> not be carried through this function".  If a function uses 
> [[carries_dependency]], that means that certain dependencies *are* carried 
> through the function (and the ABI should then specify what this means the 
> caller can rely on, in terms of the architecture's memory model).  (This 
> may or may not be useful, but it's how I understand C++11.)

I agree.  What I tried to point out is that it's not the case that an
*implementation* can just ignore [[carries_dependency]].  So from an
implementation perspective, the attribute does have semantics.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote:
> On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Mon, 24 Feb 2014, Linus Torvalds wrote:
> > 
> > > > To me that reads like
> > > >
> > > >   int i;
> > > >   int *q = &i;
> > > >   int **p = &q;
> > > >
> > > >   atomic_XXX (p, CONSUME);
> > > >
> > > > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > > > want to say that it orders against aliased storage - but then go further
> > > > and include "indirectly through a chain of pointers"?!  Thus an
> > > > atomic read of a int * orders against any 'int' memory operation but
> > > > not against 'float' memory operations?
> > > 
> > > No, it's not about type at all, and the "chain of pointers" can be
> > > much more complex than that, since the "int *" can point to within an
> > > object that contains other things than just that "int" (the "int" can
> > > be part of a structure that then has pointers to other structures
> > > etc).
> > 
> > So, let me try to poke holes into your definition or increase my 
> > understanding :) .  You said "chain of pointers"(dereferences I assume), 
> > e.g. if p is result of consume load, then access to 
> > p->here->there->next->prev->stuff is supposed to be ordered with that load 
> > (or only when that last load/store itself is also an atomic load or 
> > store?).
> > 
> > So, what happens if the pointer deref chain is partly hidden in some 
> > functions:
> > 
> > A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> > B * p = atomic_XXX (&somewhere, consume);
> > adjustptr(p)->prev->stuff = bla;
> > 
> > As far as I understood you, this whole ptrderef chain business would be 
> > only an optimization opportunity, right?  So if the compiler can't be sure 
> > how p is actually used (as in my function-using case, assume adjustptr is 
> > defined in another unit), then the consume load would simply be 
> > transformed into an acquire (or whatever, with some barrier I mean)?  Only 
> > _if_ the compiler sees all obvious uses of p (indirectly through pointer 
> > derefs) can it, yeah, do what with the consume load?
> 
> Good point, I left that out of my list.  Adding it:
> 
> 13.   By default, pointer chains do not propagate into or out of functions.
>   In implementations having attributes, a [[carries_dependency]]
>   may be used to mark a function argument or return as passing
>   a pointer chain into or out of that function.
> 
>   If a function does not contain memory_order_consume loads and
>   also does not contain [[carries_dependency]] attributes, then
>   that function may be compiled using any desired dependency-breaking
>   optimizations.
> 
>   The ordering effects are implementation defined when a given
>   pointer chain passes into or out of a function through a parameter
> or return not marked with a [[carries_dependency]] attribute.
> 
> Note that this last paragraph differs from the current standard, which
> would require ordering regardless.

I would prefer if we could get rid off [[carries_dependency]] as well;
currently, it's a hint whose effectiveness really depends on how the
particular implementation handles this attribute.  If we still need
something like it in the future, it would be good if it had a clearer
use and performance effects.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Mon, 2014-02-24 at 09:38 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz  wrote:
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers"(dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that load
> > (or only when that last load/store itself is also an atomic load or
> > store?).
> 
> It's supposed to be ordered wrt the first load (the consuming one), yes.
> 
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
> 
> No problem.
> 
> The thing is, the ordering is actually handled by the CPU in all
> relevant cases.  So the compiler doesn't actually need to *do*
> anything. All this legalistic stuff is just to describe the semantics
> and the guarantees.
> 
> The problem is two cases:
> 
>  (a) alpha (which doesn't really order any accesses at all, not even
> dependent loads), but for a compiler alpha is actually trivial: just
> add a "rmb" instruction after the load, and you can't really do
> anything else (there's a few optimizations you can do wrt the rmb, but
> they are very specific and simple).
> 
> So (a) is a "problem", but the solution is actually really simple, and
> gives very *strong* guarantees: on alpha, a "consume" ends up being
> basically the same as a read barrier after the load, with only very
> minimal room for optimization.
> 
>  (b) ARM and powerpc and similar architectures, that guarantee the
> data dependency as long as it is an *actual* data dependency, and
> never becomes a control dependency.
> 
> On ARM and powerpc, control dependencies do *not* order accesses (the
> reasons boil down to essentially: branch prediction breaks the
> dependency, and instructions that come after the branch can be happily
> executed before the branch). But it's almost impossible to describe
> that in the standard, since compilers can (and very much do) turn a
> control dependency into a data dependency and vice versa.
> 
> So the current standard tries to describe that "control vs data"
> dependency, and tries to limit it to a data dependency. It fails. It
> fails for multiple reasons - it doesn't allow for trivial
> optimizations that just remove the data dependency, and it also
> doesn't allow for various trivial cases where the compiler *does* turn
> the data dependency into a control dependency.
> 
> So I really really think that the current C standard language is
> broken. Unfixably broken.
> 
> I'm trying to change the "syntactic data dependency" that the current
> standard uses into something that is clearer and correct.
> 
> The "chain of pointers" thing is still obviously a data dependency,
> but by limiting it to pointers, it simplifies the language, clarifies
> the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
> syntactic dependency on "p", but does *not* involve in any way
> following the pointer) and makes it basically impossible for the
> compiler to break the dependency without doing value prediction, and
> since value prediction has to be disallowed anyway, that's a feature,
> not a bug.

AFAIU, Michael is wondering about how we can separate non-synchronizing
code (ie, in this case, not taking part in any "chain of pointers" used
with mo_consume loads) from code that does.  If we cannot, then we
prevent value prediction *everywhere*, unless the compiler can prove
that the code is never part of such a chain (which is hard due to alias
analysis being hard, etc.).
(We can probably argue to which extent value prediction is necessary for
generation of efficient code, but it obviously does work in
non-synchronizing code (or even with acquire barriers with some care) --
so forbidding it entirely might be bad.)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
> On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> 
> > This needs to be as follows:
> > 
> > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > {
> > return i - i;
> > }
> > 
> > Otherwise dependencies won't get carried through it.
> 
> C11 doesn't have attributes at all (and no specification regarding calls 
> and dependencies that I can see).  And the way I read the C++11 
> specification of carries_dependency is that specifying carries_dependency 
> is purely about increasing optimization of the caller: that if it isn't 
> specified, then the caller doesn't know what dependencies might be 
> carried.  "Note: The carries_dependency attribute does not change the 
> meaning of the program, but may result in generation of more efficient 
> code. - end note".

I think that this last sentence can be kind of misleading, especially
when looking at it from an implementation point of view.  How
dependencies are handled (ie, preserving the syntactic dependencies vs.
emitting barriers) must be part of the ABI, or things like
[[carries_dependency]] won't work as expected (or lead to inefficient
code).  Thus, in practice, all compiler vendors on a platform would have
to agree to a particular handling, which might end up in selecting the
easy-but-conservative implementation option (ie, always emitting
mo_acquire when the source uses mo_consume).



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
> On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Thu, 20 Feb 2014, Linus Torvalds wrote:
> > 
> > > But I'm pretty sure that any compiler guy must *hate* that current odd
> > > dependency-generation part, and if I was a gcc person, seeing that
> > > bugzilla entry Torvald pointed at, I would personally want to
> > > dismember somebody with a rusty spoon..
> > 
> > Yes.  Defining dependency chains in the way the standard currently seems 
> > to do must come from people not writing compilers.  There's simply no 
> > sensible way to implement it without being really conservative, because 
> > the depchains can contain arbitrary constructs including stores, 
> > loads and function calls but must still be observed.  
> > 
> > And with conservative I mean "everything is a source of a dependency, and 
> > hence can't be removed, reordered or otherwise fiddled with", and that 
> > includes code sequences where no atomic objects are anywhere in sight [1].
> > In the light of that the only realistic way (meaning to not have to 
> > disable optimization everywhere) to implement consume as currently 
> > specified is to map it to acquire.  At which point it becomes pointless.
> 
> No, only memory_order_consume loads and [[carries_dependency]]
> function arguments are sources of dependency chains.

However, given how the standard specifies things, that is just one of
the possible ways in which an implementation can handle this.  Treating
[[carries_dependency]] as a "necessary" annotation to make exploiting
mo_consume work in practice is possible, but it's not required by the
standard.

Also, dependencies are specified to flow through loads and stores
(restricted to scalar objects and bitfields), so any load that might
load from a dependency-carrying store can also be a source (and that
doesn't seem to be restricted by [[carries_dependency]]).



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-22 Thread Torvald Riegel
On Thu, 2014-02-20 at 11:09 -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel  wrote:
> > On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote:
> >> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
> >>  wrote:
> >> >
> >> > You really need that "consume" to be "acquire".
> >>
> >> So I think we now all agree that that is what the standard is saying.
> >
> > Huh?
> >
> > The standard says that there are two separate things (among many more):
> > mo_acquire and mo_consume.  They both influence happens-before in
> > different (and independent!) ways.
> >
> > What Paul is saying is that *you* should have used *acquire* in that
> > example.
> 
> I understand.
> 
> And I disagree. I think the standard is wrong, and what I *should* be
> doing is point out the fact very loudly, and just tell people to NEVER
> EVER use "consume" as long as it's not reliable and has insane
> semantics.

Stating (1) that "the standard is wrong" and (2) that you think the
mo_consume semantics are not good are two different things.  Making bold
statements without proper context isn't helpful in making this
discussion constructive.  It's simply not efficient if I (or anybody
else reading this) have to wonder whether you actually mean what you
said (even if, read literally, it is arguably not consistent with the
arguments brought up in the discussion) or whether those statements just
have to be interpreted in some other way.

> So what I "should do" is to not accept any C11 atomics use in the
> kernel.

You're obviously free to do that.

> Because with the "acquire", it generates worse code than what
> we already have,

I would argue that this is still under debate.  At least I haven't seen
a definition of what you want that is complete and based on the standard
(e.g., an example of what a compiler might do in a specific case isn't a
definition).  From what I've seen, it's not inconceivable that what you
want is just an optimized acquire.

I'll bring this question up again elsewhere in the thread (where it
hopefully fits better).



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-22 Thread Torvald Riegel
On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote:
> On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote:
> > 
> > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel  
> > > > wrote:
> > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > > > >>
> > > > >> Can you point to it? Because I can find a draft standard, and it sure
> > > > >> as hell does *not* contain any clarity of the model. It has a *lot* 
> > > > >> of
> > > > >> verbiage, but it's pretty much impossible to actually understand, 
> > > > >> even
> > > > >> for somebody who really understands memory ordering.
> > > > >
> > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > > This has an explanation of the model up front, and then the detailed
> > > > > formulae in Section 6.  This is from 2010, and there might have been
> > > > > smaller changes since then, but I'm not aware of any bigger ones.
> > > > 
> > > > Ahh, this is different from what others pointed at. Same people,
> > > > similar name, but not the same paper.
> > > > 
> > > > I will read this version too, but from reading the other one and the
> > > > standard in parallel and trying to make sense of it, it seems that I
> > > > may have originally misunderstood part of the whole control dependency
> > > > chain.
> > > > 
> > > > The fact that the left side of "? :", "&&" and "||" breaks data
> > > > dependencies made me originally think that the standard tried very
> > > > hard to break any control dependencies. Which I felt was insane, when
> > > > then some of the examples literally were about the testing of the
> > > > value of an atomic read. The data dependency matters quite a bit. The
> > > > fact that the other "Mathematical" paper then very much talked about
> > > > consume only in the sense of following a pointer made me think so even
> > > > more.
> > > > 
> > > > But reading it some more, I now think that the whole "data dependency"
> > > > logic (which is where the special left-hand side rule of the ternary
> > > > and logical operators come in) are basically an exception to the rule
> > > > that sequence points end up being also meaningful for ordering (ok, so
> > > > C11 seems to have renamed "sequence points" to "sequenced before").
> > > > 
> > > > So while an expression like
> > > > 
> > > > atomic_read(p, consume) ? a : b;
> > > > 
> > > > doesn't have a data dependency from the atomic read that forces
> > > > serialization, writing
> > > > 
> > > >if (atomic_read(p, consume))
> > > >   a;
> > > >else
> > > >   b;
> > > > 
> > > > the standard *does* imply that the atomic read is "happens-before" wrt
> > > > "a", and I'm hoping that there is no question that the control
> > > > dependency still acts as an ordering point.
> > > 
> > > The control dependency should order subsequent stores, at least assuming
> > > that "a" and "b" don't start off with identical stores that the compiler
> > > could pull out of the "if" and merge.  The same might also be true for ?:
> > > for all I know.  (But see below)
> > 
> > I don't think this is quite true.  I agree that a conditional store will
> > not be executed speculatively (note that if it would happen in both the
> > then and the else branch, it's not conditional); so, the store in
> > "a;" (assuming it would be a store) won't happen unless the thread can
> > really observe a true value for p.  However, this is *this thread's*
> > view of the world, but not guaranteed to constrain how any other thread
> > sees the state.  mo_consume does not contribute to
> > inter-thread-happens-before in the same way that mo_acquire does (which
> > *does* put a constraint on i-t-h-b, and thus enforces a global

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: 
> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
>  wrote:
> >
> > You really need that "consume" to be "acquire".
> 
> So I think we now all agree that that is what the standard is saying.

Huh?

The standard says that there are two separate things (among many more):
mo_acquire and mo_consume.  They both influence happens-before in
different (and independent!) ways.

What Paul is saying is that *you* should have used *acquire* in that
example.

> And I'm saying that that is wrong, that the standard is badly written,
> and should be fixed.
> 
> Because before the standard is fixed, I claim that "consume" is
> unusable. We cannot trust it. End of story.

Then we still have all the rest.  Let's just ignore mo_consume for now,
and look at mo_acquire, I suggest.

> The fact that apparently gcc is currently buggy because it got the
> dependency calculations *wrong* just reinforces my point.

Well, I'm pretty sure nobody actually worked on trying to preserve the
dependencies at all.  IOW, I suspect this fell through the cracks.  We
can ask the person working on this if you really want to know.

> The gcc bug Torvald pointed at is exactly because the current C
> standard is illogical unreadable CRAP.

It's obviously logically consistent to the extent that it can be
represented by a formal specification such as the one by the Cambridge
group.  Makes sense, or not?

> I can guarantee that what
> happened is:
> 
>  - the compiler saw that the result of the read was used as the left
> hand expression of the ternary "? :" operator
> 
>  - as a result, the compiler decided that there's no dependency
> 
>  - the compiler didn't think about the dependency that comes from the
> result of the load *also* being used as the middle part of the ternary
> expression, because it had optimized it away, despite the standard not
> talking about that at all.
> 
>  - so the compiler never saw the dependency that the standard talks about
> 
> BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.

Please, be specific.  Right now you're saying that all of it is useless.
Which is arguably not true.

> My suggested language never had any of these problems, because *my*
> suggested semantics are clear, logical, and don't have these kinds of
> idiotic pit-falls.

Have you looked at and understood the semantics of the memory model
(e.g. in the formalized form) with mo_consume and related being ignored
(ie, just ignore 6.13 and 6.14 in n3132)?




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote:
> But yes, the compiler guys would be extremely happy to simply drop
> memory_order_consume from the standard, as it is the memory order
> that they most love to hate.
> 
> Getting them to agree to any sort of peep-hole optimization semantics
> for memory_order_consume is likely problematic.  

I wouldn't be so pessimistic about that.  If the transformations can be
shown to be always correct in terms of the semantics specified in the
standard, and if the performance win is sufficiently large, why not?  Of
course, somebody has to volunteer to actually implement it :)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Thu, 2014-02-20 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney
>  wrote:
> >>
> >> So lets make this really simple: if you have a consume->cmp->read, is
> >> the ordering of the two reads guaranteed?
> >
> > Not as far as I know.  Also, as far as I know, there is no difference
> > between consume and relaxed in the consume->cmp->read case.
> 
> Ok, quite frankly, I think that means that "consume" is misdesigned.
> 
> > The above example can have a return value of 0 if translated
> > straightforwardly into either ARM or Power, right?
> 
> Correct. And I think that is too subtle. It's dangerous, it makes code
> that *looks* correct work incorrectly, and it actually happens to work
> on x86 since x86 doesn't have crap-for-brains memory ordering
> semantics.
> 
> > So, if you make one of two changes to your example, then I will agree
> > with you.
> 
> No. We're not playing games here. I'm fed up with complex examples
> that make no sense.

Note that Paul's second suggestion for a change was to just use
mo_acquire;  that's a simple change, and the easiest option, so it
should be just fine.

> Nobody sane writes code that does that pointer comparison, and it is
> entirely immaterial what the compiler can do behind our backs. The C
> standard semantics need to make sense to the *user* (ie programmer),
> not to a CPU and not to a compiler. The CPU and compiler are "tools".
> They don't matter. Their only job is to make the code *work*, dammit.
> 
> So no idiotic made-up examples that involve code that nobody will ever
> write and that have subtle issues.
> 
> So the starting point is that (same example as before, but with even
> clearer naming):
> 
> Initialization state:
>   initialized = 0;
>   value = 0;
> 
> Consumer:
> 
> return atomic_read(&initialized, consume) ? value : -1;
> 
> Writer:
> value = 42;
> atomic_write(&initialized, 1, release);
> 
> and because the C memory ordering standard is written in such a way
> that this is subtly buggy (and can return 0, which is *not* logically
> a valid value), then I think the C memory ordering standard is broken.
> 
> That "consumer" memory ordering is dangerous as hell, and it is
> dangerous FOR NO GOOD REASON.
> 
> The trivial "fix" to the standard would be to get rid of all the
> "carries a dependency" crap, and just say that *anything* that depends
> on it is ordered wrt it.
> 
> That just means that on alpha, "consume" implies an unconditional read
> barrier (well, unless the value is never used and is loaded just
> because it is also volatile), on x86, "consume" is the same as
> "acquire" which is just a plain load with ordering guarantees, and on
> ARM or power you can still avoid the extra synchronization *if* the
> value is used just for computation and for following pointers, but if
> the value is used for a comparison, there needs to be a
> synchronization barrier.
> 
> Notice? Getting rid of the stupid "carries-dependency" crap from the
> standard actually
>  (a) simplifies the standard

Agreed, although it's easy to ignore the parts related to mo_consume, I
think.

>  (b) means that the above obvious example *works*
>  (c) does not in *any* way make for any less efficient code generation
> for the cases that "consume" works correctly for in the current
> mis-designed standard.
>  (d) is actually a hell of a lot easier to explain to a compiler
> writer, and I can guarantee that it is simpler to implement too.

mo_acquire is certainly easier to implement in a compiler.

> Why do I claim (d) "it is simpler to implement" - because on ARM/power
> you can implement it *exactly* as a special "acquire", with just a
> trivial peep-hole special case that follows the use chain of the
> acquire op to the consume, and then just drop the acquire bit if the
> only use is that compute-to-load chain.

That's similar to the way I saw it and described in my reply to your
other email (before getting to this email here).  It seems that this
indeed might be doable transparently in the compiler, without requiring
a special mo_acquire variant visible to programmers.

> In fact, realistically, the *only* thing you need to actually care
> about for the intended use case of "consume" is the question "is the
> consuming load immediately consumed as an address (with offset) of a
> memory operation. So you don't even need to follow any complicated
> computation chain in a compiler - the only case that matters for the
> barrier removal optimization is the "oh, I can see that it is only
> used as an address to a dereference".

To make this mo_acquire optimization apply often, a compiler might have
to try to filter out accesses that don't synchronize (e.g., so that an
access to a non-shared temporary variable doesn't prevent the
optimization).

> Seriously. The current standard is broken.

Please, let's be precise in such statements, so that everyone actually
knows what's meant.  The rest of the memory model can be perfectly 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Thu, 2014-02-20 at 09:34 -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel  wrote:
> >>
> >> So the clarification is basically to the statement that the "if
> >> (consume(p)) a" version *would* have an ordering guarantee between the
> >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> >> such an ordering guarantee. Yes?
> >
> > Not as I understand it.  If my reply above wasn't clear, let me know and
> > I'll try to rephrase it into something that is.
> 
> Yeah, so you and Paul agree. And as I mentioned in the email that
> crossed with yours, I think that means that the standard is overly
> complex, hard to understand, fragile, and all *pointlessly* so.

Let's step back a little here and distinguish different things:

1) AFAICT, mo_acquire provides all the ordering guarantees you want.
Thus, I suggest focusing on mo_acquire for now, especially when reading
the model.  Are you fine with the mo_acquire semantics?

2) mo_acquire is independent of carries-a-dependency and similar
definitions/relations, so you can ignore all those to reason about
mo_acquire.

3) Do you have concerns over the runtime costs of barriers with
mo_acquire semantics?  If so, that's a valid discussion point, and we
can certainly dig deeper into this topic to see how we can possibly use
weaker HW barriers by exploiting things the compiler sees and can
potentially preserve (e.g., control dependencies).  There might be some
stuff the compiler can do without needing further input from the
programmer.

4) mo_consume is kind of the difficult special-case variant of
mo_acquire.  We should discuss it (including whether it's helpful)
separately from the memory model, because it's not essential.

> Btw, there are many ways that "use a consume as an input to a
> conditional" can happen. In particular, even if the result is actually
> *used* like a pointer as far as the programmer is concerned, tricks
> like pointer compression etc can well mean that the "pointer" is
> actually at least partly implemented using conditionals, so that some
> paths end up being only dependent through a comparison of the pointer
> value.

AFAIU, this is similar to my concerns about how a compiler can
reasonably implement the ordering guarantees: The mo_consume value may
be used like a pointer in source code, but how this looks in the
generated code may be different after reasonable transformations (including
transformations to control dependencies, so the compiler would have to
avoid those).  Or did I misunderstand?

> So I very much did *not* want to complicate the "litmus test" code
> snippet when Paul tried to make it more complex, but I do think that
> there are cases where code that "looks" like pure pointer chasing
> actually is not for some cases, and then can become basically that
> litmus test for some path.
> 
> Just to give you an example: in simple list handling it is not at all
> unusual to have a special node that is discovered by comparing the
> address, not by just loading from the pointer and following the list
> itself. Examples of that would be a HEAD node in a doubly linked list
> (Linux uses this concept quite widely, it's our default list
> implementation), or it could be a place-marker ("cursor" entry) in the
> list for safe traversal in the presence of concurrent deletes etc.
> 
> And obviously there is the already much earlier mentioned
> compiler-induced compare, due to value speculation, that can basically
> create such sequences even where they did not originally exist in the
> source code itself.
> 
> So even if you work with "pointer dereferences", and thus match that
> particular consume pattern, I really don't see why anybody would think
> that "hey, we can ignore any control dependencies" is a good idea.
> It's a *TERRIBLE* idea.
> 
> And as mentioned, it's a terrible idea with no upsides. It doesn't
> help compiler optimizations for the case it's *intended* to help,
> since those optimizations can still be done without the horribly
> broken semantics. It doesn't help compiler writers, it just confuses
> them.

I'm worried about how compilers can implement mo_consume without
prohibiting lots of optimizations on the code.  Your thoughts seem to
point in a similar direction.

I think we should continue by discussing mo_acquire first.  It has
easier semantics and allows a relatively simple implementation in
compilers (although there might be not-quite-so-simple optimizations).  

It's unfortunate we started the discussion with the tricky special case
first; maybe that's what contributed to the confusion.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote:
> Well, all the compilers currently convert consume to acquire, so you have
> your wish there.  Of course, that also means that they generate actual
> unneeded memory-barrier instructions, which seems extremely sub-optimal
> to me.

GCC doesn't currently, but it also doesn't seem to track the
dependencies, but that's a bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Wed, 2014-02-19 at 20:43 -0800, Linus Torvalds wrote:

[Paul has already answered many of your questions, and my reply to your
previous email should also answer some.]

> If the consumer of an atomic load isn't a pointer chasing operation,
> then the consume should be defined to be the same as acquire. None of
> this "conditionals break consumers". No, conditionals on the
> dependency path should turn consumers into acquire, because otherwise
> the "consume" load is dangerous as hell.

Yes, mo_consume is more tricky than mo_acquire.

However, that has an advantage because you can avoid getting stronger
barriers if you don't need them (ie, you can avoid the "auto-upgrade to
acquire" you seem to have in mind).

The auto-upgrade would be a possible semantics, I agree.

Another option may be to let an implementation optimize the HW barriers
that it uses for mo_acquire.  That is, if the compiler sees that (1) the
result of an mo_acquire load is used on certain code paths *only* for
consumers that carry the dependency *and* (2) there are no other
operations on that code path that can possibly rely on the mo_acquire
ordering guarantees, then the compiler can use a weaker HW barrier on
archs such as PowerPC or ARM.

That is similar to the rules for mo_consume you seem to have in mind,
but makes it a compiler optimization on mo_acquire.  However, the
compiler has to be conservative here, so having a mo_consume that is
trickier to use but doesn't ever silently introduce stronger HW barriers
seems to be useful (modulo the difficulties regarding to how it's
currently specified in the standard).

> And if the definition of acquire doesn't include the control
> dependency either, then the C atomic memory model is just completely
> and utterly broken, since the above *trivial* and clearly useful
> example is broken.

In terms of the model, if you establish a synchronizes-with using a
reads-from that has (or crosses) a release/acquire memory-order pair,
then this synchronizes-with will also order other operations that it's
sequenced-before (see the composition of synchronizes-with and
sequenced-before in the inter-thread-happens-before definition in n3132
6.15).

So yes, mo_acquire does take the logical "control dependencies" /
sequenced-before into account.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel  wrote:
> > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > >>
> > >> Can you point to it? Because I can find a draft standard, and it sure
> > >> as hell does *not* contain any clarity of the model. It has a *lot* of
> > >> verbiage, but it's pretty much impossible to actually understand, even
> > >> for somebody who really understands memory ordering.
> > >
> > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > This has an explanation of the model up front, and then the detailed
> > > formulae in Section 6.  This is from 2010, and there might have been
> > > smaller changes since then, but I'm not aware of any bigger ones.
> > 
> > Ahh, this is different from what others pointed at. Same people,
> > similar name, but not the same paper.
> > 
> > I will read this version too, but from reading the other one and the
> > standard in parallel and trying to make sense of it, it seems that I
> > may have originally misunderstood part of the whole control dependency
> > chain.
> > 
> > The fact that the left side of "? :", "&&" and "||" breaks data
> > dependencies made me originally think that the standard tried very
> > hard to break any control dependencies. Which I felt was insane, when
> > then some of the examples literally were about the testing of the
> > value of an atomic read. The data dependency matters quite a bit. The
> > fact that the other "Mathematical" paper then very much talked about
> > consume only in the sense of following a pointer made me think so even
> > more.
> > 
> > But reading it some more, I now think that the whole "data dependency"
> > logic (which is where the special left-hand side rule of the ternary
> > and logical operators come in) are basically an exception to the rule
> > that sequence points end up being also meaningful for ordering (ok, so
> > C11 seems to have renamed "sequence points" to "sequenced before").
> > 
> > So while an expression like
> > 
> > atomic_read(p, consume) ? a : b;
> > 
> > doesn't have a data dependency from the atomic read that forces
> > serialization, writing
> > 
> >if (atomic_read(p, consume))
> >   a;
> >else
> >   b;
> > 
> > the standard *does* imply that the atomic read is "happens-before" wrt
> > "a", and I'm hoping that there is no question that the control
> > dependency still acts as an ordering point.
> 
> The control dependency should order subsequent stores, at least assuming
> that "a" and "b" don't start off with identical stores that the compiler
> could pull out of the "if" and merge.  The same might also be true for ?:
> for all I know.  (But see below)

I don't think this is quite true.  I agree that a conditional store will
not be executed speculatively (note that if it would happen in both the
then and the else branch, it's not conditional); so, the store in
"a;" (assuming it would be a store) won't happen unless the thread can
really observe a true value for p.  However, this is *this thread's*
view of the world, but not guaranteed to constrain how any other thread
sees the state.  mo_consume does not contribute to
inter-thread-happens-before in the same way that mo_acquire does (which
*does* put a constraint on i-t-h-b, and thus enforces a global
constraint that all threads have to respect).

Is it clear which distinction I'm trying to show here?

> That said, in this case, you could substitute relaxed for consume and get
> the same effect.  The return value from atomic_read() gets absorbed into
> the "if" condition, so there is no dependency-ordered-before relationship,
> so nothing for consume to do.
> 
> One caution...  The happens-before relationship requires you to trace a
> full path between the two operations of interest.  This is illustrated
> by the following example, with both x and y initially zero:
> 
> T1:   atomic_store_explicit(&x, 1, memory_order_relaxed);
>   r1 = atomic_load_explicit(&y, memory_order_relaxed);
> 
> T2:   atomic_store_explicit(&y, 1, memory_order_relaxed);
>   r2 = atomic_load_explicit(&x, memory_order_relaxed);
> 
> There is a happens-before relationship between T1's load and store,
> and another happens-before relationship between T2

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Torvald Riegel
On Wed, 2014-02-19 at 16:53 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel  wrote:
> > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> >>
> >> Can you point to it? Because I can find a draft standard, and it sure
> >> as hell does *not* contain any clarity of the model. It has a *lot* of
> >> verbiage, but it's pretty much impossible to actually understand, even
> >> for somebody who really understands memory ordering.
> >
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > This has an explanation of the model up front, and then the detailed
> > formulae in Section 6.  This is from 2010, and there might have been
> > smaller changes since then, but I'm not aware of any bigger ones.
> 
> Ahh, this is different from what others pointed at. Same people,
> similar name, but not the same paper.
> 
> I will read this version too, but from reading the other one and the
> standard in parallel and trying to make sense of it, it seems that I
> may have originally misunderstood part of the whole control dependency
> chain.
> 
> The fact that the left side of "? :", "&&" and "||" breaks data
> dependencies made me originally think that the standard tried very
> hard to break any control dependencies.

Hmm... (I'm not quite sure how to express this properly, so bear with
me.)

The control dependencies as you seem to consider them aren't there *as is*,
and removing "? :", "&&", and "||" from what's called
Carries-a-dependency-to in n3132 is basically removing control
dependences in expressions from that relation (ie, so it's just data
dependences).

However, control dependences in the program are not ignored, because
they control what steps the abstract machine would do.  If we have
candidate executions with a particular choice for the reads-from
relation, they must make sense in terms of program logic (which does
include conditionals and such), or a candidate execution will not be
consistent.  The difference is in how reads-from is constrained by
different memory orders.

I hope what I mean gets clearer below when looking at the example...

Mark Batty / Peter Sewell:  Do you have a better explanation?  (And
please correct anything wrong in what I wrote ;)

> Which I felt was insane, when
> then some of the examples literally were about the testing of the
> value of an atomic read. The data dependency matters quite a bit. The
> fact that the other "Mathematical" paper then very much talked about
> consume only in the sense of following a pointer made me think so even
> more.
> 
> But reading it some more, I now think that the whole "data dependency"
> logic (which is where the special left-hand side rule of the ternary
> and logical operators come in) are basically an exception to the rule
> that sequence points end up being also meaningful for ordering (ok, so
> C11 seems to have renamed "sequence points" to "sequenced before").

I'm not quite sure what you mean, but I think that this isn't quite the
case.  mo_consume and mo_acquire combine with sequenced-before
differently.  Have a look at the definition of
inter-thread-happens-before (6.15 in n3132): for synchronizes-with, the
composition with sequenced-before is used (ie, what you'd get with
mo_acquire), whereas for mo_consume, we just have
dependency_ordered_before (so, data dependencies only).

> So while an expression like
> 
> atomic_read(p, consume) ? a : b;
> 
> doesn't have a data dependency from the atomic read that forces
> serialization, writing
> 
>if (atomic_read(p, consume))
>   a;
>else
>   b;
> 
> the standard *does* imply that the atomic read is "happens-before" wrt
> "a", and I'm hoping that there is no question that the control
> dependency still acts as an ordering point.

I don't think it does for mo_consume (but does for mo_acquire).  Both
examples do basically the same thing in the abstract machine, and I think
it would be surprising if rewriting the code into something seemingly
equivalent made a semantic difference.  AFAICT, "? :" etc. is
removed to allow the standard defining data dependencies as being based
on expressions that take the mo_consume value as input, without also
pulling in control dependencies.

To look at this example in the cppmem tool, we can transform your
example above into:

int main() {
  atomic_int x = 0; 
  int y = 0;
  {{{ { y = 1; x.store(1, memory_order_release); }
  ||| { if (x.load(memory_order_consume)) r2 = y; }
  }}};
  return 0; }

This has 2 threads, one is doing the store to y (the data) and the
release operation, the other is using consume

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Wed, 2014-02-19 at 07:23 -0800, David Lang wrote:
> On Tue, 18 Feb 2014, Torvald Riegel wrote:
> 
> > On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
> >> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> >>> Well, that's how atomics that aren't volatile are defined in the
> >>> standard.  I can see that you want something else too, but that doesn't
> >>> mean that the other thing is broken.
> >>
> >> Well that other thing depends on being able to see the entire program at
> >> compile time. PaulMck already listed various ways in which this is
> >> not feasible even for normal userspace code.
> >>
> >> In particular; DSOs and JITs were mentioned.
> >
> > No it doesn't depend on whole-program analysis being possible.  Because
> > if it isn't, then a correct compiler will just not do certain
> > optimizations simply because it can't prove properties required for the
> > optimization to hold.  With the exception of access to objects via magic
> > numbers (e.g., fixed and known addresses (see my reply to Paul), which
> > are outside of the semantics specified in the standard), I don't see a
> > correctness problem here.
> 
> Are you really sure that the compiler can figure out every possible thing 
> that a 
> loadable module or JITed code can access? That seems like a pretty strong 
> claim.

If the other code can be produced by a C translation unit that is valid
to be linked with the rest of the program, then I'm pretty sure the
compiler has a well-defined notion of whether it does or does not see
all other potential accesses.  IOW, if the C compiler is dealing with C
semantics and mechanisms only (including the C mechanisms for sharing
with non-C code!), then it will know what to do.

If you're playing tricks behind the C compiler's back using
implementation-defined stuff outside of the C specification, then
there's nothing the compiler really can do.  For example, if you're
trying to access a variable on a function's stack from some other
function, you better know how the register allocator of the compiler
operates.  In contrast, if you let this function simply export the
address of the variable to some external place, all will be fine.

The documentation of GCC's -fwhole-program and -flto might also be
interesting for you.  GCC wouldn't need to have -fwhole-program if it
weren't conservative by default (correctly so).



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:
> > On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel 
> > > > > >  wrote:
> > > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > > > >> And exactly because I know enough, I would *really* like atomics 
> > > > > > >> to be
> > > > > > >> well-defined, and have very clear - and *local* - rules about 
> > > > > > >> how they
> > > > > > >> can be combined and optimized.
> > > > > > >
> > > > > > > "Local"?
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > So I think that one of the big advantages of atomics over volatile 
> > > > > > is
> > > > > > that they *can* be optimized, and as such I'm not at all against
> > > > > > trying to generate much better code than for volatile accesses.
> > > > > > 
> > > > > > But at the same time, that can go too far. For example, one of the
> > > > > > things we'd want to use atomics for is page table accesses, where it
> > > > > > is very important that we don't generate multiple accesses to the
> > > > > > values, because parts of the values can be change *by*hardware* (ie
> > > > > > accessed and dirty bits).
> > > > > > 
> > > > > > So imagine that you have some clever global optimizer that sees that
> > > > > > the program never ever actually sets the dirty bit at all in any
> > > > > > thread, and then uses that kind of non-local knowledge to make
> > > > > > optimization decisions. THAT WOULD BE BAD.
> > > > > 
> > > > > Might as well list other reasons why value proofs via whole-program
> > > > > analysis are unreliable for the Linux kernel:
> > > > > 
> > > > > 1.As Linus said, changes from hardware.
> > > > 
> > > > This is what's volatile is for, right?  (Or the weak-volatile idea I
> > > > mentioned).
> > > > 
> > > > Compilers won't be able to prove something about the values of such
> > > > variables, if marked (weak-)volatile.
> > > 
> > > Yep.
> > > 
> > > > > 2.Assembly code that is not visible to the compiler.
> > > > >   Inline asms will -normally- let the compiler know what
> > > > >   memory they change, but some just use the "memory" tag.
> > > > >   Worse yet, I suspect that most compilers don't look all
> > > > >   that carefully at .S files.
> > > > > 
> > > > >   Any number of other programs contain assembly files.
> > > > 
> > > > Are the annotations of changed memory really a problem?  If the "memory"
> > > > tag exists, isn't that supposed to mean all memory?
> > > > 
> > > > To make a proof about a program for location X, the compiler has to
> > > > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > > > the compiler will simply not be able to prove a thing (except maybe due
> > > > to the data-race-free requirement for non-atomics).  The attempt to
> > > > prove something isn't unreliable, simply because a correct compiler
> > > > won't claim to be able to "prove" something.
> > > 
> > > I am indeed less worried about inline assembler than I am about files
> > > full of assembly.  Or files full of other languages.
> > > 
> > > > One reason that could corrupt this is that if program addresses objects
> > > > other than through the mechanisms defined in the language.  For example,
> > > > if one thread lays out a data structure at a constant fixed memory

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 14:14 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel  wrote:
> >>
> >> So imagine that you have some clever global optimizer that sees that
> >> the program never ever actually sets the dirty bit at all in any
> >> thread, and then uses that kind of non-local knowledge to make
> >> optimization decisions. THAT WOULD BE BAD.
> >>
> >> Do you see what I'm aiming for?
> >
> > Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> > boundaries of the abstract machine, and thus is input/output.  Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
> 
> The "hardware can change things" case is indeed pretty rare.
> 
> But quite frankly, even when it isn't hardware, as far as the compiler
> is concerned you have the exact same issue - you have TLB faults
> happening on other CPU's that do the same thing asynchronously using
> software TLB fault handlers. So *semantically*, it really doesn't make
> any difference what-so-ever if it's a software TLB handler on another
> CPU, a microcoded TLB fault, or an actual hardware path.

I think there are a few semantic differences:

* If a SW handler uses the C11 memory model, it will synchronize like
any other thread.  HW might do something else entirely, including
synchronizing differently, not using atomic accesses, etc.  (At least
that's the constraints I had in mind).

* If we can treat any interrupt handler like Just Another Thread, then
the next question is whether the compiler will be aware that there is
another thread.  I think that in practice it will be: You'll set up the
handler in some way by calling a function the compiler can't analyze, so
the compiler will know that stuff accessible to the handler (e.g.,
global variables) will potentially be accessed by other threads. 

* Similarly, if the C code is called from some external thing, it also
has to assume the presence of other threads.  (Perhaps this is what the
compiler has to assume in a freestanding implementation anyway...)

However, accessibility will be different for, say, stack variables that
haven't been shared with other functions yet; those are arguably not
reachable by other things, at least not through mechanisms defined by
the C standard.  So optimizing these should be possible with the
assumption that there is no other thread (at least as default -- I'm not
saying that this is the only reasonable semantics).

> So if the answer for all of the above is "use volatile", then I think
> that means that the C11 atomics are badly designed.
> 
> The whole *point* of atomic accesses is that stuff like above should
> "JustWork(tm)"

I think that it should in the majority of cases.  If the other thing
potentially accessing the data can do as much as a valid C11 thread, the
synchronization itself will work just fine.  In most cases except the
(void*)0x123 example (or linker scripts etc.) the compiler is aware when
data is made visible to other threads or other non-analyzable functions
that may spawn other threads (or just by being a plain global variable
accessible to other (potentially .S) translation units).

> > Do you perhaps want a weaker form of volatile?  That is, one that, for
> > example, allows combining of two adjacent loads of the dirty bits, but
> > will make sure that this is treated as if there is some imaginary
> > external thread that it cannot analyze and that may write?
> 
> Yes, that's basically what I would want. And it is what I would expect
> an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
> bit of a misnomer, because technically we really generally want
> "ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
> volatile, and is a hell of a lot simpler to write ;^).
> 
> So we obviously use "volatile" for this currently, and generally the
> semantics we really want are:
> 
>  - the load or store is done as a single access ("atomic")
> 
>  - the compiler must not try to re-materialize the value by reloading
> it from memory (this is the "at most once" part)

In the presence of other threads performing operations unknown to the
compiler, that's what you should get even if the compiler is trying to
optimize C11 atomics.  The first requirement is clear, and the "at most
once" follows from another thread potentially writing to the variable.

The only difference I can see right now is that a compiler may be able
to *prove* that it doesn't matter whether it reloaded the value or not.
But this seems very 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 22:47 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> > boundaries of the abstract machine, and thus is input/output.  Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
> 
> Its not only hardware; also the kernel/user boundary has this same
> problem. We cannot a-priory say what userspace will do; in fact, because
> we're a general purpose OS, we must assume it will willfully try its
> bestest to wreck whatever assumptions we make about its behaviour.

That's a good note, and I think a distinct case from those below,
because here you're saying that you can't assume that the userspace code
follows the C11 semantics ...

> We also have loadable modules -- much like regular userspace DSOs -- so
> there too we cannot say what will or will not happen.
> 
> We also have JITs that generate code on demand.

.. whereas for those, you might assume that the other code follows C11
semantics and the same ABI, which makes this just a normal case that is
already handled (see my other replies nearby in this thread).

> And I'm absolutely sure (with the exception of the JITs, its not an area
> I've worked on) that we have atomic usage across all those boundaries.

That would be fine as long as all involved parties use the same memory
model and ABI to implement it.

(Of course, I'm assuming here that the compiler is aware of sharing with
other entities, which is always the case except in those corner cases
like accesses to (void*)0x123 magically aliasing with something else).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  
> > > > wrote:
> > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > >> And exactly because I know enough, I would *really* like atomics to 
> > > > >> be
> > > > >> well-defined, and have very clear - and *local* - rules about how 
> > > > >> they
> > > > >> can be combined and optimized.
> > > > >
> > > > > "Local"?
> > > > 
> > > > Yes.
> > > > 
> > > > So I think that one of the big advantages of atomics over volatile is
> > > > that they *can* be optimized, and as such I'm not at all against
> > > > trying to generate much better code than for volatile accesses.
> > > > 
> > > > But at the same time, that can go too far. For example, one of the
> > > > things we'd want to use atomics for is page table accesses, where it
> > > > is very important that we don't generate multiple accesses to the
> > > > values, because parts of the values can be change *by*hardware* (ie
> > > > accessed and dirty bits).
> > > > 
> > > > So imagine that you have some clever global optimizer that sees that
> > > > the program never ever actually sets the dirty bit at all in any
> > > > thread, and then uses that kind of non-local knowledge to make
> > > > optimization decisions. THAT WOULD BE BAD.
> > > 
> > > Might as well list other reasons why value proofs via whole-program
> > > analysis are unreliable for the Linux kernel:
> > > 
> > > 1. As Linus said, changes from hardware.
> > 
> > This is what volatile is for, right?  (Or the weak-volatile idea I
> > mentioned).
> > 
> > Compilers won't be able to prove something about the values of such
> > variables, if marked (weak-)volatile.
> 
> Yep.
> 
> > > 2. Assembly code that is not visible to the compiler.
> > >   Inline asms will -normally- let the compiler know what
> > >   memory they change, but some just use the "memory" tag.
> > >   Worse yet, I suspect that most compilers don't look all
> > >   that carefully at .S files.
> > > 
> > >   Any number of other programs contain assembly files.
> > 
> > Are the annotations of changed memory really a problem?  If the "memory"
> > tag exists, isn't that supposed to mean all memory?
> > 
> > To make a proof about a program for location X, the compiler has to
> > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > the compiler will simply not be able to prove a thing (except maybe due
> > to the data-race-free requirement for non-atomics).  The attempt to
> > prove something isn't unreliable, simply because a correct compiler
> > won't claim to be able to "prove" something.
> 
> I am indeed less worried about inline assembler than I am about files
> full of assembly.  Or files full of other languages.
> 
> > One thing that could corrupt this is if the program addresses objects
> > other than through the mechanisms defined in the language.  For example,
> > if one thread lays out a data structure at a constant fixed memory
> > address, and another one then uses the fixed memory address to get
> > access to the object with a cast (e.g., (void*)0x123).
> 
> Or if the program uses gcc linker scripts to get the same effect.
> 
> > > 3. Kernel modules that have not yet been written.  Now, the
> > >   compiler could refrain from trying to prove anything about
> > >   an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > >   is currently no way to communicate this information to the
> > >   compiler other than marking the variable "volatile".
> > 
> > Even if the variable is just externally accessible, then the compiler
> > knows that it can't do whole-program analysis about it.
> > 
> > It is true that whole-program analysis will not be applicable in this
> > case, but it will not be unreliable.  I think that's an important
> > difference.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 22:52 +0100, Peter Zijlstra wrote:
> > > 4. Some drivers allow user-mode code to mmap() some of their
> > >   state.  Any changes undertaken by the user-mode code would
> > >   be invisible to the compiler.
> > 
> > A good point, but a compiler that doesn't try to (incorrectly) assume
> > something about the semantics of mmap will simply see that the mmap'ed
> > data will escape to stuff it can't analyze, so it will not be able to
> > make a proof.
> > 
> > This is different from, for example, malloc(), which is guaranteed to
> > return "fresh" nonaliasing memory.
> 
> The kernel side of this is different.. it looks like 'normal' memory, we
> just happen to allow it to end up in userspace too.
> 
> But on that point; how do you tell the compiler the difference between
> malloc() and mmap()? Is that some function attribute?

Yes:

malloc
The malloc attribute is used to tell the compiler that a
function may be treated as if any non-NULL pointer it returns
cannot alias any other pointer valid when the function returns
and that the memory has undefined content. This often improves
optimization. Standard functions with this property include
malloc and calloc. realloc-like functions do not have this
property as the memory pointed to does not have undefined
content.

I'm not quite sure whether GCC assumes malloc() to indeed be C's malloc
even if the function attribute isn't used, and/or whether that differs
for freestanding environments.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 23:48 +, Peter Sewell wrote:
> On 18 February 2014 20:43, Torvald Riegel  wrote:
> > On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies.  In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >
> > [...]
> >
> >> 2) otherwise, the language definition should prohibit it but the
> >>compiler would have to preserve dependencies even in compilation
> >>units that have no mention of atomics.  It's unclear what the
> >>(runtime and compiler development) cost of that would be in
> >>practice - perhaps Torvald could comment?
> >
> > If I'm reading the standard correctly, it requires that data
> > dependencies are preserved through loads and stores, including nonatomic
> > ones.  That sounds convenient because it allows programmers to use
> > temporary storage.
> 
> The standard only needs this for consume chains,

That's right, and the runtime cost / implementation problems of
mo_consume were what I was making statements about.  Sorry if that wasn't
clear.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Well, that's how atomics that aren't volatile are defined in the
> > standard.  I can see that you want something else too, but that doesn't
> > mean that the other thing is broken.
> 
> Well that other thing depends on being able to see the entire program at
> compile time. PaulMck already listed various ways in which this is
> not feasible even for normal userspace code.
> 
> In particular; DSOs and JITs were mentioned.

No, it doesn't depend on whole-program analysis being possible.  Because
if it isn't, then a correct compiler will just not do certain
optimizations simply because it can't prove properties required for the
optimization to hold.  With the exception of access to objects via magic
numbers (e.g., fixed and known addresses (see my reply to Paul), which
are outside of the semantics specified in the standard), I don't see a
correctness problem here.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > >> And exactly because I know enough, I would *really* like atomics to be
> > >> well-defined, and have very clear - and *local* - rules about how they
> > >> can be combined and optimized.
> > >
> > > "Local"?
> > 
> > Yes.
> > 
> > So I think that one of the big advantages of atomics over volatile is
> > that they *can* be optimized, and as such I'm not at all against
> > trying to generate much better code than for volatile accesses.
> > 
> > But at the same time, that can go too far. For example, one of the
> > things we'd want to use atomics for is page table accesses, where it
> > is very important that we don't generate multiple accesses to the
> > values, because parts of the values can be change *by*hardware* (ie
> > accessed and dirty bits).
> > 
> > So imagine that you have some clever global optimizer that sees that
> > the program never ever actually sets the dirty bit at all in any
> > thread, and then uses that kind of non-local knowledge to make
> > optimization decisions. THAT WOULD BE BAD.
> 
> Might as well list other reasons why value proofs via whole-program
> analysis are unreliable for the Linux kernel:
> 
> 1. As Linus said, changes from hardware.

This is what volatile is for, right?  (Or the weak-volatile idea I
mentioned).

Compilers won't be able to prove something about the values of such
variables, if marked (weak-)volatile.

> 2. Assembly code that is not visible to the compiler.
>   Inline asms will -normally- let the compiler know what
>   memory they change, but some just use the "memory" tag.
>   Worse yet, I suspect that most compilers don't look all
>   that carefully at .S files.
> 
>   Any number of other programs contain assembly files.

Are the annotations of changed memory really a problem?  If the "memory"
tag exists, isn't that supposed to mean all memory?

To make a proof about a program for location X, the compiler has to
analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
the compiler will simply not be able to prove a thing (except maybe due
to the data-race-free requirement for non-atomics).  The attempt to
prove something isn't unreliable, simply because a correct compiler
won't claim to be able to "prove" something.

One thing that could corrupt this is if the program addresses objects
other than through the mechanisms defined in the language.  For example,
if one thread lays out a data structure at a constant fixed memory
address, and another one then uses the fixed memory address to get
access to the object with a cast (e.g., (void*)0x123).

> 3. Kernel modules that have not yet been written.  Now, the
>   compiler could refrain from trying to prove anything about
>   an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
>   is currently no way to communicate this information to the
>   compiler other than marking the variable "volatile".

Even if the variable is just externally accessible, then the compiler
knows that it can't do whole-program analysis about it.

It is true that whole-program analysis will not be applicable in this
case, but it will not be unreliable.  I think that's an important
difference.

>   Other programs have similar issues, e.g., via dlopen().
> 
> 4. Some drivers allow user-mode code to mmap() some of their
>   state.  Any changes undertaken by the user-mode code would
>   be invisible to the compiler.

A good point, but a compiler that doesn't try to (incorrectly) assume
something about the semantics of mmap will simply see that the mmap'ed
data will escape to stuff it can't analyze, so it will not be able to
make a proof.

This is different from, for example, malloc(), which is guaranteed to
return "fresh" nonaliasing memory.

> 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/

This might be special, or not, depending on how the JITed code gets
access to data.  If this is via fixed addresses (e.g., (void*)0x123),
then see above.  If this is through function calls that the compiler
can't analyze, then this is like 4.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 08:49 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> And exactly because I know enough, I would *really* like atomics to be
> >> well-defined, and have very clear - and *local* - rules about how they
> >> can be combined and optimized.
> >
> > "Local"?
> 
> Yes.
> 
> So I think that one of the big advantages of atomics over volatile is
> that they *can* be optimized, and as such I'm not at all against
> trying to generate much better code than for volatile accesses.
> 
> But at the same time, that can go too far. For example, one of the
> things we'd want to use atomics for is page table accesses, where it
> is very important that we don't generate multiple accesses to the
> values, because parts of the values can be change *by*hardware* (ie
> accessed and dirty bits).
> 
> So imagine that you have some clever global optimizer that sees that
> the program never ever actually sets the dirty bit at all in any
> thread, and then uses that kind of non-local knowledge to make
> optimization decisions. THAT WOULD BE BAD.
> 
> Do you see what I'm aiming for?

Yes, I do.  But that seems to be "volatile" territory.  It crosses the
boundaries of the abstract machine, and thus is input/output.  Which
fraction of your atomic accesses can read values produced by hardware?
I would still suppose that lots of synchronization is not affected by
this.

Do you perhaps want a weaker form of volatile?  That is, one that, for
example, allows combining of two adjacent loads of the dirty bits, but
will make sure that this is treated as if there is some imaginary
external thread that it cannot analyze and that may write?

I'm trying to be careful here in distinguishing between volatile and
synchronization because I believe those are orthogonal, and this is a
good thing because it allows for more-than-conservatively optimized
code.

> Any optimization that tries to prove
> anything from more than local state is by definition broken, because
> it assumes that everything is described by the program.

Well, that's how atomics that aren't volatile are defined in the
standard.  I can see that you want something else too, but that doesn't
mean that the other thing is broken.

> But *local* optimizations are fine, as long as they follow the obvious
> rule of not actually making changes that are semantically visible.

If we assume that there is this imaginary thread called hardware that
can write/read to/from such weak-volatile atomics, I believe this should
restrict optimizations sufficiently even in the model as specified in
the standard.  For example, it would prevent a compiler from proving
that there is no access by another thread to a variable, so it would
prevent the cases in our discussion that you didn't want to get
optimized.  Yet, I believe the model itself could stay unmodified.

Thus, with this "weak volatile", we could let programmers request
precisely the semantics that they want, without using volatile too much
and without preventing optimizations more than necessary.  Thoughts?

> (In practice, I'd be impressed as hell for that particular example,
> and we actually do end up setting the dirty bit by hand in some
> situations, so the example is slightly made up, but there are other
> cases that might be more realistic in that sometimes we do things that
> are hidden from the compiler - in assembly etc - and the compiler
> might *think* it knows what is going on, but it doesn't actually see
> all accesses).

If something is visible to assembly, then the compiler should see this
(or at least be aware that it can't prove anything about such data).  Or
am I missing anything?

> > Sorry, but the rules *are* very clear.  I *really* suggest to look at
> > the formalization by Batty et al.  And in these rules, proving that a
> > read will always return value X has a well-defined meaning, and you can
> > use it.  That simply follows from how the model is built.
> 
> What "model"?

The C/C++ memory model was what I was referring to.

> That's the thing. I have tried to figure out whether the model is some
> abstract C model, or a model based on the actual hardware that the
> compiler is compiling for, and whether the model is one that assumes
> the compiler has complete knowledge of the system (see the example
> above).

It's a model as specified in the standard.  It's not parametrized by the
hardware the program will eventually run on (ignoring
implementation-defined behavior, timing, etc.).  The compiler has
complete knowledge of the system except for "volatiles" and things
coming in from the extern

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 18:21 +, Peter Sewell wrote:
> On 18 February 2014 17:38, Linus Torvalds  
> wrote:
> > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell  
> > wrote:
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >> void f(int ra, int*rb) {
> >>   if (ra==42)
> >> *rb=42;
> >>   else
> >> *rb=42;
> >> }
> >
> > So this is a great example, and in general I really like your page at:
> >
> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> , and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > and the reason I like your page is that it really talks about the
> > problem by pointing to the "unoptimized" code, and what hardware would
> > do.
> 
> Thanks.  It's certainly necessary to separately understand what compiler
> optimisation and the hardware might do, to get anywhere here.   But...
> 
> > As mentioned, I think that's actually the *correct* way to think about
> > the problem space, because it allows the programmer to take hardware
> > characteristics into account, without having to try to "describe" them
> > at a source level.
> 
> ...to be clear, I am ultimately after a decent source-level description of 
> what
> programmers can depend on, and we (Mark and I) view that page as
> identifying constraints on what that description can say.  There are too
> many compiler optimisations for people to reason directly in terms of
> the set of all transformations that they do, so we need some more
> concise and comprehensible envelope identifying what is allowed,
> as an interface between compiler writers and users.  AIUI that's basically
> what Torvald is arguing.

Yes, that's one reason.  Another one is that a programmer who actually
wants to use atomics in a machine-independent / portable way also does
not want to reason about how all those transformations might interact
with the machine's memory model.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.  In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.

[...]

> 2) otherwise, the language definition should prohibit it but the
>compiler would have to preserve dependencies even in compilation
>units that have no mention of atomics.  It's unclear what the
>(runtime and compiler development) cost of that would be in
>practice - perhaps Torvald could comment?

If I'm reading the standard correctly, it requires that data
dependencies are preserved through loads and stores, including nonatomic
ones.  That sounds convenient because it allows programmers to use
temporary storage.

However, what happens if a dependency "arrives" at a store for which the
alias set isn't completely known?  Then we either have to add a barrier
to enforce the ordering at this point, or we have to assume that all
other potentially aliasing memory locations would also have to start
carrying dependencies (which might be in other functions in other
compilation units).  Neither option is good.  The first might introduce
barriers in places in which they might not be required (or the
programmer has to use kill_dependency() quite often to avoid all these).
The second is bad because points-to analysis is hard, so in practice the
points-to set will not be precisely known for a lot of pointers.  So
this might not just creep into other functions via calls of
[[carries_dependency]] functions, but also through normal loads and
stores, likely prohibiting many optimizations.

Furthermore, the dependency tracking can currently only be
"disabled/enabled" on a function granularity (via
[[carries_dependency]]).  Thus, if we have big functions, then
dependency tracking may slow down a lot of code in the big function.  If
we have small functions, there's a lot of attributes to be added.

If a function may only carry a dependency but doesn't necessarily (eg,
depending on input parameters), then the programmer has to make a
trade-off whether he/she wants to benefit from mo_consume but slow down
other calls due to additional barriers (ie, when this function is called
from non-[[carries_dependency]] functions), or vice versa.  (IOW,
because of the function granularity, other code's performance is
affected.)

If a compiler wants to implement dependency tracking just for a few
constructs (e.g., operators -> + ...) and use barriers otherwise, then
this decision must be compatible with how all this is handled in other
compilation units.  Thus, compiler optimizations effectively become part
of the ABI, which doesn't seem right.

I hope these examples illustrate my concerns about the implementability
in practice of this.  It's also why I've suggested to move from an
opt-out approach as in the current standard (ie, with kill_dependency())
to an opt-in approach for conservative dependency tracking (e.g., with a
preserve_dependencies(exp) call, where exp will not be optimized in a
way that removes any dependencies).  This wouldn't help with many
optimizations being prevented, but it should at least help programmers
contain the problem to smaller regions of code.

I'm not aware of any implementation that tries to track dependencies, so
I can't give any real performance numbers.  This could perhaps be
simulated, but I'm not sure whether a realistic case would be made
without at least supporting [[carries_dependency]] properly in the
compiler, which would be some work.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 08:55 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote:
> > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel  
> > > wrote:
> > > >
> > > > There's an underlying problem here that's independent from the actual
> > > > instance that you're worried about here: "no sense" is ultimately a
> > > > matter of taste/objectives/priorities as long as the respective
> > > > specification is logically consistent.
> > > 
> > > Yes. But I don't think it's "independent".
> > > 
> > > Exactly *because* some people will read standards without applying
> > > "does the resulting code generation actually make sense for the
> > > programmer that wrote the code", the standard has to be pretty clear.
> > > 
> > > The standard often *isn't* pretty clear. It wasn't clear enough when
> > > it came to "volatile", and yet that was a *much* simpler concept than
> > > atomic accesses and memory ordering.
> > > 
> > > And most of the time it's not a big deal. But because the C standard
> > > generally tries to be very portable, and cover different machines,
> > > there tends to be a mindset that anything inherently unportable is
> > > "undefined" or "implementation defined", and then the compiler writer
> > > is basically given free reign to do anything they want (with
> > > "implementation defined" at least requiring that it is reliably the
> > > same thing).
> > 
> > Yes, that's how it works in general.  And this makes sense, because all
> > optimizations rely on that.  Second, you can't keep something consistent
> > (eg, between compilers) if it isn't specified.  So if we want stricter
> > rules, those need to be specified somewhere.
> > 
> > > And when it comes to memory ordering, *everything* is basically
> > > non-portable, because different CPU's very much have different rules.
> > 
> > Well, the current set of memory orders (and the memory model as a whole)
> > is portable, even though it might not allow to exploit all hardware
> > properties, and thus might perform sub-optimally in some cases.
> > 
> > > I worry that that means that the standard then takes the stance that
> > > "well, compiler re-ordering is no worse than CPU re-ordering, so we
> > > let the compiler do anything". And then we have to either add
> > > "volatile" to make sure the compiler doesn't do that, or use an overly
> > > strict memory model at the compiler level that makes it all pointless.
> > 
> > Using "volatile" is not a good option, I think, because synchronization
> > between threads should be orthogonal to observable output of the
> > abstract machine.
> 
> Are you thinking of "volatile" -instead- of atomics?  My belief is that
> given the current standard there will be times that we need to use
> "volatile" -in- -addition- to atomics.

No, I was talking about having to use "volatile" in addition to atomics
to get synchronization properties for the atomics.  ISTM that this would
be unfortunate because the objective for "volatile" (ie, declaring what
is output of the abstract machine) is orthogonal to synchronization
between threads.  So if we want to preserve control dependencies and get
ordering guarantees through that, I would rather not get them through
hacks based on volatile.

For example, if we have a macro or helper function that contains
synchronization code, but the compiler can prove that the macro/function
is used just on data only accessible to a single thread, then the
"volatile" notation on it would prevent optimizations.

> > The current memory model might not allow to exploit all hardware
> > properties, I agree.
> > 
> > But then why don't we work on how to extend it to do so?  We need to
> > specify the behavior we want anyway, and this can't be independent of
> > the language semantics, so it has to be conceptually integrated with the
> > standard anyway.
> > 
> > > So I really really hope that the standard doesn't give compiler
> > > writers free hands to do anything that they can prove is "equivalent"
> > > in the virtual C machine model.
> > 
> > It does, but it also doesn't mean this can't be extended.  So let's
> > focus on whether we can find an extension.
> > 
> > > That's not how you get reliable
> > > results.
> > 
> > In this general form, that's obviously a false claim.
> 
> These two sentences starkly illustrate the difference in perspective
> between you two.  You are talking past each other.  Not sure how to fix
> this at the moment, but what else is new?  ;-)

Well, we're still talking, and we are making little steps in clarifying
things, so there is progress :)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel  wrote:
> > The standard is clear on what's required.  I strongly suggest reading
> > the formalization of the memory model by Batty et al.
> 
> Can you point to it? Because I can find a draft standard, and it sure
> as hell does *not* contain any clarity of the model. It has a *lot* of
> verbiage, but it's pretty much impossible to actually understand, even
> for somebody who really understands memory ordering.

http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
This has an explanation of the model up front, and then the detailed
formulae in Section 6.  This is from 2010, and there might have been
smaller changes since then, but I'm not aware of any bigger ones.

The cppmem tool is based on this, so if you want to play around with a
few code examples, it's pretty nice because it shows you all allowed
executions, including graphs like in the paper (see elsewhere in this
thread for CAS syntax):
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 16:09 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 14:32 -0800,
> >
> >> Stop claiming it "can return 1".. It *never* returns 1 unless you do
> >> the load and *verify* it, or unless the load itself can be made to go
> >> away. And with the code sequence given, that just doesn't happen. END
> >> OF STORY.
> >
> > void foo()
> > {
> >   atomic_int x = 1;
> >   if (atomic_load(&x, mo_relaxed) == 1)
> >     atomic_store(&y, 3, mo_relaxed);
> > }
> 
> This is the very example I gave, where the real issue is not that "you
> prove that load returns 1", you instead say "store followed by a load
> can be combined".
> 
> I (in another email I just wrote) tried to show why the "prove
> something is true" is a very dangerous model.  Seriously, it's pure
> crap. It's broken.

I don't see anything dangerous in the example above with the language
semantics as specified: It's a well-defined situation, given the rules
of the language.  I replied to the other email you wrote with my
viewpoint on why the above is useful, how it compares to what you seem
to want, and where I think we need to start to bridge the gap.

> If the C standard defines atomics in terms of "provable equivalence",
> it's broken. Exactly because on a *virtual* machine you can prove
> things that are not actually true in a *real* machine.

For the control dependencies you have in mind, it's actually the other
way around.  You expect the real machine's properties in a program whose
semantics only give you the virtual machine's properties.  Anything you
prove on the virtual machine will be true on the real machine (in a
correct implementation) -- but you can't expect to have real-machine
properties in a language that's based on the virtual machine.

> I have the
> example of value speculation changing the memory ordering model of the
> actual machine.

This example is not true for the language as specified.  It is true for
a modified language that you have in mind, but for this one I've just
seen pretty rough rules so far.  Please see my other reply.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 19:42 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
>  wrote:
> >
> > As far as I can tell, the intent is that you can't do value
> > speculation (except perhaps for the "relaxed", which quite frankly
> > sounds largely useless).
> 
> Hmm. The language I see for "consume" is not obvious:
> 
>   "Consume operation: no reads in the current thread dependent on the
> value currently loaded can be reordered before this load"

I can't remember seeing that language in the standard (ie, C or C++).
Where is this from?

> and it could make a compiler writer say that value speculation is
> still valid, if you do it like this (with "ptr" being the atomic
> variable):
> 
>   value = ptr->val;

I assume the load from ptr has mo_consume ordering?

> into
> 
>   tmp = ptr;
>   value = speculated.value;
>   if (unlikely(tmp != &speculated))
> value = tmp->value;
> 
> which is still bogus. The load of "ptr" does happen before the load of
> "value = speculated->value" in the instruction stream, but it would
> still result in the CPU possibly moving the value read before the
> pointer read at least on ARM and power.

And surprise, in the C/C++ model the load from ptr is sequenced-before
the load from speculated, but there's no ordering constraint on the
reads-from relation for the value load if you use mo_consume on the ptr
load.  Thus, the transformed code has less ordering constraints than the
original code, and we arrive at the same outcome.

> So if you're a compiler person, you think you followed the letter of
> the spec - as far as *you* were concerned, no load dependent on the
> value of the atomic load moved to before the atomic load.

No.  Because the wobbly sentence you cited(?) above is not what the
standard says.

Would you please stop making claims about what compiler writers would do
or not if you seemingly aren't even familiar with the model that
compiler writers would use to reason about transformations?  Seriously?

> You go home,
> happy, knowing you've done your job. Never mind that you generated
> code that doesn't actually work.
> 
> I dread having to explain to the compiler person that he may be right
> in some theoretical virtual machine, but the code is subtly broken and
> nobody will ever understand why (and likely not be able to create a
> test-case showing the breakage).
> 
> But maybe the full standard makes it clear that "reordered before this
> load" actually means on the real hardware, not just in the generated
> instruction stream.

Do you think everyone else is stupid?  If there's an ordering constraint
in the virtual machine, it better be present when executing in the real
machine unless it provably cannot result in different output as
specified by the language's semantics.

> Reading it with understanding of the *intent* and
> understanding all the different memory models that requirement should
> be obvious (on alpha, you need an "rmb" instruction after the load),
> but ...

The standard is clear on what's required.  I strongly suggest reading
the formalization of the memory model by Batty et al.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel  wrote:
> > >
> > > Which example do you have in mind here?  Haven't we resolved all the
> > > debated examples, or did I miss any?
> > 
> > Well, Paul seems to still think that the standard possibly allows
> > speculative writes or possibly value speculation in ways that break
> > the hardware-guaranteed orderings.
> 
> It is not that I know of any specific problems, but rather that I
> know I haven't looked under all the rocks.  Plus my impression from
> my few years on the committee is that the standard will be pushed to
> the limit when it comes time to add optimizations.
> 
> One example that I learned about last week uses the branch-prediction
> hardware to validate value speculation.  And no, I am not at all a fan
> of value speculation, in case you were curious.  However, it is still
> an educational example.
> 
> This is where you start:
> 
>   p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
>   do_something(p->a, p->b, p->c);
>   p->d = 1;

I assume that's the source code.

> Then you leverage branch-prediction hardware as follows:
> 
>   p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
>   if (p == GUESS) {
>   do_something(GUESS->a, GUESS->b, GUESS->c);
>   GUESS->d = 1;
>   } else {
>   do_something(p->a, p->b, p->c);
>   p->d = 1;
>   }

I assume that this is a potential transformation by a compiler.

> The CPU's branch-prediction hardware squashes speculation in the case where
> the guess was wrong, and this prevents the speculative store to ->d from
> ever being visible.  However, the then-clause breaks dependencies, which
> means that the loads -could- be speculated, so that do_something() gets
> passed pre-initialization values.
> 
> Now, I hope and expect that the wording in the standard about dependency
> ordering prohibits this sort of thing.  But I do not yet know for certain.

The transformation would be incorrect.  p->a in the source code carries
a dependency, and as you say, the transformed code wouldn't have that
dependency any more.  So the transformed code would lose ordering
constraints that it has in the virtual machine; in the absence of
other proofs of correctness based on properties not shown in the
example, the transformed code would not result in the same behavior as
allowed by the abstract machine.

If the transformation would actually be by a programmer, then this
wouldn't do the same as the first example because mo_consume doesn't
work through the if statement.

Are there other specific concerns that you have regarding this example?



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel  wrote:
> >
> > There's an underlying problem here that's independent from the actual
> > instance that you're worried about here: "no sense" is ultimately a
> > matter of taste/objectives/priorities as long as the respective
> > specification is logically consistent.
> 
> Yes. But I don't think it's "independent".
> 
> Exactly *because* some people will read standards without applying
> "does the resulting code generation actually make sense for the
> programmer that wrote the code", the standard has to be pretty clear.
> 
> The standard often *isn't* pretty clear. It wasn't clear enough when
> it came to "volatile", and yet that was a *much* simpler concept than
> atomic accesses and memory ordering.
> 
> And most of the time it's not a big deal. But because the C standard
> generally tries to be very portable, and cover different machines,
> there tends to be a mindset that anything inherently unportable is
> "undefined" or "implementation defined", and then the compiler writer
> is basically given free rein to do anything they want (with
> "implementation defined" at least requiring that it is reliably the
> same thing).

Yes, that's how it works in general.  And this makes sense, because all
optimizations rely on that.  Second, you can't keep something consistent
(eg, between compilers) if it isn't specified.  So if we want stricter
rules, those need to be specified somewhere.

> And when it comes to memory ordering, *everything* is basically
> non-portable, because different CPU's very much have different rules.

Well, the current set of memory orders (and the memory model as a whole)
is portable, even though it might not allow exploiting all hardware
properties, and thus might perform sub-optimally in some cases.

> I worry that that means that the standard then takes the stance that
> "well, compiler re-ordering is no worse than CPU re-ordering, so we
> let the compiler do anything". And then we have to either add
> "volatile" to make sure the compiler doesn't do that, or use an overly
> strict memory model at the compiler level that makes it all pointless.

Using "volatile" is not a good option, I think, because synchronization
between threads should be orthogonal to observable output of the
abstract machine.

The current memory model might not allow exploiting all hardware
properties, I agree.

But then why don't we work on how to extend it to do so?  We need to
specify the behavior we want anyway, and this can't be independent of
the language semantics, so it has to be conceptually integrated with the
standard anyway.

> So I really really hope that the standard doesn't give compiler
> writers free hands to do anything that they can prove is "equivalent"
> in the virtual C machine model.

It does, but it also doesn't mean this can't be extended.  So let's
focus on whether we can find an extension.

> That's not how you get reliable
> results.

In this general form, that's obviously a false claim.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> And exactly because I know enough, I would *really* like atomics to be
> well-defined, and have very clear - and *local* - rules about how they
> can be combined and optimized.

"Local"?

> None of this "if you can prove that the read has value X" stuff. And
> things like value speculation should simply not be allowed, because
> that actually breaks the dependency chain that the CPU architects give
> guarantees for. Instead, make the rules be very clear, and very
> simple, like my suggestion. You can never remove a load because you
> can "prove" it has some value, but you can combine two consecutive
> atomic accesses.

Sorry, but the rules *are* very clear.  I *really* suggest to look at
the formalization by Batty et al.  And in these rules, proving that a
read will always return value X has a well-defined meaning, and you can
use it.  That simply follows from how the model is built.

What you seem to want just isn't covered by the model as it is today --
you can't infer from that that the model itself would be wrong.  The
dependency chains aren't modeled in the way you envision it (except in
what mo_consume tries, but that seems to be hard to implement); they are
there on the level of the program logic as modeled by the abstract
machine and the respective execution/output rules, but they are not
built to represent those specific ordering guarantees the HW gives you.

I would also be cautious claiming that the rules you suggested would be
very clear and very simple.  I haven't seen a memory model spec from you
that would be viable as the standard model for C/C++, nor have I seen
proof that this would actually be easier to understand for programmers
in general.

> For example, CPU people actually do tend to give guarantees for
> certain things, like stores that are causally related being visible in
> a particular order.

Yes, but that's not part of the model so far.  If you want to exploit
this, please make a suggestion for how to extend the model to cover
this.

You certainly expect compilers to actually optimize code, which usually
removes or reorders stuff.  Now, we could say that we just don't want
that for atomic accesses, but even this isn't as clear-cut: How do we
actually detect where non-atomic code is intended by the programmer to
express (1) a dependency that needs to be transformed into a dependency
in the generated code from (2) a dependency that is just program logic?

IOW, consider the mo_consume example "*(p + flag - flag)", where flag is
coming from a mo_consume load: Can you give a complete set of rules (for
a full program) for when the compiler is allowed to optimize out
flag-flag and when not?  (And it should be practically implementable in
a compiler, and not prevent optimizations where people expect them.)

> If the compiler starts doing value speculation on
> atomic accesses, you are quite possibly breaking things like that.
> It's just not a good idea. Don't do it. Write the standard so that it
> clearly is disallowed.

You never want value speculation for anything possibly originating from
an atomic load?  Or what are the concrete rules you have in mind?

> Because you may think that a C standard is machine-independent, but
> that isn't really the case. The people who write code still write code
> for a particular machine. Our code works (in the general case) on
> different byte orderings, different register sizes, different memory
> ordering models. But in each *instance* we still end up actually
> coding for each machine.

That's how *you* do it.  I'm sure you are aware that for lots of
programmers, having a machine-dependent standard is not helpful at all.
As the standard is written, code is portable.  It can certainly be
beneficial for programmers to optimize for different machines, but the
core of the memory model will always remain portable, because that's
what arguably most programmers need.  That doesn't prevent
machine-dependent extensions, but those then should be well integrated
with the rest of the model.

> So the rules for atomics should be simple and *specific* enough that
> when you write code for a particular architecture, you can take the
> architecture memory ordering *and* the C atomics orderings into
> account, and do the right thing for that architecture.

That would be an extension, but not viable as a *general* requirement as
far as the standard is concerned.

> And that very much means that doing things like value speculation MUST
> NOT HAPPEN. See? Even if you can prove that your code is "equivalent",
> it isn't.

I'm kind of puzzled why you keep stating generalizations of assertions
that clearly aren't well-founded (ie, which might be true for the kernel
or your wishes for how the world should be, but aren't true in general
or for the objectives of others).

It's not helpful to just pretend other people's opinions or worlds
didn't exist in discussions such as this one.  It's clear that the
standard

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 14:47 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
> >>
> >> The argument was that an lvalue doesn't actually "access" the memory
> >> (an rvalue does), so this:
> >>
> >>volatile int *p = ...;
> >>
> >>*p;
> >>
> >> doesn't need to generate a load from memory, because "*p" is still an
> >> lvalue (since you could assign things to it).
> >>
> >> This isn't an issue in C, because in C, expression statements are
> >> always rvalues, but C++ changed that.
> >
> > Huhh.  I can see the problems that this creates in terms of C/C++
> > compatibility.
> 
> That's not the biggest problem.
> 
> The biggest problem is that you have compiler writers that don't care
> about sane *use* of the features they write a compiler for, they just
> care about the standard.
> 
> So they don't care about C vs C++ compatibility. Even more
> importantly, they don't care about the *user* that uses only C++ and
> the fact that their reading of the standard results in *meaningless*
> behavior. They point to the standard and say "that's what the standard
> says, suck it", and silently generate code (or in this case, avoid
> generating code) that makes no sense.

There's an underlying problem here that's independent from the actual
instance that you're worried about here: "no sense" is ultimately a
matter of taste/objectives/priorities as long as the respective
specification is logically consistent.

If you want to be independent of your sanity being different from other
people's sanity (e.g., compiler writers), you need to make sure that the
specification is precise and says what you want.  IOW, think about the
specification being the program, and the people being computers; you
better want a well-defined program in this case.

> So it's not about C++ being incompatible with C, it's about C++ having
> insane and bad semantics unless you just admit that "oh, ok, I need to
> not just read the standard, I also need to use my brain, and admit
> that a C++ statement expression needs to act as if it is an "access"
> wrt volatile variables".

1) I agree that (IMO) a good standard strives to be easy to
understand.

2) In practice, there is a trade-off between "Easy to understand" and
actually producing a specification.  A standard is not a tutorial.  And
that's for good reason, because (a) there might be more than one way to
teach something and that should be allowed and (b) that the standard
should carry the full precision but still be compact enough to be
manageable.

3) Implementations can try to be nice to users by helping them avoiding
error-prone corner cases or such.  A warning for common problems is such
a case.  But an implementation has to draw a line somewhere, demarcating
cases where it fully exploits what the standard says (eg, to allow
optimizations) from cases where it is more conservative and does what
the standard allows but in a potentially more intuitive way.  That's
especially the case if it's being asked to produce high-performance
code.

4) There will be arguments for where the line actually is, simply
because different users will have different goals.

5) The way to reduce 4) is to either make the standard more specific, or
to provide better user documentation.  If the standard has strict
requirements, then there will be less misunderstanding.

6) To achieve 5), one way is to get involved in the standards process.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 14:32 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel  wrote:
> > On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote:
> >> >
> >> > if (atomic_load(&x, mo_relaxed) == 1)
> >> >   atomic_store(&y, 3, mo_relaxed));
> >>
> >> No, please don't use this idiotic example. It is wrong.
> >
> > It won't be useful in practice in a lot of cases, but that doesn't mean
> > it's wrong.  It's clearly not illegal code.  It also serves a purpose: a
> > simple example to reason about a few aspects of the memory model.
> 
> It's not illegal code, but if you claim that you can make that store
> unconditional, it's a pointless and wrong example.
> 
> >> The fact is, if a compiler generates anything but the obvious sequence
> >> (read/cmp/branch/store - where branch/store might obviously be done
> >> with some other machine conditional like a predicate), the compiler is
> >> wrong.
> >
> > Why?  I've reasoned why (1) to (3) above allow in certain cases (i.e.,
> > the first load always returning 1) for the branch (or other machine
> > conditional) to not be emitted.  So please either poke holes into this
> > reasoning, or clarify that you don't in fact, contrary to what you wrote
> > above, agree with (1) to (3).
> 
> The thing is, the first load DOES NOT RETURN 1. It returns whatever
> that memory location contains. End of story.

The memory location is just an abstraction for state, if it's not
volatile.

> Stop claiming it "can return 1".. It *never* returns 1 unless you do
> the load and *verify* it, or unless the load itself can be made to go
> away. And with the code sequence given, that just doesn't happen. END
> OF STORY.

void foo()
{
  atomic_int x = 1;
  if (atomic_load(&x, mo_relaxed) == 1)
    atomic_store(&y, 3, mo_relaxed);
}

This is a counter example to your claim, and yes, the compiler has proof
that x is 1.  It's deliberately simple, but I can replace this with
other more advanced situations.  For example, if x comes out of malloc
(or, on the kernel side, something else that returns non-aliasing
memory) and provably hasn't escaped to other threads yet.

I haven't posted this full example, but I've *clearly* said that *if*
the compiler can prove that the load would always return 1, it can
remove it.  And it's simple to see why that's the case: If this holds,
then in all allowed executions it would load from a known store, and
mo_relaxed gives no further ordering guarantees, so we can just take the
value, and we're good.

> So your argument is *shit*. Why do you continue to argue it?

Maybe because it isn't?  Maybe you should try to at least trust that my
intentions are good, even if distrusting my ability to reason.

> I told you how that load can go away, and you agreed. But IT CANNOT GO
> AWAY any other way. You cannot claim "the compiler knows". The
> compiler doesn't know. It's that simple.

Oh yes it can.  Because of the same rules that allow you to perform the
other transformations.  Please try to see the similarities here.  You
previously said you don't want to mix volatile semantics and atomics.
This is something that's being applied in this example.

> >> So why do I say you are wrong, after I just gave you an example of how
> >> it happens? Because my example went back to the *real* issue, and
> >> there are actual real semantically meaningful details with doing
> >> things like load merging.
> >>
> >> To give an example, let's rewrite things a bit more to use an extra 
> >> variable:
> >>
> >> atomic_store(&x, 1, mo_relaxed);
> >> a = atomic_load(&x, mo_relaxed);
> >> if (a == 1)
> >> atomic_store(&y, 3, mo_relaxed);
> >>
> >> which looks exactly the same.
> >
> > I'm confused.  Is this a new example?
> 
> That is a new example. The important part is that it has left a
> "trace" for the programmer: because 'a' contains the value, the
> programmer can now look at the value later and say "oh, we know we did
> a store iff a was 1"
> 
> >> This sequence:
> >>
> >> atomic_store(&x, 1, mo_relaxed);
> >> a = atomic_load(&x, mo_relaxed);
> >> atomic_store(&y, 3, mo_relaxed);
> >>
> >> is actually - and very seriously - buggy.
> >>
> >> Why? Because you have effectively split the atomic_load into two loads
> >> - one for the value of 'a', and one for your '

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote:
> >> and then it is read by people (compiler writers) that intentionally
> >> try to mis-use the words and do language-lawyering ("that depends on
> >> what the meaning of 'is' is").
> >
> > That assumption about people working on compilers is a little too broad,
> > don't you think?
> 
> Let's just say that *some* are that way, and those are the ones that I
> end up butting heads with.
> 
> The sane ones I never have to argue with - point them at a bug, and
> they just say "yup, bug". The insane ones say "we don't need to fix
> that, because if you read this copy of the standards that have been
> translated to chinese and back, it clearly says that this is
> acceptable".
> 
> >> The whole "lvalue vs rvalue expression
> >> vs 'what is a volatile access'" thing for C++ was/is a great example
> >> of that.
> >
> > I'm not aware of the details of this.
> 
> The argument was that an lvalue doesn't actually "access" the memory
> (an rvalue does), so this:
> 
>volatile int *p = ...;
> 
>*p;
> 
> doesn't need to generate a load from memory, because "*p" is still an
> lvalue (since you could assign things to it).
> 
> This isn't an issue in C, because in C, expression statements are
> always rvalues, but C++ changed that.

Huhh.  I can see the problems that this creates in terms of C/C++
compatibility.

> The people involved with the C++
> standards have generally been totally clueless about their subtle
> changes.

This isn't a fair characterization.  There are many people that do care,
and certainly not all are clueless.  But it's a limited set of people,
bugs happen, and not all of them will have the same goals.

I think one way to prevent such problems in the future could be to have
someone in the kernel community volunteer to look through standard
revisions before they are published.  The standard needs to be fixed,
because compilers need to conform to the standard (e.g., a compiler's
extension "fixing" the above wouldn't be conforming anymore because it
emits more volatile reads than specified).

Or maybe those of us working on the standard need to flag potential
changes of interest to the kernel folks.  But that may be less reliable
than someone from the kernel side looking at them; I don't know.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 14:14 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote:
> > On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" 
> >  wrote:
> > >On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> > >> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> > >> > You need volatile semantics to force the compiler to ignore any
> > >proofs
> > >> > it might otherwise attempt to construct.  Hence all the
> > >ACCESS_ONCE()
> > >> > calls in my email to Torvald.  (Hopefully I translated your example
> > >> > reasonably.)
> > >> 
> > >> My brain gave out for today; but it did appear to have the right
> > >> structure.
> > >
> > >I can relate.  ;-)
> > >
> > >> I would prefer it C11 would not require the volatile casts. It should
> > >> simply _never_ speculate with atomic writes, volatile or not.
> > >
> > >I agree with not needing volatiles to prevent speculated writes. 
> > >However,
> > >they will sometimes be needed to prevent excessive load/store
> > >combining.
> > >The compiler doesn't have the runtime feedback mechanisms that the
> > >hardware has, and thus will need help from the developer from time
> > >to time.
> > >
> > >Or maybe the Linux kernel simply waits to transition to C11 relaxed
> > >atomics
> > >until the compiler has learned to be sufficiently conservative in its
> > >load-store combining decisions.
> > 
> > Sounds backwards. Currently the compiler does nothing to the atomics. I'm 
> > sure we'll eventually add something. But if testing coverage is zero 
> > outside then surely things get worse, not better with time.
> 
> Perhaps we solve this chicken-and-egg problem by creating a test suite?

Perhaps.  The test suite might also be a good set of examples showing
which cases we expect to be optimized in a certain way, and which not.
I suppose the uses of (the equivalent) of atomics in the kernel would be
a good start.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote:
> On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel  wrote:
> >
> > I think the example is easy to misunderstand, because the context isn't
> > clear.  Therefore, let me first try to clarify the background.
> >
> > (1) The abstract machine does not write speculatively.
> > (2) Emitting a branch instruction and executing a branch at runtime is
> > not part of the specified behavior of the abstract machine.  Of course,
> > the abstract machine performs conditional execution, but that just
> > specifies the output / side effects that it must produce (e.g., volatile
> > stores) -- not with which hardware instructions it is producing this.
> > (3) A compiled program must produce the same output as if executed by
> > the abstract machine.
> 
> Ok, I'm fine with that.
> 
> > Thus, we need to be careful what "speculative store" is meant to refer
> > to.  A few examples:
> >
> > if (atomic_load(&x, mo_relaxed) == 1)
> > >   atomic_store(&y, 3, mo_relaxed);
> 
> No, please don't use this idiotic example. It is wrong.

It won't be useful in practice in a lot of cases, but that doesn't mean
it's wrong.  It's clearly not illegal code.  It also serves a purpose: a
simple example to reason about a few aspects of the memory model.
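For concreteness, the example can be spelled out as compilable C11 (a sketch: the thread's `mo_relaxed` shorthand is assumed to map to `memory_order_relaxed`, and the variable and function names are illustrative):

```c
#include <stdatomic.h>

atomic_int x;
atomic_int y;

/* The example from the discussion, in standard C11 syntax. */
void conditional_store(void)
{
    if (atomic_load_explicit(&x, memory_order_relaxed) == 1)
        atomic_store_explicit(&y, 3, memory_order_relaxed);
}

/* Single-threaded check of the abstract-machine semantics: whenever
   x == 1, y must end up as 3; whether the compiled code contains a
   branch is not part of the observable behavior. */
int observed_y_after(int xval)
{
    atomic_store_explicit(&x, xval, memory_order_relaxed);
    atomic_store_explicit(&y, 0, memory_order_relaxed);
    conditional_store();
    return atomic_load_explicit(&y, memory_order_relaxed);
}
```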

> The fact is, if a compiler generates anything but the obvious sequence
> (read/cmp/branch/store - where branch/store might obviously be done
> with some other machine conditional like a predicate), the compiler is
> wrong.

Why?  I've reasoned why (1) to (3) above allow, in certain cases (i.e.,
the first load always returning 1), for the branch (or other machine
conditional) to not be emitted.  So please either poke holes in this
reasoning, or clarify that you don't, in fact, agree with (1) to (3),
contrary to what you wrote above.

> Anybody who argues anything else is wrong, or confused, or confusing.

I appreciate your opinion, and maybe I'm just one of the three things
above (my vote is on "confusing").  But without you saying why, I can't
see what the misunderstanding is here.

> Instead, argue about *other* sequences where the compiler can do something.

I'd prefer if we could clarify the misunderstanding for the simple case
first that doesn't involve stronger ordering requirements in the form of
non-relaxed MOs.

> For example, this sequence:
> 
>atomic_store(&x, a, mo_relaxed);
>b = atomic_load(&x, mo_relaxed);
> 
> can validly be transformed to
> 
>atomic_store(&x, a, mo_relaxed);
>b = (typeof(x)) a;
> 
> and I think everybody agrees about that. In fact, that optimization
> can be done even for mo_strict.

Yes.
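The quoted transformation can be sketched in standard C11 as follows (illustrative names; the rewrite is valid because a relaxed load is allowed to read the thread's own immediately preceding store, so no data-race-free program can distinguish the two functions in that respect):

```c
#include <stdatomic.h>

atomic_int x2;

/* Before the transformation: the load may read back the just-stored
   value (and with relaxed MO, other threads cannot rely on the load
   actually being performed). */
int store_then_load(int a)
{
    atomic_store_explicit(&x2, a, memory_order_relaxed);
    return atomic_load_explicit(&x2, memory_order_relaxed);
}

/* After the (valid) transformation: the load is replaced by the stored
   value. */
int store_then_forward(int a)
{
    atomic_store_explicit(&x2, a, memory_order_relaxed);
    return a;
}
```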

> But even that "obvious" optimization has subtle cases. What if the
> store is relaxed, but the load is strict? You can't do the
> optimization without a lot of thought, because dropping the strict load
> would drop an ordering point. So even the "store followed by exact
> same load" case has subtle issues.

Yes, if a compiler wants to optimize that, it has to give it more
thought.  My gut feeling is that either the store should get the
stronger ordering, or the accesses should be merged.  But I'd have to
think more about that one (which I can do on request).

> With similar caveats, it is perfectly valid to merge two consecutive
> loads, and to merge two consecutive stores.
> 
> Now that means that the sequence
> 
> atomic_store(&x, 1, mo_relaxed);
> if (atomic_load(&x, mo_relaxed) == 1)
> atomic_store(&y, 3, mo_relaxed);
> 
> can first be optimized to
> 
> atomic_store(&x, 1, mo_relaxed);
> if (1 == 1)
> atomic_store(&y, 3, mo_relaxed);
> 
> and then you get the end result that you wanted in the first place
> (including the ability to re-order the two stores due to the relaxed
> ordering, assuming they can be proven to not alias - and please don't
> use the idiotic type-based aliasing rules).
> 
> Bringing up your first example is pure and utter confusion.

Sorry if it was confusing.  But then maybe we need to talk about it
more, because it shouldn't be confusing if we agree on what the memory
model allows and what not.  I had originally picked the example because
it was related to the example Paul/Peter brought up.

> Don't do
> it. Instead, show what are obvious and valid transformations, and then
> you can bring up these kinds of combinations as "look, this is
> obviously also correct".

I have my doubts whether the best way to reason about the memory model
is by thinking about specific compiler transformations.  YM

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel  wrote:
> >
> > Which example do you have in mind here?  Haven't we resolved all the
> > debated examples, or did I miss any?
> 
> Well, Paul seems to still think that the standard possibly allows
> speculative writes or possibly value speculation in ways that break
> the hardware-guaranteed orderings.

That's true, I just didn't see any specific examples so far.

> And personally, I can't read standards paperwork. It is invariably
> written in some basically impossible-to-understand lawyeristic mode,

Yeah, it's not the most intuitive form for things like the memory model.

> and then it is read by people (compiler writers) that intentionally
> try to mis-use the words and do language-lawyering ("that depends on
> what the meaning of 'is' is").

That assumption about people working on compilers is a little too broad,
don't you think?

I think that it is important to stick to a specification, in the same
way that one wouldn't expect a compiler to, magically, make sense of a
program in cases where its behavior is undefined.

However, that of course doesn't include trying to exploit weasel-wording
(BTW, both users and compiler writers try to do it).  IMHO,
weasel-wording in a standard is a problem in itself even if not
exploited, and often it indicates that there is a real issue.  There
might be reasons to have weasel-wording (e.g., because there's no known
better way to express it like in case of the not really precise
no-out-of-thin-air rule today), but nonetheless those aren't ideal.

> The whole "lvalue vs rvalue expression
> vs 'what is a volatile access'" thing for C++ was/is a great example
> of that.

I'm not aware of the details of this.

> So quite frankly, as a result I refuse to have anything to do with the
> process directly.

That's unfortunate.  Then please work with somebody that isn't
uncomfortable with participating directly in the process.  But be
warned, it may very well be a person working on compilers :)

Have you looked at the formalization of the model by Batty et al.?  The
overview of this is prose, but the formalized model itself is all formal
relations and logic.  So there should be no language-lawyering issues
with that form.  (For me, the formalized model is much easier to reason
about.)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 12:23 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote:
> > On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote:
> > > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel  
> > > wrote:
> > > >
> > > > I think a major benefit of C11's memory model is that it gives a
> > > > *precise* specification for how a compiler is allowed to optimize.
> > > 
> > > Clearly it does *not*. This whole discussion is proof of that. It's
> > > not at all clear,
> > 
> > It might not be an easy-to-understand specification, but as far as I'm
> > aware it is precise.  The Cambridge group's formalization certainly is
> > precise.  From that, one can derive (together with the usual rules for
> > as-if etc.) what a compiler is allowed to do (assuming that the standard
> > is indeed precise).  My replies in this discussion have been based on
> > reasoning about the standard, and not secret knowledge (with the
> > exception of no-out-of-thin-air, which is required in the standard's
> > prose but not yet formalized).
> > 
> > I agree that I'm using the formalization as a kind of placeholder for
> > the standard's prose (which isn't all that easy to follow for me
> > either), but I guess there's no way around an ISO standard using prose.
> > 
> > If you see a case in which the standard isn't precise, please bring it
> > up or open a C++ CWG issue for it.
> 
> I suggest that I go through the Linux kernel's requirements for atomics
> and memory barriers and see how they map to C11 atomics.  With that done,
> we would have very specific examples to go over.  Without that done, the
> discussion won't converge very well.
> 
> Seem reasonable?

Sounds good!



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote:
> On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel  wrote:
> >
> > I think a major benefit of C11's memory model is that it gives a
> > *precise* specification for how a compiler is allowed to optimize.
> 
> Clearly it does *not*. This whole discussion is proof of that. It's
> not at all clear,

It might not be an easy-to-understand specification, but as far as I'm
aware it is precise.  The Cambridge group's formalization certainly is
precise.  From that, one can derive (together with the usual rules for
as-if etc.) what a compiler is allowed to do (assuming that the standard
is indeed precise).  My replies in this discussion have been based on
reasoning about the standard, and not secret knowledge (with the
exception of no-out-of-thin-air, which is required in the standard's
prose but not yet formalized).

I agree that I'm using the formalization as a kind of placeholder for
the standard's prose (which isn't all that easy to follow for me
either), but I guess there's no way around an ISO standard using prose.

If you see a case in which the standard isn't precise, please bring it
up or open a C++ CWG issue for it.

> and the standard apparently is at least debatably
> allowing things that shouldn't be allowed.

Which example do you have in mind here?  Haven't we resolved all the
debated examples, or did I miss any?

> It's also a whole lot more
> complicated than "volatile", so the likelihood of a compiler writer
> actually getting it right - even if the standard does - is lower.

It's not easy, that's for sure, but none of the high-performance
alternatives are easy either.  There are testing tools out there based
on the formalization of the model, and we've found bugs with them.

And the alternative of using something not specified by the standard is
even worse, I think, because then you have to guess what a compiler
might do, without having any constraints; IOW, one is resorting to "no
sane compiler would do that", and that doesn't seem to be very robust
either.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Torvald Riegel
On Mon, 2014-02-17 at 18:59 +, Joseph S. Myers wrote:
> On Sat, 15 Feb 2014, Torvald Riegel wrote:
> 
> > glibc is a counterexample that comes to mind, although it's a smaller
> > code base.  (It's currently not using C11 atomics, but transitioning
> > there makes sense, and something I want to get to eventually.)
> 
> glibc is using C11 atomics (GCC builtins rather than _Atomic / 
> , but using __atomic_* with explicitly specified memory model 
> rather than the older __sync_*) on AArch64, plus in certain cases on ARM 
> and MIPS.

I think the major steps remaining are moving the other architectures
over and rechecking concurrent code (e.g., of the code that I have
seen, it was either asm variants (e.g., on x86) or built before C11; ARM
pthread_once was lacking memory barriers (see the "pthread_once
unification" patches I posted)).  We also need to move towards using
relaxed-MO atomic loads instead of plain loads.
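A minimal sketch of that last step (illustrative code, not actual glibc source): a plain load of a variable that other threads may write concurrently is a data race and hence undefined behavior in C11; a relaxed-MO atomic load is race-free and still compiles to an ordinary load on common architectures.

```c
#include <stdatomic.h>

/* Racy if another thread writes it concurrently (undefined behavior
   under the C11 memory model). */
int flag_plain;
int read_plain(void) { return flag_plain; }

/* The race-free replacement: an atomic object read with a relaxed
   memory order. */
atomic_int flag_atomic;
int read_relaxed(void)
{
    return atomic_load_explicit(&flag_atomic, memory_order_relaxed);
}
```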





Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Torvald Riegel
On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
>  wrote:
> >
> > One way of looking at the discussion between Torvald and myself would be
> > as a seller (Torvald) and a buyer (me) haggling over the fine print in
> > a proposed contract (the standard).  Whether that makes you feel better
> > or worse about the situation I cannot say.  ;-)
> 
> Oh, I'm perfectly fine with that. But we're not desperate to buy, and
> I actually think the C11 people are - or at least should be - *way*
> more desperate to sell their model to us than we are to take it.
> 
> Which means that as a buyer you should say "this is what we want, if
> you don't give us this, we'll just walk away". Not try to see how much
> we can pay for it. Because there is very little upside for us, and
> _unless_ the C11 standard gets things right it's just extra complexity
> for us, coupled with new compiler fragility and years of code
> generation bugs.

I think there is an upside to you, mainly in that it allows compiler
testing tools, potentially verification tools for atomics, and tools
like cppmem that show allowed executions of code.

I agree that the Linux community has been working well without this, and
it's big enough to make running its own show viable.  This will be
different for smaller projects, though.

> Why would we want that extra complexity and inevitable compiler bugs?
> If we then have to fight compiler writers that point to the standard
> and say "..but look, the standard says we can do this", then at that
> point it went from "extra complexity and compiler bugs" to a whole
> 'nother level of frustration and pain.

I see your point, but flip side of the coin is that if you get the
standard to say what you want, then you can tell the compiler writers to
look at the standard.  Or show them bugs revealed by fuzz testing and
such.

> So just walk away unless the C11 standard gives us exactly what we
> want. Not "something kind of like what we'd use". EXACTLY. Because I'm
> not in the least interested in fighting compiler people that have a
> crappy standard they can point to. Been there, done that, got the
> T-shirt and learnt my lesson.
> 
> And the thing is, I suspect that the Linux kernel is the most complete
> - and most serious - user of true atomics that the C11 people can sell
> their solution to.

I agree, but there are likely also other big projects that could make
use of C11 atomics on the userspace side (e.g., certain databases, ...).

> If we don't buy it, they have no serious user.

I disagree with that.  That obviously depends on one's definition of
"serious", but if you combine all C/C++ programs that use low-level
atomics, then this is serious use as well.  There's lots of
shared-memory synchronization in userspace as well.

> Sure, they'll have lots
> of random other one-off users for their atomics, where each user wants
> one particular thing, but I suspect that we'll have the only really
> unified portable code base

glibc is a counterexample that comes to mind, although it's a smaller
code base.  (It's currently not using C11 atomics, but transitioning
there makes sense, and something I want to get to eventually.)

> that handles pretty much *all* the serious
> odd cases that the C11 atomics can actually talk about to each other.

You certainly have lots of odd cases, but I would disagree with the
assumption that only the Linux kernel will do full "testing" of the
implementations.  If you have plenty of userspace programs using the
atomics, that's a pretty big test suite, and one that should help bring
the compilers up to speed.  So that might be a benefit even to the Linux
kernel if it would use the C11 atomics.

> Oh, they'll push things through with or without us, and it will be a
> collection of random stuff, where they tried to please everybody, with
> particularly compiler/architecture people who have no f*cking clue
> about how their stuff is used pushing to make it easy/efficient for
> their particular compiler/architecture.

I'll ignore this... :)

> But we have real optimized uses of pretty much all relevant cases that
> people actually care about.

You certainly cover a lot of cases.  Finding out whether you cover all
that "people care about" would require you to actually ask all people,
which I'm sure you've done ;)

> We can walk away from them, and not really lose anything but a small
> convenience (and it's a convenience *only* if the standard gets things
> right).
> 
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I agree.

> That's why I complain when you talk about things like marking control
> dependencies explicitly. That's *us* bending over backwards. And as a
> buyer, we have absolutely z

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Torvald Riegel
On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
>  wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
> 
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.

AFAICT, it does work for you, but hasn't been exactly pain-free.

I think a major benefit of C11's memory model is that it gives a
*precise* specification for how a compiler is allowed to optimize.
There is a formalization of the model, which allows things like the
cppmem tool by the Cambridge group.  It also allows meaningful fuzz
testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did reveal
several GCC compiler bugs.

I also think that reasoning about this model is easier than reasoning
about how lots of different, concrete compiler optimizations would
interact.

> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is pure and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
> 
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
> 
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy do use that".

I agree that compilers want to optimize, and sometimes there's probably
a little too much emphasis on applying an optimization vs. not
surprising users.  But we have to draw a line (e.g., what is undefined
behavior and what is not), because we need this to actually be able to
optimize.

Therefore, we need to get the rules into a shape that both allows
optimizations and isn't full of surprising corner cases.  The rules are
the standard, so it's the standard we have to get right.  According to
my experience, a lot of thought goes into how to design the standard's
language and library so that they are intuitive yet efficient.

If you see issues in the standard, please bring them up.  Either report
the defects directly and get involved yourself, or reach out to somebody
that is participating in the standards process.

The standard certainly isn't perfect, so there is room to contribute.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Torvald Riegel
On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
>  wrote:
> >
> > Linus, Peter, any objections to marking places where we are relying on
> > ordering from control dependencies against later stores?  This approach
> > seems to me to have significant documentation benefits.
> 
> Quite frankly, I think it's stupid, and the "documentation" is not a
> benefit, it's just wrong.

I think the example is easy to misunderstand, because the context isn't
clear.  Therefore, let me first try to clarify the background.

(1) The abstract machine does not write speculatively.
(2) Emitting a branch instruction and executing a branch at runtime is
not part of the specified behavior of the abstract machine.  Of course,
the abstract machine performs conditional execution, but that just
specifies the output / side effects that it must produce (e.g., volatile
stores) -- not with which hardware instructions it is producing this.
(3) A compiled program must produce the same output as if executed by
the abstract machine.

Thus, we need to be careful what "speculative store" is meant to refer
to.  A few examples:

if (atomic_load(&x, mo_relaxed) == 1)
  atomic_store(&y, 3, mo_relaxed);

Here, the requirement is that in terms of program logic, y is assigned 3
if x equals 1.  It's not specified how an implementation does that.
* If the compiler can prove that x is always 1, then it can remove the
branch.  This is because of (2).  Because of the proof, (1) is not
violated.
* If the compiler can prove that the store to y is never observed or
does not change the program's output, the store can be removed.

if (atomic_load(&x, mo_relaxed) == 1)
  { atomic_store(&y, 3, mo_relaxed); other_a(); }
else
  { atomic_store(&y, 3, mo_relaxed); other_b(); }

Here, y will be assigned to regardless of the value of x.
* The compiler can hoist the store out of the two branches.  This is
because the store and the branch instruction aren't observable outcomes
of the abstract machine.
* The compiler can even move the store to y before the load from x
(unless this affects the logical program order of this thread in some way).
This is because the load/store are ordered by sequenced-before
(intra-thread), but mo_relaxed allows the hardware to reorder, so the
compiler can do it as well (IOW, other threads can't expect a particular
order).
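The second example can be written out as compilable C11 (a sketch with illustrative names; other_a()/other_b() stand in for the per-branch side effects mentioned in the text):

```c
#include <stdatomic.h>

atomic_int hx, hy;
int side_a, side_b;           /* record which branch's extra work ran */

void other_a(void) { side_a = 1; }
void other_b(void) { side_b = 1; }

/* As written in the source: y is stored on both paths. */
void branchy(void)
{
    if (atomic_load_explicit(&hx, memory_order_relaxed) == 1) {
        atomic_store_explicit(&hy, 3, memory_order_relaxed);
        other_a();
    } else {
        atomic_store_explicit(&hy, 3, memory_order_relaxed);
        other_b();
    }
}

/* What the compiler may validly produce: the store hoisted above the
   branch (and, because both accesses are relaxed, even above the load). */
void hoisted(void)
{
    atomic_store_explicit(&hy, 3, memory_order_relaxed);
    if (atomic_load_explicit(&hx, memory_order_relaxed) == 1)
        other_a();
    else
        other_b();
}
```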

if (atomic_load(&x, mo_acquire) == 1)
  atomic_store(&y, 3, mo_relaxed);

This is similar to the first case, but with stronger memory order.
* If the compiler proves that x is always 1, then it does so by showing
that the load will always be able to read from a particular store (or
several of them) that (all) assigned 1 to x -- as specified by the
abstract machine and taking the forward progress guarantees into
account.  In general, it still has to establish the synchronized-with
edge if any of those stores used release_mo (or other fences resulting
in the same situation), so it can't just get rid of the acquire "fence"
in this case.  (There are probably situations in which this can be done,
but I can't characterize them easily at the moment.)


These examples all rely on the abstract machine as specified in the
current standard.  In contrast, the example that Paul (and Peter, I
assume) were looking at is not currently modeled by the standard.
AFAIU, they want to exploit that control dependencies, when encountered
in binary code, can result in the hardware giving certain ordering
guarantees.

This is vaguely similar to mo_consume which is about data dependencies.
mo_consume is, partially due to how it's specified, pretty hard to
implement for compilers in a way that actually exploits and preserves
data dependencies and not just substitutes mo_consume for a stronger
memory order.

Part of this problem is that the standard takes an opt-out approach
regarding the code that should track dependencies (e.g., certain
operators are specified as not preserving them), instead of cleanly
carving out meaningful operators where one can track dependencies
without obstructing generally useful compiler optimizations (i.e.,
"opt-in").  This leads to cases such as that in "*(p + f - f)", the
compiler either has to keep f - f or emit a stronger fence if f is
originating from a mo_consume load.  Furthermore, dependencies are
supposed to be tracked across any load and store, so the compiler needs
to do points-to if it wants to optimize this as much as possible.
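The "*(p + f - f)" case can be made concrete as follows (a sketch; the names are illustrative, and in practice most current compilers implement memory_order_consume by promoting it to memory_order_acquire rather than tracking dependencies):

```c
#include <stdatomic.h>

int data[2] = { 10, 20 };
_Atomic(int *) ptr;

/* The pointer value loaded with memory_order_consume carries a
   dependency into the arithmetic and the dereference.  The expression
   is syntactically dependent on f even though f - f == 0, so the
   compiler must either keep the f - f computation or compensate with
   stronger ordering. */
int consume_deref(int f)
{
    int *p = atomic_load_explicit(&ptr, memory_order_consume);
    return *(p + f - f);
}
```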

Paul and I have been thinking about alternatives, and one of them was
doing the opt-in by demarcating code that needs explicit dependency
tracking because it wants to exploit mo_consume.

Back to HW control dependencies, this led to the idea of marking the
"control dependencies" in the source code (ie, on the abstract machine
level), that need to be preserved in the generated binary code, even if
they have no semantic meaning on the abstract machine level.  So, this
is something extra that isn't modeled in the standard curr

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Torvald Riegel
On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote:
> > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > Another option would be to flag the conditional expression, prohibiting
> > > the compiler from optimizing out any conditional branches.  Perhaps
> > > something like this:
> > > 
> > >   r1 = atomic_load(x, memory_order_control);
> > >   if (control_dependency(r1))
> > >   atomic_store(y, memory_order_relaxed);
> > 
> > That's the one I had in mind and talked to you about earlier today.  My
> > gut feeling is that this is preferably over the other because it "marks"
> > the if-statement, so the compiler knows exactly which branches matter.
> > I'm not sure one would need the other memory order for that, if indeed
> > all you want is relaxed -> branch -> relaxed.  But maybe there are
> > corner cases (see the weaker-than-relaxed discussion in SG1 today).
> 
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Let me note that, at least as far as I'm concerned, that's just a quick
idea.  At least I haven't looked at (1) how to properly specify the
semantics of this, (2) whether it has any bad effects on unrelated code,
and (3) whether there are pitfalls for compiler implementations.  It
looks not too bad at first glance, though.
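Neither memory_order_control nor control_dependency() exists in C11; both are purely hypothetical here. One conceivable approximation of the intent, using a GCC-style inline-asm barrier to make the loaded value opaque to the optimizer (so the condition cannot be proven constant and the branch elided), might look like this:

```c
#include <stdatomic.h>

/* Hypothetical approximation, GCC/Clang-specific: the empty asm with an
   input/output constraint forces the compiler to treat __v as unknown,
   pinning the branch in the generated code.  This is not a specified
   C11 construct and only mimics the proposal's surface behavior. */
#define control_dependency(x) \
    (__extension__({ __typeof__(x) __v = (x); \
                     __asm__ __volatile__("" : "+r"(__v)); \
                     __v; }))

atomic_int cx, cy;

void marked_store(void)
{
    int r1 = atomic_load_explicit(&cx, memory_order_relaxed);
    if (control_dependency(r1))
        atomic_store_explicit(&cy, 3, memory_order_relaxed);
}
```

A real memory_order_control would, as the message says, first need its semantics specified; this macro only prevents the specific branch-elision optimization discussed in the thread.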




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Torvald Riegel
On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote:
> > That depends on what your goal is.

First, I don't know why you quoted that, but without the context,
quoting it doesn't make sense.  Let me repeat the point.  The standard
is the rule set for the compiler.  Period.  The compiler does not just
serve the semantics that you might have in your head.  It does have to
do something meaningful for all of its users.  Thus, the goal for the
compiler is to properly compile programs in the language as specified.

If there is a deficiency in the standard (bug or missing feature) -- and
thus the specification, we need to have a patch for the standard that
fixes this deficiency.  If you think that this is the case, that's where
you fix it.

If your goal is to do wishful thinking, imagine some kind of semantics
in your head, and then assume that magically, implementations will do
just that, then that's bound to fail.

> A compiler that we don't need to fight in order to generate sane code
> would be nice. But as Linus said; we can continue to ignore you lot and
> go on as we've done.

I don't see why it's so hard to understand that you need to specify
semantics, and the place (or at least the base) for that is the
standard.  Aren't you guys the ones replying "send a patch"? :)

This isn't any different.  If you're uncomfortable working with the
standard, then say so, and reach out to people that aren't.

You can surely ignore the specification of the language(s) that you are
depending on.  But that won't help you.  If you want a change, get
involved.  (Oh, and claiming that the other side doesn't get it doesn't
count as getting involved.)

There's no fight between people here.  It's just a technical problem
that we have to solve in the right way.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> > 
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> > 
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> > 
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
> 
> So given something like:
> 
>   if (x)
>   y = 3;
> 
> assuming both x and y are atomic (so don't gimme crap for now knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

That depends on what your goal is.  It would be correct as far as the
standard is concerned; this makes sense if all you want is indeed a
program that does what the abstract machine might do, and produces the
same output / side effects.

If you're trying to preserve the branch in the code emitted / executed
by the implementation, then it would not be correct.  But those branches
aren't specified as being part of the observable side effects.  In the
common case, this makes sense because it enables optimizations that are
useful; this line of reasoning also allows the compiler to merge some
atomic accesses in the way that Linus would like to see it.

> Avoiding the conditional changes the result; see that control dependency
> email from earlier.

It does not, given how the standard defines "result".

> In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.

I think you're trying to solve this backwards.  You are looking at this
with an implicit wishlist of what the compiler should do (or how you
want to use the hardware), but this is not a viable specification that
one can write a compiler against.

We do need clear rules for what the compiler is allowed to do or not
(e.g., a memory model that models multi-threaded executions).  Otherwise
it's all hand-waving, and we're getting nowhere.  Thus, the way to
approach this is to propose a feature or change to the standard, make
sure that this is consistent and has no unintended side effects for
other aspects of compilation or other code, and then ask the compiler to
implement it.  IOW, we need a patch for where this all starts: in the
rules and requirements for compilation.

Paul and I are at the C++ meeting currently, and we had sessions in
which the concurrency study group talked about memory model issues like
dependency tracking and memory_order_consume.  Paul shared uses of
atomics (or likewise) in the kernel, and we discussed how the memory
model currently handles various cases and why, how one could express
other requirements consistently, and what is actually implementable in
practice.  I can't speak for Paul, but I thought those discussions were
productive.

> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.





Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
> > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> > > >  wrote:
> > > > >
> > > > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > > > in particular, it allows load/store merging.
> > > > 
> > > > I really disagree with the "will need to use volatile".
> > > > 
> > > > We should never need to use volatile (outside of whatever MMIO we do
> > > > using C) if C11 defines atomics correctly.
> > > > 
> > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > > > it's called a cache - and there's no actual reason to think that
> > > > "ACCESS_ONCE()" has to mean our current "volatile".
> > > > 
> > > > Now, it's possible that the C standards simply get atomics _wrong_, so
> > > > that they create visible semantics that are different from what a CPU
> > > > cache already does, but that's a plain bug in the standard if so.
> > > > 
> > > > But merging loads and stores is fine. And I *guarantee* it is fine,
> > > > exactly because CPU's already do it, so claiming that the compiler
> > > > couldn't do it is just insanity.
> > > 
> > > Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> > > normally get their stores pushed through the store buffer in reasonable
> > > time, and CPUs also use things like invalidations to ensure that a
> > > store is seen in reasonable time by readers.  Compilers don't always
> > > have these two properties, so we do need to be more careful of load
> > > and store merging by compilers.
> > 
> > The standard's _wording_ is a little vague about forward-progress
> > guarantees, but I believe the vast majority of the people involved do
> > want compilers to not prevent forward progress.  There is of course a
> > difference whether a compiler establishes _eventual_ forward progress in
> > the sense of after 10 years or forward progress in a small bounded
> > interval of time, but this is a QoI issue, and good compilers won't want
> > to introduce unnecessary latencies.  I believe that it is fine if the
> > standard merely talks about eventual forward progress.
> 
> The compiler will need to earn my trust on this one.  ;-)
> 
> > > > Now, there are things that are *not* fine, like speculative stores
> > > > that could be visible to other threads. Those are *bugs* (either in
> > > > the compiler or in the standard), and anybody who claims otherwise is
> > > > not worth discussing with.
> > > 
> > > And as near as I can tell, volatile semantics are required in C11 to
> > > avoid speculative stores.  I might be wrong about this, and hope that
> > > I am wrong.  But I am currently not seeing it in the current standard.
> > > (Though I expect that most compilers would avoid speculating stores,
> > > especially in the near term.)
> > 
> > This really depends on how we define speculative stores.  The memory
> > model is absolutely clear that programs have to behave as if executed by
> > the virtual machine, and that rules out speculative stores to volatiles
> > and other locations.  Under certain circumstances, there will be
> > "speculative" stores in the sense that they will happen at different
> > times as if you had a trivial implementation of the abstract machine.
> > But to be allowed to do that, the compiler has to prove that such a
> > transformation still fulfills the as-if rule.
> 
> Agreed, although the as-if rule would ignore control dependencies, since
> these are not yet part of the standard (as you in fact note below).
> I nevertheless consider myself at least somewhat reassured that current
> C11 won't speculate stores.  My remaining concerns involve the compiler
> proving to itself that a given branch is always taken, thus motivating
> it to optimize the branch away -- though this is more properly a
> control-dependency concern.
> 
> > IOW, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> >  wrote:
> > >
> > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > in particular, it allows load/store merging.
> > 
> > I really disagree with the "will need to use volatile".
> > 
> > We should never need to use volatile (outside of whatever MMIO we do
> > using C) if C11 defines atomics correctly.
> > 
> > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > it's called a cache - and there's no actual reason to think that
> > "ACCESS_ONCE()" has to mean our current "volatile".
> > 
> > Now, it's possible that the C standards simply get atomics _wrong_, so
> > that they create visible semantics that are different from what a CPU
> > cache already does, but that's a plain bug in the standard if so.
> > 
> > But merging loads and stores is fine. And I *guarantee* it is fine,
> > exactly because CPU's already do it, so claiming that the compiler
> > couldn't do it is just insanity.
> 
> Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> normally get their stores pushed through the store buffer in reasonable
> time, and CPUs also use things like invalidations to ensure that a
> store is seen in reasonable time by readers.  Compilers don't always
> have these two properties, so we do need to be more careful of load
> and store merging by compilers.

The standard's _wording_ is a little vague about forward-progress
guarantees, but I believe the vast majority of the people involved do
want compilers to not prevent forward progress.  There is of course a
difference whether a compiler establishes _eventual_ forward progress in
the sense of after 10 years or forward progress in a small bounded
interval of time, but this is a QoI issue, and good compilers won't want
to introduce unnecessary latencies.  I believe that it is fine if the
standard merely talks about eventual forward progress.

> > Now, there are things that are *not* fine, like speculative stores
> > that could be visible to other threads. Those are *bugs* (either in
> > the compiler or in the standard), and anybody who claims otherwise is
> > not worth discussing with.
> 
> And as near as I can tell, volatile semantics are required in C11 to
> avoid speculative stores.  I might be wrong about this, and hope that
> I am wrong.  But I am currently not seeing it in the current standard.
> (Though I expect that most compilers would avoid speculating stores,
> especially in the near term.)

This really depends on how we define speculative stores.  The memory
model is absolutely clear that programs have to behave as if executed by
the virtual machine, and that rules out speculative stores to volatiles
and other locations.  Under certain circumstances, there will be
"speculative" stores in the sense that they will happen at different
times as if you had a trivial implementation of the abstract machine.
But to be allowed to do that, the compiler has to prove that such a
transformation still fulfills the as-if rule.

IOW, the abstract machine is what currently defines disallowed
speculative stores.  If you want to put *further* constraints on what
implementations are allowed to do, I suppose it is best to talk about
those and see how we can add rules that allow programmers to express
those constraints.  For example, control dependencies might be such a
case.  I don't have a specific suggestion -- maybe the control
dependencies are best tackled similar to consume dependencies (even
though we don't have a good solution for those yet).  But using
volatile accesses for that seems to be a big hammer, or even the wrong
one.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do.  This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> > 
> > Btw, what is the definition of "observable" for the atomics?
> > 
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> > 
> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> > but *combining* them would be possible and good.
> > 
> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> > 
> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design.
> > 
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> > 
> > Does the standard allow for that kind of behavior?
> 
> You asked!  ;-)
> 
> So the current standard allows merging of both loads and stores, unless of
> course ordering constraints prevent the merging.  Volatile semantics may be
> used to prevent this merging, if desired, for example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that
> the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not.  But it at least seems to rather clearly indicate the
intent ;)

> The only prohibition against speculative stores that I can see is in a
> non-normative note, and it can be argued to apply only to things that are
> not atomics (1.10p22).

I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to,
and that might be observable or create data races.  While a compiler
could potentially prove that such stores aren't leading to a difference
in the behavior of the program (e.g., by proving that there are no
observers anywhere and this isn't overlapping with any volatile
locations), I think that this is hard in general and most compilers will
just not do such things.  In GCC, bugs in that category were fixed after
researchers doing fuzz-testing found them (IIRC, speculative stores by
loops).

> I don't see any prohibition against reordering
> a store to precede a load preceding a conditional branch -- which would
> not be speculative if the branch was know to be taken and the load
> hit in the store buffer.  In a system where stores could be reordered,
> some other CPU might perceive the store as happening before the load
> that controlled the conditional branch.  This needs to be addressed.

I don't know the specifics of your example, but from how I understand
it, I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will
happen anyway, and the region of code can be assumed to always run
atomically (e.g., there's no loop or such in there), then it is known
that we have one atomic region of code that will always perform the
store, so we might as well do the stuff in the region in some order.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.

No, atomics aren't an observable behavior of the abstract machine
(unless they are volatile).  See 1.8p8 (citing the C++ standard).

> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),

Agreed, unless the compiler can prove that this doesn't make a
difference in the program at hand and the atomics aren't volatile.  In
general, that will be hard and thus won't happen often, I suppose, but
if correctly proved it would fall under the as-if rule, I think.

> but *combining* them would be possible and good.

Agreed.

> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.

Agreed.  In your example, the compiler would have to prove that the
abstract machine would always be able to run the two loads atomically
(ie, as one load) without running into impossible/disallowed behavior of
the program.  But if there's no loop or branch or such in-between, this
should be straightforward because any hardware oddity or similar could
merge those loads and it wouldn't be disallowed by the standard
(considering that we're talking about a finite number of loads), so the
compiler would be allowed to do it as well.

> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design.

Agreed.  As Paul points out, this being correct assumes that there are
no other ordering guarantees or memory accesses "interfering", but if
the stores are to the same memory location and adjacent to each other in
the program, then I don't see a reason why they wouldn't be combinable.

> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.

Agreed.  That's what 1.10p24 and 1.10p25 are meant to specify for loads,
although those might not be bullet-proof as Paul points out.  Forward
progress is rather vaguely specified in the standard, but at least parts
of the committee (and people in ISO C++ SG1, in particular) are working
on trying to improve this.

> Does the standard allow for that kind of behavior?

I think the standard requires (or intends to require) the behavior that
you (and I) seem to prefer in these examples.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > > > There are also so many ways to blow your head off it's untrue.
> > > > > > > > For example, cmpxchg takes a separate memory model parameter for
> > > > > > > > failure and success, but then there are restrictions on the sets
> > > > > > > > you can use for each.  It's not hard to find well-known
> > > > > > > > memory-ordering experts shouting "Just use memory_model_seq_cst
> > > > > > > > for everything, it's too hard otherwise".  Then there's the fun
> > > > > > > > of load-consume vs load-acquire (arm64 GCC completely ignores
> > > > > > > > consume atm and optimises all of the data dependencies away) as
> > > > > > > > well as the definition of "data races", which seem to be used as
> > > > > > > > an excuse to miscompile a program at the earliest opportunity.
> > > > > > > 
> > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > > memory_order_consume until the compilers implement it both
> > > > > > > correctly and efficiently.  They are not there yet, and there is
> > > > > > > currently no shortage of compiler writers who would prefer to
> > > > > > > ignore memory_order_consume.
> > > > > > 
> > > > > > Do you have any input on
> > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular,
> > > > > > the language standard's definition of dependencies?
> > > > > 
> > > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > > 
> > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, or
> > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > > 
> > > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > > emit code for "flag - flag", but there are a number of other ways to
> > > > > make that work.
> > > > 
> > > > And that's what would concern me, considering that these requirements
> > > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > > just constrain compilers wrt. reordering across atomic accesses or
> > > > changes to the atomic accesses themselves, the dependencies are new
> > > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > > seems far more involved to me.
> > > 
> > > Well, the wording of 1.10p9 is pretty explicit on this point.
> > > There are only a few exceptions to the rule that dependencies from
> > > memory_order_consume loads must be tracked.  And to your point about
> > > requirements being placed on pieces of otherwise non-synchronizing code,
> > > we already have that with plain old load

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel  wrote:
> >
> > (a) seems to say that you don't like requiring programmers to mark
> > atomic accesses specially.  Is that the case?
> 
> In Paul's example, they were marked specially.
> 
> And you seemed to argue that Paul's example could possibly return
> anything but 0/0.

Just read my reply to Paul again.  Here's an excerpt:

> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.


IOW, I wrote that such a compiler transformation would be wrong in my
opinion.  Thus, it should *not* return 42.
If you still see how what I wrote could be misunderstood, please let me
know because I really don't see it.

Yes, I don't do so by swearing or calling others insane, and I try to see
the reasoning that those compilers' authors might have had, even if I
don't agree with it.  In my personal opinion, that's a good thing.

> If you really think that, I hope to God that you have nothing to do
> with the C standard or any actual compiler I ever use.
> 
> Because such a standard or compiler would be shit. It's sadly not too 
> uncommon.

Thanks for the kind words.  Perhaps writing that was quicker than
reading what I actually wrote.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > I wouldn't characterize the situation like this (although I can't speak
> > for others, obviously).  IMHO, it's perfectly fine on sequential /
> > non-synchronizing code, because we know the difference isn't observable
> > by a correct program.
> 
> What BS is that? If you use an "atomic_store_explicit()", by
> definition you're either
> 
>  (a) f*cking insane
>  (b) not doing sequential non-synchronizing code
> 
> and a compiler that assumes that the programmer is insane may actually
> be correct more often than not, but it's still a shit compiler.
> Agreed?

Due to all the expletives, I can't really understand what you are
saying.  Nor does what I guess it might mean seem to relate to what I
said.

(a) seems to say that you don't like requiring programmers to mark
atomic accesses specially.  Is that the case?

If so, I have to disagree.  If you're writing concurrent code, marking
the atomic accesses is the least of your problem.  Instead, for the
concurrent code I've written, it rather improved readability; it made
code more verbose, but in turn it made the synchronization easier to
see.

Besides this question of taste (and I don't care what the kernel style
guides are), there is a real technical argument here, though:  Requiring
all synchronizing memory accesses to be annotated makes the compiler
aware what is sequential code, and what is not.  Without it, one can't
exploit data-race-freedom.  So unless we want to slow down optimization
of sequential code, we need to tell the compiler what is what.  If every
access is potentially synchronizing, then this doesn't just prevent
speculative stores.

(b) seems as if you are saying that if there is a specially annotated
atomic access in the code, that this isn't sequential/non-synchronizing
code anymore.  I don't agree with that, obviously.

> So I don't see how any sane person can say that speculative writes are
> ok. They are clearly not ok.

We are discussing programs written against the C11/C++11 memory model
here.  At least that's the part that got forwarded to g...@gcc.gnu.org,
and was subject of the nearest emails in the thread.

This memory model requires programs to be data-race-free.  Thus, we can
optimize accordingly.  If you don't like that, then this isn't C11/C++11
anymore.  Which would be fine, but then complain about that
specifically.

> Speculative stores are a bad idea in general.

I disagree in the context of C11/C++11 programs.  At least from a
correctness point of view.

> They are completely
> invalid for anything that says "atomic".

I agree, as you will see when you read the other emails I posted as part
of this thread.  But those two things are separate.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its heart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until
> > > > that is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> 
> Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> proof that out-of-thin-air values cannot be observed in the face of any
> compiler optimization that refrains from reordering a prior relaxed load
> with a subsequent relaxed store.
> 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> So let me try an example.  I am sure that Torvald Riegel will jump in
> with any needed corrections or amplifications:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> But suppose that the compiler used specialization optimizations, as it
> would if there was a function that has a very lightweight implementation
> for some values and a very heavyweight one for other.  In particular,
> suppose that the lightweight implementation was for the value 42.
> Then the compiler might do something like the following:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   if (r1 == 42)
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   else
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Suddenly we have an explicit constant 42 showing up.  Of course, if
> the compiler carefully avoided speculative stores (as both Peter and
> I believe that it should if its code generation is to be regarded as
> anything other than an act of vandalism, the words in the standard
> notwithstanding), there would be no problem.  But currently, a number
> of compiler writers see absolutely nothing wrong with transforming
> the optimized-for-42 version above with something like this:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because it lets the program take a step that
the abstract machine wouldn't take.  This is different from the
sequential code that Peter posted because this example uses atomics, and
thus one can't easily assume that the difference is not observable.

For this to be correct, the compiler would actually have to prove that
the speculative store is "as-if correct", which in turn would mean that
it needs to be aware of all potential observers, and check whether those
observers aren't actually affected by the speculative store.

I would guess that the compilers you have in mind don't really do that.
If they do, then I don't see why this should be okay, unless you think
out-of-thin-air values are something good (which I wouldn't agree with).

> And then it is a short and uncontroversial step to the following:
> 
> Initial state: x == y == 0
> 
> T1:   atomic_store_explicit(42, y, memory_order_rela

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > There are also so many ways to blow your head off it's untrue. For 
> > > > > > example,
> > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > success, but
> > > > > > then there are restrictions on the sets you can use for each. It's 
> > > > > > not hard
> > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then 
> > > > > > there's
> > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > ignores consume
> > > > > > atm and optimises all of the data dependencies away) as well as the 
> > > > > > definition
> > > > > > of "data races", which seem to be used as an excuse to miscompile a 
> > > > > > program
> > > > > > at the earliest opportunity.
> > > > > 
> > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > memory_order_consume until the compilers implement it both correctly 
> > > > > and
> > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > shortage
> > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > 
> > > > Do you have any input on
> > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > language standard's definition of dependencies?
> > > 
> > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > 
> > > — B is an invocation of any specialization of std::kill_dependency 
> > > (29.3), or
> > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > logical OR (||, see 5.15) operator,
> > > or
> > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > 
> > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > after the "?" will carry a dependency, so the code fragment in 59448
> > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > of existence.  One way to do that on both ARM and Power is to actually
> > > emit code for "flag - flag", but there are a number of other ways to
> > > make that work.
> > 
> > And that's what would concern me, considering that these requirements
> > seem to be able to creep out easily.  Also, whereas the other atomics
> > just constrain compilers wrt. reordering across atomic accesses or
> > changes to the atomic accesses themselves, the dependencies are new
> > requirements on pieces of otherwise non-synchronizing code.  The latter
> > seems far more involved to me.
> 
> Well, the wording of 1.10p9 is pretty explicit on this point.
> There are only a few exceptions to the rule that dependencies from
> memory_order_consume loads must be tracked.  And to your point about
> requirements being placed on pieces of otherwise non-synchronizing code,
> we already have that with plain old load acquire and store release --
> both of these put ordering constraints that affect the surrounding
> non-synchronizing code.

I think there's a significant difference.  With acquire/release or more
general memory orders, it's true that we can't reorder _across_ the
atomic access.  However, we can reorder and optimize the surrounding
code without additional constraints, as long as we don't move it across
the atomic access.  This is not the case with the consume memory order,
as the (p + flag - flag) example shows.

> This issue got a lot of discussion, and the compromise is that
> dependencies cannot leak into or out of functions unless the relevant
> parameters or return values are annotated with [[carries_dependency]].
> This means that the compiler can see all the places where dependencies
> must be tracked.  This is described in 7.6.4.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Torvald Riegel
On Fri, 2014-02-07 at 08:50 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > Hopefully some discussion of out-of-thin-air values as well.
> > 
> > Yes, absolutely shoot store speculation in the head already. Then drive
> > a wooden stake through its heart.
> > 
> > C11/C++11 should not be allowed to claim itself a memory model until that
> > is sorted.
> 
> There actually is a proposal being put forward, but it might not make ARM
> and Power people happy because it involves adding a compare, a branch,
> and an ISB/isync after every relaxed load...  Me, I agree with you,
> much preferring the no-store-speculation approach.

My vague recollection is that everyone agrees that out-of-thin-air
values shouldn't be allowed, but that it's surprisingly complex to
actually specify this properly.

However, the example that Peter posted further down in the thread seems
to be unrelated to out-of-thin-air.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Torvald Riegel
On Fri, 2014-02-07 at 18:06 +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its heart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> > 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> This is more about the compiler speculating on stores; imagine:
> 
>   if (x)
>   y = 1;
>   else
>   y = 2;
> 
> The compiler is allowed to change that into:
> 
>   y = 2;
>   if (x)
>   y = 1;

If you write the example like that, this is indeed allowed because it's
all sequential code (and there's no volatiles in there, at least you
didn't show them :).  A store to y would happen in either case.  You
cannot observe the difference between both examples in a data-race-free
program.

Are there supposed to be atomic/non-sequential accesses in there?  If
so, please update the example.

> Which is of course a big problem when you want to rely on the ordering.
> 
> There's further problems where things like memset() can write outside
> the specified address range. Examples are memset() using single
> instructions to wipe entire cachelines and then 'restoring' the tail
> bit.

As Joseph said, this would be a bug IMO.

> While valid for single threaded, its a complete disaster for concurrent
> code.
> 
> There's more, but it all boils down to doing stores you don't expect in
> a 'sane' concurrent environment and/or don't respect the control flow.

A few of those got fixed already, because they violated the memory
model's requirements.  If you have further examples that are valid code
in the C11/C++11 model, please report them.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Torvald Riegel
On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > > > > To answer that question, you need to go and look at the definitions of
> > > > > synchronises-with, happens-before, dependency_ordered_before and a 
> > > > > whole
> > > > > pile of vaguely written waffle to realise that you don't know.
> > > > 
> > > > Are you familiar with the formalization of the C11/C++11 model by Batty
> > > > et al.?
> > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > 
> > > > They also have a nice tool that can run condensed examples and show you
> > > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > > slow for larger examples), including nice annotated graphs for those:
> > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > > 
> > > > It requires somewhat special syntax, but the following, which should be
> > > > equivalent to your example above, runs just fine:
> > > > 
> > > > int main() {
> > > >   atomic_int foo = 0; 
> > > >   atomic_int bar = 0; 
> > > >   atomic_int baz = 0; 
> > > >   {{{ {
> > > > foo.store(42, memory_order_relaxed);
> > > > bar.store(1, memory_order_seq_cst);
> > > > baz.store(42, memory_order_relaxed);
> > > >   }
> > > >   ||| {
> > > > r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > > > r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > > >   }
> > > >   }}};
> > > >   return 0; }
> > > > 
> > > > That yields 3 consistent executions for me, and likewise if the last
> > > > readsvalue() is using 42 as argument.
> > > > 
> > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > > > program can't observe != 42 for foo anymore, because the seq-cst fence
> > > > is adding a synchronizes-with edge via the baz reads-from.
> > > > 
> > > > I think this is a really neat tool, and very helpful to answer such
> > > > questions as in your example.
> > > 
> > > Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> > > your substitution of store() for fetch_add() is correct.  Then this shows
> > > that we cannot substitute fetch_add() for atomic_add_return().
> > 
> > It should be in this example, I believe.
> 
> You lost me on this one.

I mean that in this example, substituting fetch_add() with store()
should not change meaning, given that what the fetch_add reads-from
seems irrelevant.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > There are also so many ways to blow your head off it's untrue. For 
> > > > example,
> > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > success, but
> > > > then there are restrictions on the sets you can use for each. It's not 
> > > > hard
> > > > to find well-known memory-ordering experts shouting "Just use
> > > > memory_model_seq_cst for everything, it's too hard otherwise". Then 
> > > > there's
> > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores 
> > > > consume
> > > > atm and optimises all of the data dependencies away) as well as the 
> > > > definition
> > > > of "data races", which seem to be used as an excuse to miscompile a 
> > > > program
> > > > at the earliest opportunity.
> > > 
> > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > memory_order_consume until the compilers implement it both correctly and
> > > efficiently.  They are not there yet, and there is currently no shortage
> > > of compiler writers who would prefer to ignore memory_order_consume.
> > 
> > Do you have any input on
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > language standard's definition of dependencies?
> 
> Let's see...  1.10p9 says that a dependency must be carried unless:
> 
> — B is an invocation of any specialization of std::kill_dependency (29.3), or
> — A is the left operand of a built-in logical AND (&&, see 5.14) or logical 
> OR (||, see 5.15) operator,
> or
> — A is the left operand of a conditional (?:, see 5.16) operator, or
> — A is the left operand of the built-in comma (,) operator (5.18);
> 
> So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> after the "?" will carry a dependency, so the code fragment in 59448
> needs to do the ordering rather than just optimizing "flag - flag" out
> of existence.  One way to do that on both ARM and Power is to actually
> emit code for "flag - flag", but there are a number of other ways to
> make that work.

And that's what would concern me, considering that these requirements
seem to be able to creep out easily.  Also, whereas the other atomics
just constrain compilers wrt. reordering across atomic accesses or
changes to the atomic accesses themselves, the dependencies are new
requirements on pieces of otherwise non-synchronizing code.  The latter
seems far more involved to me.

> BTW, there is some discussion on 1.10p9's handling of && and ||, and
> that clause is likely to change.  And yes, I am behind on analyzing
> usage in the Linux kernel to find out if Linux cares...

Do you have any pointers to these discussions (e.g., LWG issues)?

> > > And rcu_dereference() will need per-arch overrides for some time during
> > > any transition to memory_order_consume.
> > > 
> > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > non-coherent agents) into this mess is going to be an uphill battle 
> > > > IMHO. I'd
> > > > just rather stick to the semantics we have and the asm volatile 
> > > > barriers.
> > > 
> > > And barrier() isn't going to go away any time soon, either.  And
> > > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > > memory_order_whatever that prevents loads and stores from being coalesced.
> > 
> > I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> > been discussed in the past already?).  But it needs to have a paper I
> > suppose.
> 
> The current position of the usual suspects other than me is that this
> falls into the category of forward-progress guarantees, which are
> considered (again, by the usual suspects other than me) to be out
> of scope.

But I think we need to better describe forward progress, even though
that might be tricky.  We made at least some progress on
http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even
though we can't constrain the OS schedulers too much, and for lock-free
we're in this weird position that on most general-purpose schedulers and
machines, obstruction-free algorithms are likely 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 22:13 +, Joseph S. Myers wrote:
> On Thu, 6 Feb 2014, Torvald Riegel wrote:
> 
> > > It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures.
> > 
> > There's certainly different ways to implement the memory model and those
> > have to be specified elsewhere, but I don't see how this differs much
> > from other things specified in the ABI(s) for each architecture.
> 
> It is not clear to me that there is any good consensus understanding of 
> how to specify this in an ABI, or how, given an ABI, to determine whether 
> an implementation is valid.
> 
> For ABIs not considering atomics / concurrency, it's well understood, for 
> example, that the ABI specifies observable conditions at function call 
> boundaries ("these arguments are in these registers", "the stack pointer 
> has this alignment", "on return from the function, the values of these 
> registers are unchanged from the values they had on entry").  It may 
> sometimes specify things at other times (e.g. presence or absence of a red 
> zone - whether memory beyond the stack pointer may be overwritten on an 
> interrupt).  But if it gives a code sequence, it's clear this is just an 
> example rather than a requirement for particular code to be generated - 
> any code sequence suffices if it meets the observable conditions at the 
> points where code generated by one implementation may pass control to code 
> generated by another implementation.
> 
> When atomics are involved, you no longer have a limited set of 
> well-defined points where control may pass from code generated by one 
> implementation to code generated by another - the code generated by the 
> two may be running concurrently.

Agreed.

> We know of certain cases 
> <http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html> where there are 
> choices of the mapping of atomic operations to particular instructions.  
> But I'm not sure there's much evidence that these are the only ABI issues 
> arising from concurrency - that there aren't any other ways in which an 
> implementation may transform code, consistent with the as-if rule of ISO 
> C, that may run into incompatibilities of different choices.

I can't think of other incompatibilities with high likelihood --
provided we ignore consume memory order and the handling of dependencies
(see below).  But I would doubt there is a high risk of such because (1)
the data race definition should hopefully not cause subtle
incompatibilities and (2) there is a clear "hand-off point" from the
compiler to a specific instruction representing an atomic access.  For
example, if we have a release store, and we agree on the instructions
used for that, then compilers will have to ensure happens-before for
anything before the release store; for example, as long as stores
sequenced-before the release store are performed, it doesn't matter in
which order that happens.  Subsequently, an acquire-load somewhere else
can pick this sequence of events up just by using the agreed-upon
acquire-load; like with the stores, it can order subsequent loads in any
way it sees fit, including different optimizations.  That's obviously
not a formal proof, though.  But it seems likely to me, at least :)

I'm more concerned about consume and dependencies because as far as I
understand the standard's requirements, dependencies need to be tracked
across function calls.  Thus, we might have several compilers involved
in that, and we can't just "condense" things to happens-before; what
matters is instead how, and whether, we keep dependencies intact.  Because of that, I'm
wondering whether this is actually implementable practically.  (See
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10)

> And even if 
> those are the only issues, it's not clear there are well-understood ways 
> to define the mapping from the C11 memory model to the architecture's 
> model, which provide a good way to reason about whether a particular 
> choice of instructions is valid according to the mapping.

I think that if we have different options, there needs to be agreement
on which to choose across the compilers, at the very least.  I don't
quite know what this looks like for GCC and LLVM, for example.

> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> These discuss, as well as the model itself, proving the validity of a 
> particular choice of x86 instructions.  I imagine that might be a starting 
> point

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > > On 02/06/14 18:25, David Howells wrote:
> > > > >
> > > > > Is it worth considering a move towards using C11 atomics and barriers 
> > > > > and
> > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be 
> > > > > able to do
> > > > > these.
> > > > 
> > > > 
> > > > It sounds interesting to me, if we can make it work properly and 
> > > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> > > 
> > > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > > really think this is a bad idea for the kernel.
> > 
> > I'm not going to comment on what's best for the kernel (simply because I
> > don't work on it), but I disagree with several of your statements.
> > 
> > > It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures.
> > 
> > There's certainly different ways to implement the memory model and those
> > have to be specified elsewhere, but I don't see how this differs much
> > from other things specified in the ABI(s) for each architecture.
> > 
> > > For example, should
> > > the following code fire the assert?
> > 
> > I don't see how your example (which is about what the language requires
> > or not) relates to the statement about the mapping above?
> > 
> > > 
> > > extern atomic foo, bar, baz;
> > > 
> > > void thread1(void)
> > > {
> > >   foo.store(42, memory_order_relaxed);
> > >   bar.fetch_add(1, memory_order_seq_cst);
> > >   baz.store(42, memory_order_relaxed);
> > > }
> > > 
> > > void thread2(void)
> > > {
> > >   while (baz.load(memory_order_seq_cst) != 42) {
> > >   /* do nothing */
> > >   }
> > > 
> > >   assert(foo.load(memory_order_seq_cst) == 42);
> > > }
> > > 
> > 
> > It's a good example.  My first gut feeling was that the assertion should
> > never fire, but that was wrong because (as I seem to usually forget) the
> > seq-cst total order is just a constraint but doesn't itself contribute
> > to synchronizes-with -- but this is different for seq-cst fences.
> 
> From what I can see, Will's point is that mapping the Linux kernel's
> atomic_add_return() primitive into fetch_add() does not work because
> atomic_add_return()'s ordering properties require that the assert()
> never fire.
> 
> Augmenting the fetch_add() with a seq_cst fence would work on many
> architectures, but not for all similar examples.  The reason is that
> the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
> Power's sync.  To your point, I believe that it would make the above
> example work, but there are some IRIW-like examples that would fail
> according to the standard (though a number of specific implementations
> would in fact work correctly).

Thanks for the background.  I don't read LKML, and it wasn't obvious
from reading just the part of the thread posted to gcc@ that achieving
these semantics is the goal.

> > > To answer that question, you need to go and look at the definitions of
> > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > pile of vaguely written waffle to realise that you don't know.
> > 
> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > 
> > They also have a nice tool that can run condensed examples and show you
> > all allowed (and forbidden) executions (it runs in the browser, so is
> > slow for larger examples), including nice annotated graphs for those:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > 
> > It requires somewhat special syntax, but the following, which should be
> > equivalent to your example above, runs just fine:
> > 
> > int main() {
> >   atomic_int foo = 0; 
> >   atomic_int bar = 0; 
> >   atomic_int baz = 0; 
> >   {{{ {
> > foo.store(42, memory_order_rel

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > On 02/06/14 18:25, David Howells wrote:
> > > >
> > > > Is it worth considering a move towards using C11 atomics and barriers 
> > > > and
> > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able 
> > > > to do
> > > > these.
> > > 
> > > 
> > > It sounds interesting to me, if we can make it work properly and 
> > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> > 
> > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > really think this is a bad idea for the kernel. It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures. For example, should
> > the following code fire the assert?
> > 
> > 
> > extern atomic foo, bar, baz;
> > 
> > void thread1(void)
> > {
> > foo.store(42, memory_order_relaxed);
> > bar.fetch_add(1, memory_order_seq_cst);
> > baz.store(42, memory_order_relaxed);
> > }
> > 
> > void thread2(void)
> > {
> > while (baz.load(memory_order_seq_cst) != 42) {
> > /* do nothing */
> > }
> > 
> > assert(foo.load(memory_order_seq_cst) == 42);
> > }
> > 
> > 
> > To answer that question, you need to go and look at the definitions of
> > synchronises-with, happens-before, dependency_ordered_before and a whole
> > pile of vaguely written waffle to realise that you don't know. Certainly,
> > the code that arm64 GCC currently spits out would allow the assertion to 
> > fire
> > on some microarchitectures.
> 
> Yep!  I believe that a memory_order_seq_cst fence in combination with the
> fetch_add() would do the trick on many architectures, however.  All of
> this is one reason that any C11 definitions need to be individually
> overridable by individual architectures.

"Overridable" in which sense?  Do you want to change the semantics on
the language level in the sense of altering the memory model, or rather
use a different implementation under the hood to, for example, fix
deficiencies in the compilers?

> > There are also so many ways to blow your head off it's untrue. For example,
> > cmpxchg takes a separate memory model parameter for failure and success, but
> > then there are restrictions on the sets you can use for each. It's not hard
> > to find well-known memory-ordering experts shouting "Just use
> > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > the fun of load-consume vs load-acquire (arm64 GCC completely ignores 
> > consume
> > atm and optimises all of the data dependencies away) as well as the 
> > definition
> > of "data races", which seem to be used as an excuse to miscompile a program
> > at the earliest opportunity.
> 
> Trust me, rcu_dereference() is not going to be defined in terms of
> memory_order_consume until the compilers implement it both correctly and
> efficiently.  They are not there yet, and there is currently no shortage
> of compiler writers who would prefer to ignore memory_order_consume.

Do you have any input on
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
language standard's definition of dependencies?

> And rcu_dereference() will need per-arch overrides for some time during
> any transition to memory_order_consume.
> 
> > Trying to introduce system concepts (writes to devices, interrupts,
> > non-coherent agents) into this mess is going to be an uphill battle IMHO. 
> > I'd
> > just rather stick to the semantics we have and the asm volatile barriers.
> 
> And barrier() isn't going to go away any time soon, either.  And
> ACCESS_ONCE() needs to keep volatile semantics until there is some
> memory_order_whatever that prevents loads and stores from being coalesced.

I'd be happy to discuss something like this in ISO C++ SG1 (or has this
been discussed in the past already?).  But it needs to have a paper I
suppose.

Will you be in Issaquah for the C++ meeting next week?




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able 
> > > to do
> > > these.
> > 
> > 
> > It sounds interesting to me, if we can make it work properly and 
> > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> 
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel.

I'm not going to comment on what's best for the kernel (simply because I
don't work on it), but I disagree with several of your statements.

> It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures.

There's certainly different ways to implement the memory model and those
have to be specified elsewhere, but I don't see how this differs much
from other things specified in the ABI(s) for each architecture.

> For example, should
> the following code fire the assert?

I don't see how your example (which is about what the language requires
or not) relates to the statement about the mapping above?

> 
> extern atomic<int> foo, bar, baz;
> 
> void thread1(void)
> {
>   foo.store(42, memory_order_relaxed);
>   bar.fetch_add(1, memory_order_seq_cst);
>   baz.store(42, memory_order_relaxed);
> }
> 
> void thread2(void)
> {
>   while (baz.load(memory_order_seq_cst) != 42) {
>   /* do nothing */
>   }
> 
>   assert(foo.load(memory_order_seq_cst) == 42);
> }
> 

It's a good example.  My first gut feeling was that the assertion should
never fire, but that was wrong: as I seem to usually forget, the seq-cst
total order only constrains the order of seq-cst operations among
themselves and doesn't itself contribute to synchronizes-with -- but
this is different for seq-cst fences.

> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know.

Are you familiar with the formalization of the C11/C++11 model by Batty
et al.?
http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

They also have a nice tool that can run condensed examples and show you
all allowed (and forbidden) executions (it runs in the browser, so is
slow for larger examples), including nice annotated graphs for those:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

It requires somewhat special syntax, but the following, which should be
equivalent to your example above, runs just fine:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  atomic_int baz = 0; 
  {{{ {
foo.store(42, memory_order_relaxed);
bar.store(1, memory_order_seq_cst);
baz.store(42, memory_order_relaxed);
  }
  ||| {
r1=baz.load(memory_order_seq_cst).readsvalue(42);
r2=foo.load(memory_order_seq_cst).readsvalue(0);
  }
  }}};
  return 0; }

That yields 3 consistent executions for me, and likewise if the last
readsvalue() uses 42 as its argument.

If you add a "fence(memory_order_seq_cst);" after the store to foo, the
program can't observe != 42 for foo anymore, because the seq-cst fence
is adding a synchronizes-with edge via the baz reads-from.
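To make that concrete outside the cppmem tool, here is a sketch of the
fence variant in plain C11 with pthreads; the helper name run_once() is
mine, and the example just mirrors your original two threads with the
seq-cst fence inserted after the relaxed store to foo:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

static atomic_int foo, bar, baz;

static void *writer(void *unused)
{
    (void)unused;
    atomic_store_explicit(&foo, 42, memory_order_relaxed);
    /* The added seq-cst fence: together with the reader's seq-cst load of
       baz, it forbids the execution in which foo is still observed as 0. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add_explicit(&bar, 1, memory_order_seq_cst);
    atomic_store_explicit(&baz, 42, memory_order_relaxed);
    return NULL;
}

static void *reader(void *out)
{
    while (atomic_load_explicit(&baz, memory_order_seq_cst) != 42)
        ; /* spin until the writer's store to baz is visible */
    *(int *)out = atomic_load_explicit(&foo, memory_order_seq_cst);
    return NULL;
}

/* Runs one instance of the race and returns the value the reader saw in foo. */
int run_once(void)
{
    int seen = -1;
    pthread_t t1, t2;
    atomic_store(&foo, 0);
    atomic_store(&bar, 0);
    atomic_store(&baz, 0);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, &seen);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

With the fence in place, run_once() can only return 42; remove the
fence and the formal model (per the cppmem output above) allows the
reader to see 0.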

I think this is a really neat tool, and very helpful to answer such
questions as in your example.

> Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.
> 
> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each.

That's in there for the architectures without a single-instruction
CAS/cmpxchg, I believe.
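For reference, the shape of the two-order interface looks like this in
C11 (the wrapper name try_set is mine); the restrictions you mention are
that the failure order must not be memory_order_release or
memory_order_acq_rel, and, as originally specified, must be no stronger
than the success order:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The success order applies to the read-modify-write performed when the
   comparison succeeds; the failure order applies to the plain load that
   happens when it does not. */
bool try_set(atomic_int *obj, int expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        obj, &expected, desired,
        memory_order_acq_rel,    /* used on success */
        memory_order_acquire);   /* used on failure */
}
```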

> It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_model_seq_cst for everything, it's too hard otherwise".

Everyone I've heard say this meant it as advice to people who are new to
synchronization or deal with it only infrequently.  The advice is the
simple and safe fallback, and I don't think it's meant as an
acknowledgment that the model itself is too hard.  If the
language's memory model is supposed to represent weak HW memory models
to at least some extent, there's only so much you can do in terms of
keeping it simple.  If all architectures had x86-like models, the
language's model would certainly be simpler... :)
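The classic illustration of why seq-cst is the safe fallback is store
buffering; a sketch (the helper name saw_forbidden_outcome is mine):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

static atomic_int x, y;
static int r1, r2;

static void *thread_a(void *unused)
{
    (void)unused;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

static void *thread_b(void *unused)
{
    (void)unused;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

/* Returns 1 iff the outcome r1 == 0 && r2 == 0 was observed.  Under
   seq-cst this outcome is forbidden; weaken the four operations to
   acquire/release or relaxed and it becomes allowed (and shows up
   readily on real x86 and ARM hardware). */
int saw_forbidden_outcome(void)
{
    pthread_t a, b;
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return r1 == 0 && r2 == 0;
}
```

With seq-cst everywhere the programmer doesn't need to reason about
which of the weaker orders would still rule out the bad outcome, which
is exactly why it's the recommended default for non-experts.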

> Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away)

AFAIK consume memory order was added to model Power/ARM-specific
behavior.  I agree that the way the standard specifies how dependencies
are to be p